This week, we finished up our 2009 Q1 release of the Intel driver. Most of the effort for this quarter has been to stabilize the recent work, focusing on serious bugs and testing as many combinations as we could manage.
For the last year or so, we’ve been busy rebuilding the driver, adding new ways of managing memory, setting modes and communicating between user space and the kernel. Because all of these changes cross multiple projects (X/Mesa/Linux), we’ve tried to make sure we supported all of the possible combinations. Let’s see what options we’ve got:
Mode Setting
User mode. The entire output side of the driver stack is in user mode; all of the output detection, monitor detection, EDID parsing etc. This has some significant limitations, the worst of which is that the kernel has no idea what’s going on, so you cannot show any kernel messages unless the X server relinquishes control of the display. In particular, panic messages are lost to the user. If the X server crashes, the user gets to reboot the machine. A more subtle limitation is that the driver couldn’t handle interrupts, so there wasn’t any hot-plug monitor support. That’s becoming increasingly important as people want hot-plug projector support, and as systems start including DisplayPort, which requires driver intervention when the video cable gets kicked out of the machine.
Kernel mode. All of that code moves into the kernel, where it is exposed both as a part of the DRI interface and also through the frame buffer APIs for use by the fb console or any other frame buffer applications. Lots of benefits here, but the development environment is entirely different from user mode, and so porting the code is a fair bit of work. Dave Airlie has some pie-in-the-sky ideas about making the kernel mode setting code run in user mode by recompiling it with suitable user-mode emulation of the necessary kernel APIs.
Direct Rendering
None. In this mode, the system doesn’t support direct rendering at all, so all rendering must go through the X server. GL calls are generally implemented as a software rasterizer inside the X server.
DRI1. In this mode, applications share access to a single set of front/back/depth/stencil buffers. They must carefully ensure that all of their drawing operations are clipped to the subset of each buffer that their window occupies. Each application performs their own buffer swapping by copying contents from suitable regions of each buffer. Synchronization with the X server is done through signals and a shared memory buffer. While any application is executing, all other applications are locked out of the hardware, even if they wouldn’t be conflicting.
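The clipping requirement above can be illustrated with a small sketch. This is not code from any driver; the `Rect` type and the `clip_draw` helper are hypothetical names chosen for illustration, and rectangles are assumed to be half-open. It shows one drawing request being split into the pieces that fall inside a window's visible clip rectangles, which is roughly the per-rectangle loop a DRI1 client has to run.

```python
from typing import List, NamedTuple, Optional


class Rect(NamedTuple):
    # Half-open rectangle: covers x1 <= x < x2 and y1 <= y < y2.
    x1: int
    y1: int
    x2: int
    y2: int


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the overlap of two rectangles, or None if they do not overlap."""
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    x2, y2 = min(a.x2, b.x2), min(a.y2, b.y2)
    if x1 >= x2 or y1 >= y2:
        return None
    return Rect(x1, y1, x2, y2)


def clip_draw(draw: Rect, clip_rects: List[Rect]) -> List[Rect]:
    """Split one drawing request into the pieces that lie inside the
    window's visible clip rectangles (hypothetical helper)."""
    pieces = []
    for clip in clip_rects:
        piece = intersect(draw, clip)
        if piece is not None:
            pieces.append(piece)
    return pieces


# A window at (100,100)-(300,300) whose lower-right corner is covered by
# another window, leaving two visible rectangles.
visible = [Rect(100, 100, 300, 200), Rect(100, 200, 200, 300)]
print(clip_draw(Rect(150, 150, 250, 250), visible))
```

With private back buffers, as in DRI2 below, this loop drops out of the main rendering path.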
DRI2. This gives each application private back/depth/stencil buffers; they draw without taking any locks as the kernel mediates access to each object. The real front buffer for each window is owned by the window system, and so requests to draw into it are directed through the window system API, using the DRI2 extension in the case of the X window system. When applications ask to draw to the ‘front’ buffer, they get a fake buffer allocated, which operates almost exactly like the back buffer, except that copies to the ‘real’ front buffer are automatically performed at suitable synchronization points.
Memory Management
X server + Old-style DRI. In this mode, the X server asks for a fixed amount of memory which is then permanently bound to the graphics aperture and treated like the memory on a discrete graphics card. The X server allocates pixmaps from this fixed pool; when that runs out, it uses regular virtual memory. It may move objects back and forth by copying them between the aperture and virtual memory.
This mode also supports direct rendering. While the direct rendering application holds the DRI1 lock (remember that from above?), it has exclusive access to an area within the aperture which is granted to it by the X server. Pages for this area are statically allocated (by the X server). Whenever the application loses the DRI1 lock, any or all of the data stored in those pages may be kicked out, so the application must be prepared to lose the data without notice. For data like textures and vertex lists, which are generated by the application and not (generally) written by the GPU, this works fairly well; the application has a copy of the data already and can re-upload it should it disappear.
GEM. Here, no pages are statically allocated for exclusive use by the graphics system. Instead, individual objects (“buffer objects”, or “bo’s”) are allocated as chunks of pages from regular virtual memory. When not in use, these objects can be paged out. Furthermore, applications aren’t limited to the graphics aperture space; they allocate from the system pool of virtual memory instead. As objects are used by applications, the kernel dynamically maps them into the graphics aperture.
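As a rough illustration of mapping objects into the aperture on demand, here is a toy model. It is only a sketch: the `ToyAperture` class, its method names, and the least-recently-used eviction order are all assumptions made for this example, not a description of what the kernel actually does. The point is just that objects live in ordinary memory and occupy aperture space only while they are in use.

```python
from collections import OrderedDict


class ToyAperture:
    """Toy model of a fixed-size aperture with objects bound on demand.

    The names and the least-recently-used eviction order are assumptions
    of this sketch, not the kernel's actual policy.
    """

    def __init__(self, total_pages: int) -> None:
        self.total_pages = total_pages
        self.bound = OrderedDict()  # object name -> size in pages, oldest first

    def used_pages(self) -> int:
        return sum(self.bound.values())

    def use(self, name: str, pages: int) -> list:
        """Bind an object for use; return the names unbound to make room."""
        if pages > self.total_pages:
            raise ValueError("object is larger than the aperture")
        evicted = []
        if name in self.bound:
            self.bound.move_to_end(name)  # already mapped: just mark as recent
            return evicted
        while self.used_pages() + pages > self.total_pages:
            victim, _ = self.bound.popitem(last=False)  # unbind the oldest object
            evicted.append(victim)
        self.bound[name] = pages
        return evicted


aperture = ToyAperture(total_pages=8)
aperture.use("front", 4)
aperture.use("texture", 3)
aperture.use("front", 4)           # touch again so it is the most recent
print(aperture.use("pixmap", 3))   # needs room, so "texture" is unbound
print(list(aperture.bound))
```

In this model an unbound object is not lost; it simply stops occupying aperture space until it is used again.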
2D Acceleration
None. The X server has a complete software 2D rendering system (fb), and if the driver doesn’t provide any accelerated drawing mechanism, the X server can use that software stack to provide all of the necessary drawing operations. While it seems like this would be terribly slow, in reality, it’s not that bad as long as the target rendering surface is present in cached memory, and not accessed through a device in write-combining or uncached modes.
XAA. This is the old XFree86 rendering architecture, and heavily focuses on ‘classic’ X drawing operations, including zero-width lines, core text and even core wide lines and arcs. It does not support accelerated drawing to anything other than the screen or pixmaps which precisely match the screen pixel format. Pixmaps are allocated from a subset of the frame buffer, as if they were actually on the screen. This causes huge problems for chips which have limited 2D addressing abilities (like Intel 8xx-945 and older ATI chips) as they cannot use any memory beyond a single 2D allocation.
EXA. This code was lifted from kdrive, where it was designed as a minimal graphics acceleration architecture for embedded X servers. In that original design, all pixmaps were allocated in system memory as the target systems had essentially no off-screen memory available to the graphics accelerator. In addition, with the goal of bringing up an X server quickly on simple hardware, the only accelerated operations were 2D solid fills and 2D blits. However, one key feature of that code was that it provided a uniform API for drawing with arbitrary pixel formats. As that code was moved into the core X server, it was changed so that pixmaps could be allocated from graphics memory. In addition, acceleration for the Render extension was added so that modern applications could get reasonable performance for anti-aliased text and composited images. However, it can only accelerate rendering to objects stored in graphics memory, and that memory must be pre-allocated by the X server (see the Memory Management section above). Once you run out of that memory, the X server is stuck trying to figure out what to do. This single issue has been the focus of EXA development for the last couple of years — when to move data between virtual memory and graphics memory. Objects in graphics memory are drawn fastest with the GPU and objects in virtual memory can only be drawn with the CPU. The key problem here is that reading data from graphics memory is horribly expensive, so the cost of moving an object from graphics memory to virtual memory is high. When everything is in the right memory space, EXA runs fast. When you start thrashing things around, EXA runs slow.
UXA. Assume your GPU can draw to arbitrary memory. Now assume that EXA’s basic drawing operations are sound, and do a reasonable job of supporting 2D applications (as long as they fit within graphics memory). UXA comes from the combination of these two assumptions — GEM provides the first and the EXA drawing code provides the second. UXA doesn’t need any of the (ugly) pixmap ‘migration’ code because pixmaps never move — they stay in their own little set of pages and the GEM code maps them in and out of the aperture as needed. So, UXA and EXA are not far apart in style or substance, UXA simply skips the parts of EXA which are not necessary in a GEM world.
Pick One From Each Column
Now, many of the above choices can be made independently — you can use User mode setting with DRI1, classic memory management and XAA. Or you can select Kernel mode setting with DRI1, GEM and EXA. With 2 × 3 × 2 × 4 = 48 combinations, you can imagine that:
- Some of them can’t work together
- Some of them haven’t been tested
- Some of them haven’t been tuned for performance
- Some work well on i915, and poorly on 965GM
- Others work well on 965GM and poorly on 855
- None of them (yet) work perfectly well everywhere
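The count of 48 can be checked by enumerating the columns directly. The short sketch below uses the option names from the sections above. The single filter it applies, that UXA relies on GEM, is inferred from the UXA description; every other incompatibility is left unmodelled, so the filtered number is illustrative only.

```python
from itertools import product

# Option names are taken from the sections above.
mode_setting = ["user", "kernel"]
direct_rendering = ["none", "DRI1", "DRI2"]
memory_management = ["X server", "GEM"]
acceleration = ["none", "XAA", "EXA", "UXA"]

combinations = list(product(mode_setting, direct_rendering,
                            memory_management, acceleration))
print(len(combinations))  # 2 * 3 * 2 * 4

# Illustrative constraint only: drop UXA when GEM is not in use.
plausible = [c for c in combinations
             if not (c[3] == "UXA" and c[2] != "GEM")]
print(len(plausible))
```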
Two years ago, you had a lot fewer choices: only user mode setting, none or DRI1 direct rendering, only X server memory management, and only none, XAA or EXA acceleration (1 × 2 × 1 × 3 = 6 choices). Even then, choosing between XAA and EXA was quite contentious: EXA would thrash memory badly, while XAA would effectively disable acceleration for pixmaps as soon as it ran out of its (tiny) off-screen space.
In moving towards our eventual goal of a KMS/GEM/DRI2 world, we’ve felt obligated to avoid removing options until that goal worked best for as many people as possible. So, instead of forcing people to switch to brand new code that hasn’t been entirely stable or fast, we’ve tried to make sure that each release of the driver has at least continued to work with the older options.
However, some of the changes we’ve made have caused performance regressions in these older options, which doesn’t exactly make people happy — the old code runs slow, and the new code isn’t quite ready for prime time in all situations. One option here would be to stop shipping code and sit around working on the ‘perfect’ driver, to be released soon after the heat-death of the universe.
Instead, we decided (without much discussion, I’ll have to admit) to keep shipping stuff, make it work as well as we knew how, and engage the community in helping us make this fairly significant transition to our new world order. We did, however, make a very conscious choice to push out new code quickly — getting exposure to real users is often the best way to make sure you’re not making terrible mistakes in the design. The thinking was that users could always switch back to the ‘old’ code if the new code caused problems. Of course, sometimes that ‘old’ code saw fairly significant changes while the new code was integrated…
You can imagine that our internal testing people haven’t been entirely happy with this plan either — our count of bugs has been far too high for far too long, and while we spent the last three months doing nothing but fixing things, it’s still a lot higher than I’d like to see.
Only a few things in the above lists have obvious performance implications — choose XAA and your performance for modern applications will suffer as it offers no acceleration for the Render extension. So, why does switching from EXA to UXA change the performance characteristics of the X server so much? The simple answer is that UXA, GEM and KMS haven’t been tweaked on every platform yet.
For example, hardware rendering performance is affected by how memory is accessed by the drawing engine. There are two ways of mapping pixels, “linear” and “tiled”. In linear mode, pixels are stored in sequential addresses all the way across each scanline; subsequent scanlines are at ever higher addresses. A simple plan, and all of the software rendering code in the X server assumes this model. In tiled mode, rectangular chunks of the screen are stored in adjacent areas in memory; a block of 128x8 pixels forms an ‘X tile’ in the Intel hardware. Drawing to vertically adjacent pixels in this mode means touching the same page, reducing PTE thrashing compared with linear mode. For systems with a limited number of PTEs and limited caches inside the graphics hardware, tiled mode offers tremendous performance improvements. However, getting everything lined up to hit tiled mode is a pain, and on some hardware, in some configurations, it doesn’t happen, so you see a huge drop in performance.
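To make the page-locality argument concrete, here is a minimal sketch of the two address calculations. Several things in it are assumptions made for illustration rather than facts from the hardware documentation: 4 bytes per pixel (so that a 128x8 tile is exactly 4096 bytes), 4 KiB pages, tiles stored left to right and top to bottom, a surface width that is a multiple of the tile width, and no address swizzling. The function names are hypothetical.

```python
BYTES_PER_PIXEL = 4          # assumption: 32 bits per pixel
TILE_WIDTH = 128             # pixels, from the text
TILE_HEIGHT = 8              # pixels, from the text
TILE_BYTES = TILE_WIDTH * TILE_HEIGHT * BYTES_PER_PIXEL   # 4096
PAGE_SIZE = 4096             # assumption: 4 KiB pages


def linear_offset(x: int, y: int, width: int) -> int:
    """Byte offset of pixel (x, y) when scanlines are stored end to end."""
    return (y * width + x) * BYTES_PER_PIXEL


def tiled_offset(x: int, y: int, width: int) -> int:
    """Byte offset of pixel (x, y) when the surface is split into tiles.

    Assumes width is a multiple of TILE_WIDTH and tiles are stored left to
    right, top to bottom, with no address swizzling.
    """
    tiles_per_row = width // TILE_WIDTH
    tile_index = (y // TILE_HEIGHT) * tiles_per_row + (x // TILE_WIDTH)
    within_tile = (y % TILE_HEIGHT) * TILE_WIDTH + (x % TILE_WIDTH)
    return tile_index * TILE_BYTES + within_tile * BYTES_PER_PIXEL


def pages_touched(offset_fn, width: int, x: int, rows: int) -> int:
    """Count distinct pages hit by a vertical run of pixels at column x."""
    return len({offset_fn(x, y, width) // PAGE_SIZE for y in range(rows)})


WIDTH = 1024  # pixels; one scanline is exactly one page at 4 bytes per pixel
print(pages_touched(linear_offset, WIDTH, x=10, rows=8))  # one page per row
print(pages_touched(tiled_offset, WIDTH, x=10, rows=8))   # all rows share a page
```

Under these assumptions an eight-pixel vertical line touches eight pages in linear mode and a single page in tiled mode, which is the effect described above.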
Similarly, mapping pages in and out of the GTT sometimes requires that the contents be flushed from CPU or GPU caches. Now, GPU cache flushing isn’t cheap, but we end up doing it all the time as that’s how rendering contents are guaranteed to become visible on the screen. CPU cache flushing, on the other hand, is something you’re never “supposed” to do, as all I/O operations over PCI and communication between CPU cores are cache-coherent. Except for the GPU. So, we end up using some fairly dire slow-paths in the CPU whenever we end up doing this. UXA isn’t supposed to hit cache flushing paths while drawing, but sometimes it still happens. So, you get UXA performance loss sometimes. On the other hand, failing to dynamically map objects into the GTT means that some objects don’t fit, and so EXA spends a huge amount of time copying data around, in which case EXA suffers.
The difference between DRI1 and DRI2 is due in part to the context switch necessary to get buffer swap commands from the DRI2 application to the X server which owns the ‘real’ front buffer. For an application like glxgears which draws almost nothing, and spends most of its time clearing and swapping, the impact can be significant (note, glxgears is not a benchmark, this is just one of many reasons). On the other hand, having private back buffers means that partially obscured applications will draw faster, not having to loop over clip rectangles in the main rendering loop.
The obvious result here is that we’re at a point where application performance goes all over the map, depending on the hardware platform and particular set of configuration options selected.
Light at Tunnel’s End
The good news is that our redesign is now complete, and we have the architecture we want in place throughout the system — global graphics memory management, kernel mode setting and per-window 3D buffers. This means that the rate of new code additions to the driver has dropped dramatically; almost to zero. Going forward, users should expect this ‘perfect’ combination to work more reliably, faster and better as time goes by.
Right now, we continue to spend all of our time stabilizing the code and fixing bugs. A minor but important piece of this work is to get UXA running without GEM so that we have EXA-like performance on older kernels. That should be fairly straightforward as UXA shares all of the same basic EXA acceleration code, and the EXA pixmap migration stuff works best when it works in the most simplistic fashion possible (move to GPU when drawing, move out only under memory pressure), something which we can provide in the GEM emulation layer already present under UXA.
Our overall plan is to focus our efforts on the ‘one true configuration’. The best way to do that is to work on reducing the number of supported configurations until we get to just that one. First on the block are XAA and EXA. XAA because no-one should have to use that anymore, and EXA because it’s just UXA with some pixmap management stuff we don’t need. There’s no reason UXA should be slower than EXA, once the various hidden performance bugs are fixed.
At the same time, DRI1 support will be removed. We cannot support compositing managers under DRI1, nor can we support frame buffer resize and a host of other new features. You’ll still get a desktop without DRI1, you just won’t get accelerated OpenGL. With the necessary infrastructure in the kernel and X server already released, this seems like the right time to switch off a huge pile of code.
Initial measurements from this work show that we’ll be shrinking our codebase by about 10%.
Moving beyond this next quarterly release, the remaining ‘legacy’ piece is the user mode setting code. Something like 50% of the code in the 2D driver relates to this, so removing it will rather significantly reduce our code base. You can only imagine how excited we are about this prospect.
The goal is to take the driver we’ve got and produce a leaner, faster, more stable driver over the next few releases.