UMA Acceleration Architecture

What's up this week

The last week has certainly been entertaining. We're quickly merging a pile of new code into the driver and trying to get everything building in one place so that people can play with stuff before we release.

Getting 2D on top of GEM

One of the big missing pieces last week was getting the 2D driver working with Pixmaps as GEM objects. This is critical as we move towards unified kernel memory management for rendering resources to allow us to use objects across multiple APIs. The most pressing need here is to enable the GLX_EXT_texture_from_pixmap extension in an efficient fashion.

So, what's the plan then? Fairly simple; allocate GEM objects for every pixmap and then use GEM relocations to manage access to them. No need for the 2D driver to even know what's bound to the GTT; it can treat every Pixmap exactly alike and let the kernel manage the low-level hardware details. Our experience with the 3D driver has been quite good; GEM is easy to use and reasonably efficient.

The initial thought was that we'd use EXA's ability to forward pixmap creation back to the driver and have our driver call-back create the GEM object. However, in looking at that, it turns out to have a terrible (and incomplete) API. The driver has no say in the pixmap layout, it must use the EXA-enforced pixel organization. In a land of tiled pixmaps, that's not OK. Further enquiry showed a wealth of other code which is useless in our uniform Pixmap environment. Damage tracking, and enforce hardware synchronization are wasteful performance robbing activities.

Ok, so if EXA isn't what we want, then what is? Well, I like the basic EXA acceleration plan -- accelerate solid fills, copy area and the composite operation and leave everything else to software. In fact, the whole EXA drawing API is just fine, it's just the wasteful EXA code that isn't necessary.

UXA -- the UMA Acceleration Architecture

Ok, so instead of hacking up EXA and trying to make it work for the GEM driver and existing drivers, I decided to just make it work for GEM on UMA hardware and see what it looked like. The hope is that we'll find some way to either patch EXA or at least find a way to share the low-level rendering code between UXA and EXA. For now, UXA lives in the intel driver itself; once we figure out how we want the X server rendering infrastructure to work, we'll merge whatever results back into the core server.

I started UXA by just copying the existing EXA code and running an edit script to change all of the names. Then, I went through the code and removed everything dealing with pixmap migration, damage computation or explicit global hardware synchronization. The only synchronization primitive left is the prepare_access/finish_access pair which signals the start and end of software drawing. The hardware driver is expected to deal with all other synchronization issues itself.

Oddly, GEM does rendering synchronization automatically when rendering with the hardware, and provides simple primitives to provide for software fallbacks. The key here is that we never need to idle the whole chip, we only need to wait for it to finish working on whatever objects are currently being drawn with. The goal is to avoid artificial serialization.

The result is less than 5000 lines of code, as compared to EXA which has about 7500 lines.

Yeah, but does it work?

The short answer is "Yes, it works". The longer answer is "Yes, with limitations". The biggest limitation right now is that GEM objects can only be mapped directly by the CPU. For lots of operations, this is exactly what you want; a fully cached view into the objects as it offers full performance for CPU-bound rendering operations.

However, it has one performance problem and one functional limitation.

The performance problem is that using the CPU cache with these objects means flushing the CPU cache whenever switching between CPU and GPU rendering. CPU cache flushing is horribly expensive, enough so that it's often far better to take the huge performance penalty of using un-cached reads if the number of reads is small.

Yes, we could create write-combining PTEs for this direct mapping, but constructing write-combining PTEs is also really expensive as that involves flushing those PTEs from every CPU TLB, which requires an inter-processor interrupt. Of course, you can't just create a write-combining PTE, you have to make sure that the page it maps is not in any CPU cache, so you have to perform a CPU cache flush as well.

Someday maybe this won't be true; there are plans afoot within the Linux kernel to make this reasonably efficient. Perhaps this will happen before we get our flying cars.

So, it's a performance problem; we can deal with that.

Tiled Surfaces

What we can't deal with is how tiled surfaces work under a CPU map. A normal surface has an entire scanline mapped to a linear section of memory. This places vertically adjacent pixels a fair distance apart in memory. Drawing a vertical line means touching two different cache lines and two different pages. Even a large cache and TLB will not help much if you draw tall objects. Tiled surfaces arrange for nearby screen pixels to be nearby in memory, usually by constructing the surface from a set of rectangular page-sized tiles. Vertically adjacent pixels will then be in the same page, and can even be in the same cache line in some cases.

The performance benefits for tiled surfaces are obvious; fewer cache and TLB misses. The cost to the hardware is fairly small; just some gates to stir addresses around when fetching and storing pixels. However, the cost to software is fairly large; computing the address of a pixel now involves some fairly ugly computation.

We already managed to make Mesa deal with tiled surfaces. That was fairly easy as Mesa has a single span-based pixel fetch and store architecture. Write new span accessing functions and the rest of the sw rendering code just works.

Fixing the X server 2D software rendering code is another matter entirely -- there's a lot of it, and it all wants to touch memory in a linear fashion. Aaron Plattner from nVidia actually did go and whack fb to make it work; every pixel fetch or store goes through a function call which is passed the nominal linear address of the pixel. These accessor/setter functions then munge that address into the actual tiled address. However, that's yet another huge performance impact for software rendering.

Hardware De-Tiling

A better solution is to just use the hardware. When a tiled surface is bound to the GTT, it is visible to everyone using linear addresses; those addresses are swizzled in the hardware and head out to memory in tiled form. There's no performance benefit from the CPU as its TLBs and caches all see the linear address, but it doesn't have to deal in a non-linear space.

The second benefit of the GTT map is that it lives under a write-combining MTRR, so all accesses to memory are write-combining and not write-back. This eliminates all of the CPU cache coherence issues and leaves us back with the old performance that we know and love -- fast writes and really slow reads, but no penalty for switching rapidly between GPU and CPU.

What's Next?

So, the basic Pixmaps-in-GEM code is up and running in the gem-pixmap branch of my driver repository, git://people.freedesktop.org/~keithp/xf86-video-intel. The next step will be to integrate Carl Worth's 965 render changes which place all of the temporary data that it uses into GEM objects as well. That will finish the DRI2 enabling work and allow us to provide zero-copy texture-from-pixmap support.

However, before that can really go main-stream, we need to get the GTT object mapping to fix tiled surface support and get back some performance lost to the CPU cache flushing. We'll see if Kristian is ready with DRI2 tomorrow, if not, I'll probably spend the day figuring out enough additional parts of the Linux MM code to get my GTT maps working.