gem update

Eric and I have been busy hacking away at GEM, the Graphics Execution Manager. GEM is a memory and command ring manager for Intel integrated graphics. GEM itself is working out quite well; we haven't found any terrible surprises in that area. I thought I'd take a few minutes and write up some of the things we've discovered and what we're doing.

Kernel Patches

First off, I've published the GEM kernel patches so that anyone can give this code a try. The first patch was so trivial (exporting shmem_file_setup) that I didn't bother, but the patch to export shmem_getpage is quite a bit longer as it also has to export an associated enum.

There's also a patch for the agp driver which re-writes the GATT on resume. That one isn't GEM-specific, but as we assume the GATT is preserved across VT switches, it's necessary to make GEM survive suspend/resume. It has also been accepted into -mm and should land upstream sometime soon.

Writing data to the GPU

One of the central ideas in GEM is the recognition that cache management plays a huge role in moving data between the CPU and GPU. Because the CPU and GPU are not cache coherent, applications must either use uncached writes from the CPU or explicitly flush the CPU cache to get data transferred. There are several different ways of doing uncached writes:

Uncached page table entries. The requirement here is that all mappings to this page must be uncached, so you can't simply create an uncached mapping when you want writes to be flushed out. The page must be flushed from all CPU caches at allocation time. Worse -- the TLBs of all CPUs must be synchronized so that everyone agrees on the caching mode for every page. While flushing the CPU caches isn't terrible as we can use clflush, the TLB flush requires an IPI, making the allocation of uncached pages very expensive.
Writing through a suitable MTRR-mapped address. This is how we've always done writes to the GPU in the past -- the graphics aperture is covered by write-combining MTRR entries so that writes will be sent to memory. This requires that the destination pages be mapped through the GATT so that they appear under the graphics aperture, which (again) requires that the page contents be flushed from CPU caches. However, TLB entries needn't be flushed as we aren't changing any PTEs.
Non-temporal stores (movnti, movntq, etc). The kernel already uses these when copying data around to avoid filling the cache with useless data. However, non-temporal stores don't actually guarantee that data won't end up sitting in the cache. In particular, if the destination is already sitting in a cache line, then the store will not force that cache line to be flushed. So, while this avoids filling the cache with a lot of additional data, it doesn't provide the necessary guarantee that data will be visible to the GPU.
Using clflush. This makes sure all CPU caches are flushed and the data written to memory. clflush isn't cheap, but as it uses the cache coherence protocols, it need only run on one CPU. Combine this with non-temporal stores and you get a fairly cache-friendly mechanism without the cost of uncached page allocation.

If the kernel offered a cheap way of allocating uncached (or write combining) pages (presumably by constructing a pool of pages ready for uncached use), that might be interesting. However, uncached writes also means uncached reads, and sometimes we do read data back from the GPU. So, we're ignoring this option at present.

Writing data through a write-combining MTRR has always worked reasonably well; there were older processors for which WC writes were slower than WB, but that (fortunately) is no longer true. The big drawback in using the MTRR is that we must allocate a portion of the limited GATT for objects that we want to access in this way. If we want to swap these pages, or need GATT space for other objects we would have to remove them from the GATT. The performance issue here (again) is that reads through a WC mapping are uncached, which means dramatically lower performance.

So, what does all of this mean in the GEM context?

First off, we want to try and treat CPU->GPU data transfers as I/O instead of memory-mapping. This means knowing precisely what data are being moved and when that happens. If we map objects with caching enabled to avoid GATT fun and improve read performance, then when those objects are moved back to the GPU, we must assume that they are entirely dirtied and flush the whole object. Treating this as I/O means having the kernel do all of the writes, which allows all kinds of flexibility on mapping.

Secondly, for operations that can't easily be treated as I/O, it means making explicit choices about where to map objects. When reading an object from the CPU (as when using texture data), we certainly don't want to use an uncached mapping -- if that object isn't written by the CPU, then we needn't flush when switching back to the GPU either. However, for rendering targets as large as the frame buffer, mapping them cached means performing an enormous clflush sequence when moving back to the GPU, so we probably need to make the GATT-based WC mappings work. Currently, GEM doesn't manage this -- all objects are mapped cached, so software fallbacks end up doing a lot of cache flushing.

Using the I/O model to write data from user space into buffers for the GPU leaves us with some flexibility in the kernel implementation. We've tried two different mechanisms and I'm working on a third:

Use the existing pwrite code from the shmem file system. Follow that with calls to clflush when the object is mapped to the GPU. This works out very well when the batch buffers were full, but partially filled buffers end up causing unnecessary clflush calls. Also, the clflush requires an extra kmap_atomic/kunmap_atomic pair.
Map the object to the GATT and then use kmap_atomic_prot_pfn to map pages transiently into kernel space. This gives us WC write performance, eliminating any need to use clflush. Performance is improved, but it abuses kmap_atomic_pfn -- a function which is only really supposed to be given physical memory pages. On kernels with CONFIG_HIGHMEM set, it works out fine, but without that, you get a garbage PTE. Performance here is quite a bit better, eliminating flushing from profiles.
Hand-code the pwrite function to map the pages, copy the data and flush the cache all in one step. I'm hopeful that this will end up as fast as the GATT-based scheme, but avoid the abuse of the kmap_atomic_pfn function.

The first scheme exposed the flushing as an expensive operation; profiles for typical games would have flushing taking 5-7% of the CPU. The second scheme eliminated that, raising performance and lowering CPU usage. We'll see if the third scheme is successful; if not, we'll have to lobby the kernel developers to give us a supported way of transiently mapping I/O devices from kernel space.

Tiling and memory channels

Mapping graphical objects in a linear frame buffer where the data for each scanline is neatly arranged together in memory is an obvious representation for the data; it makes constructing scan-out hardware easy, and also makes writing software rendering code easier as well. Unfortunately, graphical objects generally span adjacent portions of many scanlines. Accessing memory in this order generally runs counter to memory architectures; a vertical line will end up writing one pixel in one cache line of one page. You end up spending a huge amount of time reading/writing cache lines and refilling TLB entries.

The usual solution to this is to have a single page hold pixels for multiple scanlines in the same region of the screen. Tiling the screen with these blocks of pixels provides dramatic performance improvements (we see about a 50% performance improvement from tiling the back buffer). Intel hardware supports two different tiling modes. With 'X' tiling, each page forms a rectangle that is 512 bytes wide by 8 scanlines high. 'Y' tiles are 128 bytes wide by 32 scanlines high.

A separate, but related issue is dealing with multiple memory channels. To see maximum memory bandwidth, the system needs to interleave access between memory channels. The memory system is arranged so that successive cache lines come from alternate memory channels, which means that address bit 6 ends up being the 'channel select' bit. This is related, because tiled graphics breaks the assumption about sequential access -- walk down a tiled buffer and you would hit the same memory channel each time.

To fix this, the hardware actually modifies address bit 6 using other portions of the address. For X tiling, it xor's in bit 9 and 10 of the address when computing bit six; this means that vertically adjacent pixels are always in alternate memory channels. Y tiling uses only bit 9, but the pixels are already stirred around in that format enough that this one bit suffices.

The CPU doesn't share in this particular adventure, so when it accesses these objects directly (not through the GATT), it sees things mixed around.

Of course, sometimes the hardware doesn't bother swizzling bit 6 like this; if you have only a single memory channel, it doesn't help. But, neither does it hurt, so some hardware will swizzle even in this case. We haven't found any registers that tell us when the swizzling is going on.

Not to be left out of the bit 6 fun, the CPU-facing memory controller also improves interleaving by mixing bits up. It can either stir in bit 11 or bit (uh-oh) 17. At least this behavior is documented and visible in a register visible to the CPU. Bit 11 is workable; we just stir that into the mix when computing bit 6 to unswizzle before the memory controller re-swizzles and things work out fine. Bit 17 is problematic. It's not a virtual address bit, it's a physical address bit. Which means that the physical memory layout of data stored in RAM depends on where in memory a page is sitting. Move the page around so that bit-17 changes and the data will flip channels.

Of course, as the GPU does its own bit-6 swizzling for tiled objects, it doesn't bother with the CPU memory swizzle. Which means that tiled data written by the CPU and read by the GPU will appear to flip around, depending on where in physical memory that data resides.

All of these bit-6 adventures are holding up GEM development at present; software fallbacks reading or writing to tiled objects are quite broken on many machines, and the way they're broken depends on how the CPU and GPU memory controllers are set up.

We already have code that does the GPU swizzling and only need to add auto-detection to know when to use it. But the bit-17 CPU swizzling may cause some significant problems. First off, we'd have to make all CPU access to tiled objects go through the GATT, hurting read performance and complicating our mapping code -- user mode doesn't know anything about physical addresses and so couldn't swizzle. Secondly, we would have to find some way to ensure that bit 17 of all tiled pages didn't change across swap operations (as swapping will read and write through the CPU swizzle). That would either mean pinning tiled objects in memory (ouch), or hacking up the kernel memory manager to add a very strange constraint on page allocation, or swizzle pages before mapping them to the GATT.