Keithp.com/ blog
UMA Acceleration Architecture

What’s up this week

The last week has certainly been entertaining. We’re quickly merging a pile of new code into the driver and trying to get everything building in one place so that people can play with stuff before we release.

Getting 2D on top of GEM

One of the big missing pieces last week was getting the 2D driver working with Pixmaps as GEM objects. This is critical as we move towards unified kernel memory management for rendering resources to allow us to use objects across multiple APIs. The most pressing need here is to enable the GLX_EXT_texture_from_pixmap extension in an efficient fashion.

So, what’s the plan then? Fairly simple; allocate GEM objects for every pixmap and then use GEM relocations to manage access to them. No need for the 2D driver to even know what’s bound to the GTT; it can treat every Pixmap exactly alike and let the kernel manage the low-level hardware details. Our experience with the 3D driver has been quite good; GEM is easy to use and reasonably efficient.

The initial thought was that we’d use EXA’s ability to forward pixmap creation back to the driver and have our driver call-back create the GEM object. However, in looking at that, it turns out to have a terrible (and incomplete) API. The driver has no say in the pixmap layout; it must use the EXA-enforced pixel organization. In a land of tiled pixmaps, that’s not OK. Further enquiry showed a wealth of other code which is useless in our uniform Pixmap environment. Damage tracking and enforced hardware synchronization are wasteful, performance-robbing activities.

Ok, so if EXA isn’t what we want, then what is? Well, I like the basic EXA acceleration plan — accelerate solid fills, copy area and the composite operation and leave everything else to software. In fact, the whole EXA drawing API is just fine, it’s just the wasteful EXA code that isn’t necessary.

UXA — the UMA Acceleration Architecture

Ok, so instead of hacking up EXA and trying to make it work for the GEM driver and existing drivers, I decided to just make it work for GEM on UMA hardware and see what it looked like. The hope is that we’ll find some way to either patch EXA or at least find a way to share the low-level rendering code between UXA and EXA. For now, UXA lives in the intel driver itself; once we figure out how we want the X server rendering infrastructure to work, we’ll merge whatever results back into the core server.

I started UXA by just copying the existing EXA code and running an edit script to change all of the names. Then, I went through the code and removed everything dealing with pixmap migration, damage computation or explicit global hardware synchronization. The only synchronization primitive left is the prepare_access/finish_access pair which signals the start and end of software drawing. The hardware driver is expected to deal with all other synchronization issues itself.

Fortunately, GEM does rendering synchronization automatically when rendering with the hardware, and offers simple primitives for software fallbacks. The key here is that we never need to idle the whole chip; we only need to wait for it to finish working on whatever objects are currently being drawn with. The goal is to avoid artificial serialization.

The result is fewer than 5000 lines of code, compared to about 7500 lines for EXA.

Yeah, but does it work?

The short answer is “Yes, it works”. The longer answer is “Yes, with limitations”. The biggest limitation right now is that GEM objects can only be mapped directly by the CPU. For lots of operations, this is exactly what you want; a fully cached view into the objects as it offers full performance for CPU-bound rendering operations.

However, it has one performance problem and one functional limitation.

The performance problem is that using the CPU cache with these objects means flushing the CPU cache whenever switching between CPU and GPU rendering. CPU cache flushing is horribly expensive, enough so that it’s often far better to take the huge performance penalty of using un-cached reads if the number of reads is small.

Yes, we could create write-combining PTEs for this direct mapping, but constructing write-combining PTEs is also really expensive as that involves flushing those PTEs from every CPU TLB, which requires an inter-processor interrupt. Of course, you can’t just create a write-combining PTE, you have to make sure that the page it maps is not in any CPU cache, so you have to perform a CPU cache flush as well.

Someday maybe this won’t be true; there are plans afoot within the Linux kernel to make this reasonably efficient. Perhaps this will happen before we get our flying cars.

So, it’s a performance problem; we can deal with that.

Tiled Surfaces

What we can’t deal with is how tiled surfaces work under a CPU map. A normal surface has an entire scanline mapped to a linear section of memory. This places vertically adjacent pixels a fair distance apart in memory. Drawing a vertical line means touching a different cache line, and often a different page, for every pixel. Even a large cache and TLB will not help much if you draw tall objects. Tiled surfaces arrange for nearby screen pixels to be nearby in memory, usually by constructing the surface from a set of rectangular page-sized tiles. Vertically adjacent pixels will then be in the same page, and can even be in the same cache line in some cases.

The performance benefits for tiled surfaces are obvious; fewer cache and TLB misses. The cost to the hardware is fairly small; just some gates to stir addresses around when fetching and storing pixels. However, the cost to software is fairly large; computing the address of a pixel now involves some fairly ugly computation.
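To make that "ugly computation" a bit more concrete, here is a minimal Python sketch comparing the byte offset of a pixel in a linear surface with its offset in a tiled one. It assumes page-sized tiles 512 bytes wide by 8 rows high, stored left to right and top to bottom, with rows inside each tile stored in order, and it ignores any hardware address swizzling. The function names and the 4-bytes-per-pixel default are illustrative choices, not anything taken from the driver.

```python
# Assumed tile geometry: one 4096-byte page holds 8 rows of 512 bytes.
TILE_WIDTH_BYTES = 512
TILE_HEIGHT_ROWS = 8
TILE_SIZE_BYTES = TILE_WIDTH_BYTES * TILE_HEIGHT_ROWS


def linear_offset(x, y, pitch, cpp=4):
    """Byte offset of pixel (x, y) when each scanline is contiguous."""
    return y * pitch + x * cpp


def tiled_offset(x, y, pitch, cpp=4):
    """Byte offset of pixel (x, y) in the assumed tiled layout."""
    if pitch % TILE_WIDTH_BYTES:
        raise ValueError("pitch must be a whole number of tiles")
    tiles_per_row = pitch // TILE_WIDTH_BYTES
    # Split the horizontal byte position and the row into tile + remainder.
    tile_col, byte_in_row = divmod(x * cpp, TILE_WIDTH_BYTES)
    tile_row, row_in_tile = divmod(y, TILE_HEIGHT_ROWS)
    tile_index = tile_row * tiles_per_row + tile_col
    return (tile_index * TILE_SIZE_BYTES
            + row_in_tile * TILE_WIDTH_BYTES
            + byte_in_row)


# Moving one pixel down: a full pitch away when linear, 512 bytes when tiled.
PITCH = 2048  # hypothetical surface, 512 pixels wide at 4 bytes per pixel
step_linear = linear_offset(0, 1, PITCH) - linear_offset(0, 0, PITCH)
step_tiled = tiled_offset(0, 1, PITCH) - tiled_offset(0, 0, PITCH)
```

With this layout the eight pixels of a short vertical run all land inside a single page, which is the whole point of tiling; the price is the divide-and-remainder dance above for every address.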

We already managed to make Mesa deal with tiled surfaces. That was fairly easy as Mesa has a single span-based pixel fetch and store architecture. Write new span accessing functions and the rest of the sw rendering code just works.

Fixing the X server 2D software rendering code is another matter entirely — there’s a lot of it, and it all wants to touch memory in a linear fashion. Aaron Plattner from nVidia actually did go and whack fb to make it work; every pixel fetch or store goes through a function call which is passed the nominal linear address of the pixel. These accessor/setter functions then munge that address into the actual tiled address. However, that’s yet another huge performance impact for software rendering.

Hardware De-Tiling

A better solution is to just use the hardware. When a tiled surface is bound to the GTT, it is visible to everyone using linear addresses; those addresses are swizzled in the hardware and head out to memory in tiled form. There’s no performance benefit for the CPU as its TLBs and caches all see the linear address, but it doesn’t have to deal in a non-linear space.

The second benefit of the GTT map is that it lives under a write-combining MTRR, so all accesses to memory are write-combining and not write-back. This eliminates all of the CPU cache coherence issues and leaves us back with the old performance that we know and love — fast writes and really slow reads, but no penalty for switching rapidly between GPU and CPU.

What’s Next?

So, the basic Pixmaps-in-GEM code is up and running in the gem-pixmap branch of my driver repository, git://people.freedesktop.org/~keithp/xf86-video-intel. The next step will be to integrate Carl Worth’s 965 render changes which place all of the temporary data that it uses into GEM objects as well. That will finish the DRI2 enabling work and allow us to provide zero-copy texture-from-pixmap support.

However, before that can really go main-stream, we need to get the GTT object mapping to fix tiled surface support and get back some performance lost to the CPU cache flushing. We’ll see if Kristian is ready with DRI2 tomorrow, if not, I’ll probably spend the day figuring out enough additional parts of the Linux MM code to get my GTT maps working.

Posted Tue Aug 5 23:28:32 2008
X output status july 2008

Ok, so I didn’t get a lot of time for coding last week. And, this week there’s OSCON, so coding time will be short again. I figured I should spend some time writing up a brief report about where X output is at today.

Output Hotplug

I think this stuff is fairly solid these days, although we don’t have much in the way of auto-detection of monitor connect/disconnect. There are two reasons here:

  1. The hardware notifies the operating system via an interrupt. Given mode setting code in user space, dealing with interrupts is a huge pain and hence hasn’t been hooked up yet (see below).

  2. Analog outputs (VGA, TV) do detection using impedance changes in the output signal path. This means we have to keep them active if we want to detect a connection. That takes a lot of power (about 1W to light up the VGA output without a monitor connected). What we could do is detect when a monitor was unplugged; that’s free.

There are a few other random improvements that are coming soon, like CEA additions to the EDID parsing code. These are additional data blocks that follow the standard EDID data and are used for ‘consumer electronics’ devices. Supporting these should make more HDMI monitors ‘just work’.

Initial Mode Selection

Detecting connected monitors is fine, but one thing we haven’t really solved is what to do when you have more than one connected when the server starts. My initial code would pick one ‘primary’ monitor, light that up at its preferred size and then pick modes for the other monitors which were as close as possible to the primary monitor size without being larger. Obviously, I liked that as it meant my laptop always came up looking correct on the LVDS and my external VGA would show most of the screen.

However, this was reported to confuse a lot of users. I can imagine that starting the X server with one of the outputs connected but not turned on would make for some ‘interesting’ support calls. So, now the X server picks a mode which all outputs can support and uses that everywhere. Sadly, this means that my laptop panel gets some random scaled mode (usually 1024x768) which looks quite awful.

I think we need something better than either of these choices, but I’m not quite sure how it should work.
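The two policies are easy to compare side by side. The following Python sketch is only a model of the behavior described above, not the X server's code: modes are plain (width, height) tuples, the mode lists are made up, and picking the largest candidate by pixel area is my own tie-breaking assumption.

```python
def area(mode):
    width, height = mode
    return width * height


def fit_to_primary(primary_mode, modes):
    """Old policy: largest mode that is no larger than the primary's mode."""
    pw, ph = primary_mode
    fitting = [m for m in modes if m[0] <= pw and m[1] <= ph]
    return max(fitting, key=lambda m: (area(m), m)) if fitting else None


def common_mode(outputs):
    """New policy: a mode every output supports (largest one, by assumption)."""
    if not outputs:
        return None
    shared = set.intersection(*(set(modes) for modes in outputs.values()))
    return max(shared, key=lambda m: (area(m), m)) if shared else None


# Hypothetical mode lists for a laptop panel and an external monitor.
lvds = [(1440, 900), (1024, 768), (800, 600)]
vga = [(1600, 1200), (1280, 1024), (1280, 800), (1024, 768), (800, 600)]

old_vga_mode = fit_to_primary(lvds[0], vga)
new_shared_mode = common_mode({"LVDS": lvds, "VGA": vga})
```

With these lists the old policy keeps the panel at 1440x900 and gives the VGA output 1280x800, while the new policy drops both to 1024x768, which is exactly the scaled-panel result complained about above.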

Kernel Mode Setting

A bunch of people, including Jesse Barnes and Dave Airlie, have been hacking to move the output configuration code into the kernel. This will solve lots of little problems, like how to display kernel panic messages, and how to deal with interrupts for output hotplug.

This code is up and running fairly well these days, but depends on a kernel memory manager to deal with frame buffers. The integration of GEM into the kernel is blocking this work, but I’m hopeful that this will be sorted out in the next couple of weeks.

GEM — the Graphics Execution Manager

Work here was stalled for a few weeks while we sorted out memory channel interleaving issues. Now things are moving again, and we’re working on getting it stable enough to merge into master. That means fixing a few more critical bugs that the Intel QA team has identified.

One of these bugs is that our GL conformance tests weren’t working right; that turned out to be caused by tests reading back data from the frame buffer one pixel at a time. Our read-back path passed through the GEM memory domain code to pull objects back from GTT space to CPU space. That meant flushing the front, back and depth buffers from the CPU cache. With each of those at 16MB, reading a single pixel took long enough that the tests would time-out. Increasing the timeouts to ‘way too long’ is making them run, but tests which would complete in a few hours are now taking days.

We’ve got two different plans for fixing the read-back path:

  1. Use pread to access precisely the data we need. This would involve flushing a single cache line for the tests above.

  2. Mapping the back buffer through the GTT. This would eliminate the need to clflush anything as the GTT mappings are write combining and so reads bypass the cache.

Eric is working on the former, and I’m working on the latter. More news later (this week?) when we see which one wins.
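A back-of-the-envelope calculation shows why single-pixel read-back was so slow and what each plan would save. This sketch assumes 64-byte cache lines, treats the 16MB buffers as 16 MiB, and assumes a 4-byte pixel that does not straddle a cache line; the names are made up for illustration.

```python
CACHE_LINE_BYTES = 64             # assumed cache line size
BUFFER_BYTES = 16 * 1024 * 1024   # each of front, back and depth
BUFFER_COUNT = 3


def cache_lines(nbytes, line_bytes=CACHE_LINE_BYTES):
    """Number of cache lines needed to cover nbytes (rounded up)."""
    return -(-nbytes // line_bytes)


# Current path: every buffer is flushed in full for each pixel read.
flush_everything = BUFFER_COUNT * cache_lines(BUFFER_BYTES)

# Plan 1 (pread): flush only the line holding the requested pixel.
flush_for_pread = cache_lines(4)

# Plan 2 (GTT map): write-combining reads bypass the cache entirely.
flush_for_gtt_map = 0

print(flush_everything, flush_for_pread, flush_for_gtt_map)
```

Under those assumptions the current path flushes several hundred thousand cache lines to fetch four bytes, which makes the multi-day test runs unsurprising.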

Composite Acceleration

With Owen Taylor’s change to the glyph management code in the server, Eric and Carl were able to change the driver to batch multiple glyph drawing operations into a command buffer. Once Carl had this working, we went from 13000 glyphs/sec to 103000 glyphs/sec. Obviously we’re hoping for even larger improvements as a pure software solution is well over 1 million glyphs/sec. Even still, 103000 glyphs/sec is enough to make my desktop vastly more usable, and using the software path means losing a lot of other useful acceleration.

DRI2 — Redirected Direct Rendering

Right now, direct rendered GL applications (which is the fastest way we can do GL at present) get drawn to a giant screen-sized back buffer and then copied from there to the screen at swap buffers time. Because everyone shares the same back buffer, you get to clip your drawing as if you were drawing directly to the screen. While this normally doesn’t matter much (aside from some performance costs associated with lots of clip rectangles), when you’re running a compositing manager (like compiz), the 3D applications end up ignoring the per-window offscreen pixmap and spam their output directly to the real frame buffer.

DRI2, written by Kristian Høgsberg, solves this by changing how direct rendering works and giving everyone a private back-buffer to draw to. Now, at buffer swap time, that private back-buffer can be copied to the window’s pixmap and compiz is happy.

This work has been around for a few months, but depends on a TTM-based memory manager. That dependency isn’t very strong, and krh has promised to fix it shortly. Once that’s done, getting the GEM driver to support DRI2 won’t take long, and we’ll have our fully composited desktop running. With luck, that’ll happen before September.

Final Words

As you can see, we’re nearing the end of our long X output rework saga, with most of the pieces falling into place in the next month or two.

Posted Mon Jul 21 18:25:43 2008
intel gfxcon 2008

With our channel-interleaving mostly sorted out, Eric and I spent a short time figuring out what to do about it all. The important thing we learned is that the hardware has two modes: linear and tiled. The CPU always accesses memory in linear mode, while the GPU can either use linear or tiled mode. What is important here is that ‘linear’ mode uses the same interleaving, whether from the GPU or CPU. So, we can only get into trouble if we use tiled mode from the GPU and linear mode from the CPU.

We came up with a fairly simple plan to resolve this issue:

On Wednesday and Thursday, Jesse Barnes, Zou Nan Hai and I got to attend the 2008 Intel Gfxcon. Put on by Intel GDG (no, I don’t know what that means, but it’s the Intel integrated graphics group), it was held in Folsom, CA. The trip down on the Intel shuttle was uneventful, but on arrival I found the 42° air full of smoke from wild fires. My plan to bicycle from the hotel to Intel suddenly seemed like it would be a lot less fun.

The conference was huge fun. As usual, meeting people in person is always better than via email or even on a teleconference. After working at Intel for nearly three years now, I’m starting to feel a bit more a part of the organization and less of an outsider. Linux continues to gain mindshare within the company as it gains visibility in our customers’ products.

I got to present our current GEM work to a fair-sized crowd, including several people from our Windows driver development team. I was interested to hear what they thought about the architecture and was pleased to learn that a lot of what we’re doing is similar to how the Vista driver works. While we can’t share source code, it is at least nice that we can share ideas about how best to drive the hardware at the lowest levels of the system.

That evening, Jesse, Nan Hai and I managed to find decent steak-and-potatoes, but our attempt to locate gelato ended in near-failure — Google led us to a mini-mall in a neighboring town. Upon failing to locate the expected restaurant, we enquired in an Italian place who first told us that the gelato place had closed five years earlier, and then convinced us that they could provide gelato. Served freezer-burnt ice cream, we quickly left for our respective lodgings.

The next evening we found acceptable California-style Mexican food. As usual, portions were large enough to push any thought of dessert from our minds. We lumbered back to Nan Hai’s hotel room and hacked for several hours, although the free wifi there left a lot to be desired. Then, we chatted about how to get rid of tearing in textured video.

For vblank synchronized textured video, I’m hoping we’ll be able to queue the update to the kernel and have it perform the necessary blts. This would mean interrupting the current command stream and switching over to a separate stream. It probably means using separate hardware contexts, which would be a good thing in any case as we could eliminate the per-batch-buffer configuration that the 3D driver currently performs. Work on this will need to start with multiple hardware context support, then move on to interrupting the ring and then figuring out how to manage the blts along with clip list changes etc.

Posted Mon Jul 14 22:19:41 2008
gem update

Eric and I have been busy hacking away at GEM, the Graphics Execution Manager. GEM is a memory and command ring manager for Intel integrated graphics. GEM itself is working out quite well; we haven’t found any terrible surprises in that area. I thought I’d take a few minutes and write up some of the things we’ve discovered and what we’re doing.

Kernel Patches

First off, I’ve published the GEM kernel patches so that anyone can give this code a try. The first patch was so trivial (exporting shmem_file_setup) that I didn’t bother, but the patch to export shmem_getpage is quite a bit longer as it also has to export an associated enum.

There’s also a patch for the agp driver which re-writes the GATT on resume. That one isn’t GEM-specific, but as we assume the GATT is preserved across VT switches, it’s necessary to make GEM survive suspend/resume. It has also been accepted into -mm and should land upstream sometime soon.

Writing data to the GPU

One of the central ideas in GEM is the recognition that cache management plays a huge role in moving data between the CPU and GPU. Because the CPU and GPU are not cache coherent, applications must either use uncached writes from the CPU or explicitly flush the CPU cache to get data transferred. There are several different ways of doing uncached writes:

  1. Uncached page table entries. The requirement here is that all mappings to this page must be uncached, so you can’t simply create an uncached mapping when you want writes to be flushed out. The page must be flushed from all CPU caches at allocation time. Worse — the TLBs of all CPUs must be synchronized so that everyone agrees on the caching mode for every page. While flushing the CPU caches isn’t terrible as we can use clflush, the TLB flush requires an IPI, making the allocation of uncached pages very expensive.

  2. Writing through a suitable MTRR-mapped address. This is how we’ve always done writes to the GPU in the past — the graphics aperture is covered by write-combining MTRR entries so that writes will be sent to memory. This requires that the destination pages be mapped through the GATT so that they appear under the graphics aperture, which (again) requires that the page contents be flushed from CPU caches. However, TLB entries needn’t be flushed as we aren’t changing any PTEs.

  3. Non-temporal stores (movnti, movntq, etc). The kernel already uses these when copying data around to avoid filling the cache with useless data. However, non-temporal stores don’t actually guarantee that data won’t end up sitting in the cache. In particular, if the destination is already sitting in a cache line, then the store will not force that cache line to be flushed. So, while this avoids filling the cache with a lot of additional data, it doesn’t provide the necessary guarantee that data will be visible to the GPU.

  4. Using clflush. This makes sure all CPU caches are flushed and the data written to memory. clflush isn’t cheap, but as it uses the cache coherence protocols, it need only run on one CPU. Combine this with non-temporal stores and you get a fairly cache-friendly mechanism without the cost of uncached page allocation.

If the kernel offered a cheap way of allocating uncached (or write combining) pages (presumably by constructing a pool of pages ready for uncached use), that might be interesting. However, uncached writes also mean uncached reads, and sometimes we do read data back from the GPU. So, we’re ignoring this option at present.

Writing data through a write-combining MTRR has always worked reasonably well; there were older processors for which WC writes were slower than WB, but that (fortunately) is no longer true. The big drawback in using the MTRR is that we must allocate a portion of the limited GATT for objects that we want to access in this way. If we want to swap these pages, or need GATT space for other objects we would have to remove them from the GATT. The performance issue here (again) is that reads through a WC mapping are uncached, which means dramatically lower performance.

So, what does all of this mean in the GEM context?

First off, we want to try and treat CPU->GPU data transfers as I/O instead of memory-mapping. This means knowing precisely what data are being moved and when that happens. If we map objects with caching enabled to avoid GATT fun and improve read performance, then when those objects are moved back to the GPU, we must assume that they are entirely dirtied and flush the whole object. Treating this as I/O means having the kernel do all of the writes, which allows all kinds of flexibility on mapping.

Secondly, for operations that can’t easily be treated as I/O, it means making explicit choices about where to map objects. When reading an object from the CPU (as when using texture data), we certainly don’t want to use an uncached mapping — if that object isn’t written by the CPU, then we needn’t flush when switching back to the GPU either. However, for rendering targets as large as the frame buffer, mapping them cached means performing an enormous clflush sequence when moving back to the GPU, so we probably need to make the GATT-based WC mappings work. Currently, GEM doesn’t manage this — all objects are mapped cached, so software fallbacks end up doing a lot of cache flushing.

Using the I/O model to write data from user space into buffers for the GPU leaves us with some flexibility in the kernel implementation. We’ve tried two different mechanisms and I’m working on a third:

  1. Use the existing pwrite code from the shmem file system. Follow that with calls to clflush when the object is mapped to the GPU. This works out very well when the batch buffers are full, but partially filled buffers end up causing unnecessary clflush calls. Also, the clflush requires an extra kmap_atomic/kunmap_atomic pair.

  2. Map the object to the GATT and then use kmap_atomic_prot_pfn to map pages transiently into kernel space. This gives us WC write performance, eliminating any need to use clflush; flushing disappears from profiles and performance is quite a bit better. However, it abuses kmap_atomic_pfn, a function which is only really supposed to be given physical memory pages. On kernels with CONFIG_HIGHMEM set, it works out fine, but without that, you get a garbage PTE.

  3. Hand-code the pwrite function to map the pages, copy the data and flush the cache all in one step. I’m hopeful that this will end up as fast as the GATT-based scheme, but avoid the abuse of the kmap_atomic_pfn function.

The first scheme exposed the flushing as an expensive operation; profiles for typical games would have flushing taking 5-7% of the CPU. The second scheme eliminated that, raising performance and lowering CPU usage. We’ll see if the third scheme is successful; if not, we’ll have to lobby the kernel developers to give us a supported way of transiently mapping I/O devices from kernel space.
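To see why partially filled buffers hurt under the first scheme, it helps to count cache lines. This is a rough Python sketch, assuming 64-byte cache lines and a hypothetical 16 KiB batch buffer; it only does the arithmetic and is not how the kernel code is structured.

```python
CACHE_LINE_BYTES = 64          # assumed cache line size
BATCH_BUFFER_BYTES = 16384     # hypothetical batch buffer size


def lines_covering(offset, length, line_bytes=CACHE_LINE_BYTES):
    """Cache lines touched by the byte range [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // line_bytes
    last = (offset + length - 1) // line_bytes
    return last - first + 1


# Flushing the whole object after pwrite, regardless of how much was written.
whole_buffer = lines_covering(0, BATCH_BUFFER_BYTES)

# Flushing only what a 1000-byte pwrite actually touched.
written_only = lines_covering(0, 1000)
```

A pwrite that knows exactly which bytes it copied can flush just those lines as it goes, which is the appeal of treating the transfer as I/O and of the hand-coded third scheme.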

Tiling and memory channels

Mapping graphical objects in a linear frame buffer where the data for each scanline is neatly arranged together in memory is an obvious representation for the data; it makes constructing scan-out hardware easy, and makes writing software rendering code easier as well. Unfortunately, graphical objects generally span adjacent portions of many scanlines. Accessing memory in this order generally runs counter to memory architectures; a vertical line will end up writing just one pixel in each cache line, and often just one in each page, that it touches. You end up spending a huge amount of time reading/writing cache lines and refilling TLB entries.

The usual solution to this is to have a single page hold pixels for multiple scanlines in the same region of the screen. Tiling the screen with these blocks of pixels provides dramatic performance improvements (we see about a 50% performance improvement from tiling the back buffer). Intel hardware supports two different tiling modes. With ‘X’ tiling, each page forms a rectangle that is 512 bytes wide by 8 scanlines high. ‘Y’ tiles are 128 bytes wide by 32 scanlines high.
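The effect on the TLB is easy to estimate. The following Python sketch counts how many 4096-byte pages a vertical line touches in a linear layout versus the two tile shapes given above. It assumes the line starts at the top of a tile row and at the left edge of the surface, and the 4096-byte pitch (1024 pixels at 4 bytes each) is just an example.

```python
PAGE_BYTES = 4096
X_TILE = (512, 8)    # bytes wide, scanlines high
Y_TILE = (128, 32)


def pages_touched_linear(height, pitch_bytes):
    """Distinct pages hit by a vertical line down the left edge."""
    return len({(row * pitch_bytes) // PAGE_BYTES for row in range(height)})


def pages_touched_tiled(height, tile):
    """Each page-sized tile covers tile_rows scanlines of the line."""
    _, tile_rows = tile
    return -(-height // tile_rows)


HEIGHT = 256
PITCH = 4096
linear_pages = pages_touched_linear(HEIGHT, PITCH)
x_tiled_pages = pages_touched_tiled(HEIGHT, X_TILE)
y_tiled_pages = pages_touched_tiled(HEIGHT, Y_TILE)
```

Both tile shapes multiply out to exactly one page, so the only difference between them is how that page is split between width and height.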

A separate, but related issue is dealing with multiple memory channels. To see maximum memory bandwidth, the system needs to interleave access between memory channels. The memory system is arranged so that successive cache lines come from alternate memory channels, which means that address bit 6 ends up being the ‘channel select’ bit. This is related, because tiled graphics breaks the assumption about sequential access — walk down a tiled buffer and you would hit the same memory channel each time.

To fix this, the hardware actually modifies address bit 6 using other portions of the address. For X tiling, it xor’s in bit 9 and 10 of the address when computing bit six; this means that vertically adjacent pixels are always in alternate memory channels. Y tiling uses only bit 9, but the pixels are already stirred around in that format enough that this one bit suffices.
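Here is a small Python model of that address munging, as I read the description above: bit 6 is XORed with bits 9 and 10 for X tiling, and with bit 9 alone for Y tiling. The function names are invented for this sketch, and the example walks down the left column of one X tile assuming rows are 512 bytes apart.

```python
def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def swizzle_x(addr):
    """X tiling: fold bits 9 and 10 into bit 6."""
    return addr ^ ((bit(addr, 9) ^ bit(addr, 10)) << 6)


def swizzle_y(addr):
    """Y tiling: fold bit 9 into bit 6."""
    return addr ^ (bit(addr, 9) << 6)


# Walk down the first column of one X tile, one 512-byte row at a time.
rows = [row * 512 for row in range(8)]
channels_plain = [bit(addr, 6) for addr in rows]
channels_swizzled = [bit(swizzle_x(addr), 6) for addr in rows]
```

Without the swizzle every row of that column selects the same channel; with it, the eight rows are split evenly across both. Since bits 9 and 10 are left untouched, applying the same function twice gets the original address back, which is what lets software undo the swizzle.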

The CPU doesn’t share in this particular adventure, so when it accesses these objects directly (not through the GATT), it sees things mixed around.

Of course, sometimes the hardware doesn’t bother swizzling bit 6 like this; if you have only a single memory channel, it doesn’t help. But, neither does it hurt, so some hardware will swizzle even in this case. We haven’t found any registers that tell us when the swizzling is going on.

Not to be left out of the bit 6 fun, the CPU-facing memory controller also improves interleaving by mixing bits up. It can either stir in bit 11 or bit (uh-oh) 17. At least this behavior is documented and visible in a register visible to the CPU. Bit 11 is workable; we just stir that into the mix when computing bit 6 to unswizzle before the memory controller re-swizzles and things work out fine. Bit 17 is problematic. It’s not a virtual address bit, it’s a physical address bit. Which means that the physical memory layout of data stored in RAM depends on where in memory a page is sitting. Move the page around so that bit-17 changes and the data will flip channels.

Of course, as the GPU does its own bit-6 swizzling for tiled objects, it doesn’t bother with the CPU memory swizzle. Which means that tiled data written by the CPU and read by the GPU will appear to flip around, depending on where in physical memory that data resides.
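The difference between the bit-11 and bit-17 cases can be shown with a toy model. This Python sketch assumes the memory controller simply XORs the chosen physical address bit into bit 6, and uses two made-up physical page addresses that differ only in bit 17; the real controller may well be more involved.

```python
def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def cpu_swizzle(phys_addr, stir_bit):
    """Toy model: XOR physical address bit 11 or 17 into bit 6."""
    if stir_bit not in (11, 17):
        raise ValueError("only bit 11 or bit 17 is modelled here")
    return phys_addr ^ (bit(phys_addr, stir_bit) << 6)


def channel(phys_addr, stir_bit):
    """Channel select (bit 6) after the CPU-side swizzle."""
    return bit(cpu_swizzle(phys_addr, stir_bit), 6)


OFFSET_IN_PAGE = 0x040   # same byte offset within the page in both cases
PAGE_A = 0x00000         # hypothetical physical page, bit 17 clear
PAGE_B = 0x20000         # hypothetical physical page, bit 17 set

bit11_channels = (channel(PAGE_A + OFFSET_IN_PAGE, 11),
                  channel(PAGE_B + OFFSET_IN_PAGE, 11))
bit17_channels = (channel(PAGE_A + OFFSET_IN_PAGE, 17),
                  channel(PAGE_B + OFFSET_IN_PAGE, 17))
```

Bit 11 lies inside the 4096-byte page offset, so it is the same whether you look at the virtual or the physical address and the result does not depend on where the page lives. Bit 17 is part of the page frame number, so the same byte of the same object lands on a different channel after the page moves.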

All of these bit-6 adventures are holding up GEM development at present; software fallbacks reading or writing to tiled objects are quite broken on many machines, and the way they’re broken depends on how the CPU and GPU memory controllers are set up.

We already have code that does the GPU swizzling and only need to add auto-detection to know when to use it. But the bit-17 CPU swizzling may cause some significant problems. First off, we’d have to make all CPU access to tiled objects go through the GATT, hurting read performance and complicating our mapping code — user mode doesn’t know anything about physical addresses and so couldn’t swizzle. Secondly, we would have to find some way to ensure that bit 17 of all tiled pages didn’t change across swap operations (as swapping will read and write through the CPU swizzle). That would either mean pinning tiled objects in memory (ouch), or hacking up the kernel memory manager to add a very strange constraint on page allocation, or swizzle pages before mapping them to the GATT.

Posted Fri Jul 4 13:25:03 2008
kernel-mode-drivers

I mentioned in passing during my Linux.Conf.Au talk that we were looking at moving portions of the video drivers into the kernel. Others, including Alan Coopersmith, have started to chime in on this note and I thought I should write a bit more about what I was thinking.

At least a few of the goals are as follows:

  1. BIOS-free Suspend/Resume support
  2. Text-mode panic
  3. Flicker-free graphical boot
  4. Fancy animated user switching

A significant non-goal here is an in-kernel unified API for graphics acceleration. Our current high-level architecture for rendering is in good shape, with user-mode filling a ring buffer with hardware-specific commands and the kernel just dispatching buffers.

Ok, so as you can see from the list above, I’m just talking about device detection, configuration and video mode selection.

Why this sudden desire to move video mode selection into the kernel? Well, for years, we were told by several video card vendors that there was “no way” we could ever figure out how to do mode setting without using the BIOS. If you believe you must use the BIOS, it’s difficult to see how that can be done from within the kernel; executing the BIOS requires either an x86 emulator or vm86 support, neither of which belongs in the kernel. For a long time, I believed the video chip vendors. Silly me.

Several people, including Luc Verhaegen, had been moving BIOS-based drivers towards native mode setting. While the BIOS was a nice crutch, real support requires us to follow other operating systems and program the hardware directly. It’s the only way to get at the full range of hardware capabilities, although it does require a lot more code. And, some machines will not quite work right until magic tweaks are added. On balance, the number of machines fixed is greater than the number of machines broken (by a huge margin, given the lack of BIOS support for the native panel mode in many laptops).

Now, if you have to have BIOS support for video mode selection, you have to wait for usermode to wake up before you can program the video card. There’s not a lot you can do early in the kernel, without some joyous adventure involving initrd devices that include a complete X server. Ick.

Once you embrace native mode setting, suddenly the range of options opens up and you can think about what might be possible if this code were moved down into the kernel. Of course, the first thought is how little you can move.

So, with that, we can start enumerating what capabilities must be in kernel mode to solve the list of problems above.

First, to get BIOS-free suspend/resume support working, we already require the ability to save and restore the entire graphics state, including synchronizing with applications performing graphical operations. For video systems involving external chips (most, these days), this also means saving and restoring the state of those chips.

While we may mock Windows and the BSOD, many people would be far happier with that than the current state of an Xorg system where kernel panics are not announced at all; instead, the screen simply freezes and the user has no idea what has happened. Even the ability to escape from this state is limited to those with the secret handshake knowledge. Getting back to a simple text mode and displaying the panic message requires the ability to program a fixed mode from within the kernel, and the ability to lock out other users of the graphics device (perhaps waiting for the hardware pipeline to drain, if necessary).

Eliminating the screen flashing during boot-up will require the ability to set the desired final graphical mode as soon as the kernel is running (or, even from the boot loader?). I suggest that the precise target mode can be computed during system installation time (and changed for next boot by the user session), so this only requires the ability to read the desired mode configuration from somewhere at boot time. Kristian Høgsberg has been working on an early switch to the final graphical mode, but hasn’t managed to eliminate the flashing.

I envision fancy user switching being implemented by a transition from one X server to another rather than trying to get two user sessions running inside the same X server. Right now, we switch users with a VT switch where the first session switches back to text mode and the second switches back to graphical mode. Flash, Flash. Having the kernel recognise that the two X servers were using the same mode will eliminate the flashing nicely; the only remaining issue is how to animate from one session to the next. I think that’s largely a matter of being able to allocate multiple front buffers and creating an animation client that uses two X server front buffer images to construct the intermediate representations.

Finally, building a kernel API for all of this has become possible because we’ve all come to recognise that there are commonalities in mode selection across video hardware. The Intel driver had some separation between CRTC and output when we started working on it. Luc’s Via driver has been moving this direction for some time. While the Radeon driver uses a different approach (it sets everything on the card every time you touch anything), it too recognises the distinction between CRTC and Output. Matthew Tippett suggested that even the closed source ATI driver worked this way as he, along with Kevin Martin, redesigned RandR 1.2 to work this way.

Because we now have a reasonably common abstraction across a wide range of drivers, it now seems tractable to produce a common API. We’ve already started this inside the X server itself; the hope is that working on the common layer there will inform choices about how the kernel API should look. This will include

  1. Mode setting primitives.
  2. GPU ring buffer management.
  3. Memory management of some kind.

Yes, this makes modesetting just an addition to the existing DRM drivers; they’re responsible for the GPU and memory management stuff, and the modesetting driver needs both of those bits to perform the operations listed above, so it cannot be ‘underneath’ the DRM driver. We welcome our new DRM overlords.

Posted Mon Feb 5 22:17:36 2007
randr 1.2 update

We’ve been working busily at RandR 1.2, both cleaning up the extension specification and trying to build an infrastructure that will help people get support into all of the video drivers. I thought I’d let people know where things stand and what remains to be done.

First off, things that are fairly stable now:

One piece that needs a few new features is the xrandr application. I just spent a day cleaning things up a bit (it could use more cleanup, especially splitting the source across multiple files). I added the ability to position outputs relative to one another and also a global ‘--auto’ mode which turns on all connected outputs and turns off all disconnected ones. With that, I hooked up a new global hotkey in metacity to run ‘xrandr --auto’ so I can just plug in a new monitor, hit the key and expect it to light up. There are still a few more tasks to take care of here:

On the driver front, we’re doing development in an unusual fashion (and we may regret it if we’re not careful). As our goal is to produce a single driver binary that will run against both 7.2 and newer X servers, we cannot depend on having any new functions in the server binary.

To avoid server dependencies, we’re building a bunch of new driver-independent functionality as if it were part of the server binary and linking that into the Intel driver. As much of this code started life as driver-dependent explorations of how to make RandR 1.2 work, it isn’t quite as independent as we’d like. We’ve done some piecemeal attempts to make it look a little better, but the result is actually fairly ugly at present with a mish-mash of function naming schemes and remnants of driver-dependencies in odd places.

Dave Airlie has been copying this code into the nouveau driver and hooking up RandR 1.2 support there. That’s great as it gives us a chance to make sure these new interfaces are right for more than one driver. I’m hoping someone will also take a look at how this will work in the radeon driver; with that, we’ll have a reasonably broad experience with the new interfaces and should be able to avoid nasty surprises down the road. Of course, drivers will still be able to completely by-pass this layer within the server, so at least we won’t make fixing it impossible, only painful.

I’ve also recently started on some xorg.conf support for output configuration. Right now, this consists of the ability to associate a Monitor section in the config file with each output of the device. From the monitor section, you can add new mode lines, specify DPMS support, override sync ranges and set a preferred mode.

We’ve also started adding some monitor ‘quirks’ to the EDID detection code; I got a trio of monitors from France that all had various incorrect data in their EDID blocks, including one monitor which reported a preferred mode of 640x350. I’d like to keep adding more quirks as we find broken monitors; that lets everyone share the same fixes. Of course, with the xorg.conf support now available, you can override most of the EDID data and work around things at run-time, but if you do have to do this, please submit a bug report and attach the broken EDID data (xrandr --prop will print it out for you).

Aside from general infrastructure cleanup, we’ve still got some features missing from the implementation that we’d like to play with:

And, of course, it would be fun to see some applications starting to use this, in particular KDE and Gnome both have screen size setting applets which could see some significant enhancements now.

Posted Sun Dec 31 23:53:05 2006
Startup RandR Configuration

Well, RandR 1.2 work is progressing apace; I can now reconfigure the X server in some fairly dramatic ways. I now regularly use the 1600x1200 monitor at home as extra desktop space for my laptop, growing the X root window to cover both monitors.

But, what I’ve lost in the process is any ability to configure the system at startup time. This is rather unusual; the system is now far more flexible through the RandR protocol than through the configuration file. Getting things set right at startup time seems important, as that will avoid flashing monitors as they change modes and other possible issues.

I started the process of allowing startup-time configuration by making the RandR 1.2 code permit object creation before the Screen objects were created. This allows the driver to create the necessary RandR 1.2 structures and use those to control the configuration process. This would work, except that I’d like the same driver to work in the absence of RandR 1.2 in the core server.

Back to the drawing board.

What I’m doing now is creating some new structures that map to the RandR 1.2 structures but which are hw/xfree86 specific and which don’t depend on RandR 1.2 in the core server. With the goal of eventually moving these into the hw/xfree86 portion of the server, these structures provide all of the RandR 1.2 semantics using smaller driver-specific methods. The code for this new work can then use these new data structures to configure the server at startup time, as well as on-the-fly at runtime using RandR 1.2.

The two primary data structures are the xf86CrtcRec and the xf86OutputRec:

struct _xf86Crtc {
    /**
     * Associated ScrnInfo
     */
    ScrnInfoPtr     scrn;

    /**
     * Active state of this CRTC
     *
     * Set when this CRTC is driving one or more outputs 
     */
    Bool        enabled;

    /**
     * Position on screen
     *
     * Locates this CRTC within the frame buffer
     */
    int         x, y;

    /** Track whether cursor is within CRTC range  */
    Bool        cursorInRange;

    /** Track state of cursor associated with this CRTC */
    Bool        cursorShown;

    /**
     * Active mode
     *
     * This reflects the mode as set in the CRTC currently
     * It will be cleared when the VT is not active or
     * during server startup
     */
    DisplayModeRec  curMode;

    /**
     * Desired mode
     *
     * This is set to the requested mode, independent of
     * whether the VT is active. In particular, it receives
     * the startup configured mode and saves the active mode
     * on VT switch.
     */
    DisplayModeRec  desiredMode;

    /** crtc-specific functions */
    const xf86CrtcFuncsRec *funcs;

    /**
     * Driver private
     *
     * Holds driver-private information
     */
    void        *driver_private;

#ifdef RANDR_12_INTERFACE
    /**
     * RandR crtc
     *
     * When RandR 1.2 is available, this
     * points at the associated crtc object
     */
    RRCrtcPtr       randr_crtc;
#else
    void        *randr_crtc;
#endif
};


struct _xf86Output {
    /**
     * Associated ScrnInfo
     */
    ScrnInfoPtr     scrn;
    /**
     * Currently connected crtc (if any)
     *
     * If this output is not in use, this field will be NULL.
     */
    xf86CrtcPtr     crtc;
    /**
     * List of available modes on this output.
     *
     * This should be the list from get_modes(), plus perhaps additional
     * compatible modes added later.
     */
    DisplayModePtr  probed_modes;

    /** EDID monitor information */
    xf86MonPtr      MonInfo;

    /** Physical size of the currently attached output device. */
    int         mm_width, mm_height;

    /** Output name */
    char        *name;

    /** output-specific functions */
    const xf86OutputFuncsRec *funcs;

    /** driver private information */
    void        *driver_private;

#ifdef RANDR_12_INTERFACE
    /**
     * RandR 1.2 output structure.
     *
     * When RandR 1.2 is available, this points at the associated
     * RandR output structure and is created when this output is created
     */
    RROutputPtr     randr_output;
#else
    void        *randr_output;
#endif
};

The hardware is manipulated through driver-specific functions contained in the xf86CrtcFuncsRec and xf86OutputFuncsRec:

typedef struct _xf86CrtcFuncs {
   /**
    * Turns the crtc on/off, or sets intermediate power levels if available.
    *
    * Unsupported intermediate modes drop to the lower power setting.  If the
    * mode is DPMSModeOff, the crtc must be disabled, as the DPLL may be
    * disabled afterwards.
    */
   void
    (*dpms)(xf86CrtcPtr     crtc,
        int             mode);

   /**
    * Saves the crtc's state for restoration on VT switch.
    */
   void
    (*save)(xf86CrtcPtr     crtc);

   /**
    * Restores the crtc's state at VT switch.
    */
   void
    (*restore)(xf86CrtcPtr      crtc);

    /**
     * Clean up driver-specific bits of the crtc
     */
    void
    (*destroy) (xf86CrtcPtr crtc);
} xf86CrtcFuncsRec, *xf86CrtcFuncsPtr;


typedef struct _xf86OutputFuncs {
    /**
     * Turns the output on/off, or sets intermediate power levels if available.
     *
     * Unsupported intermediate modes drop to the lower power setting.  If the
     * mode is DPMSModeOff, the output must be disabled, as the DPLL may be
     * disabled afterwards.
     */
    void
    (*dpms)(xf86OutputPtr   output,
        int         mode);

    /**
     * Saves the output's state for restoration on VT switch.
     */
    void
    (*save)(xf86OutputPtr       output);

    /**
     * Restores the output's state at VT switch.
     */
    void
    (*restore)(xf86OutputPtr    output);

    /**
     * Callback for testing a video mode for a given output.
     *
     * This function should only check for cases where a mode can't be supported
     * on the pipe specifically, and not represent generic CRTC limitations.
     *
     * \return MODE_OK if the mode is valid, or another MODE_* otherwise.
     */
    int
    (*mode_valid)(xf86OutputPtr     output,
          DisplayModePtr    pMode);

    /**
     * Callback for setting up a video mode before any crtc/dpll changes.
     *
     * \param pMode the mode that will be set, or NULL if the mode to be set is
     * unknown (such as the restore path of VT switching).
     */
    void
    (*pre_set_mode)(xf86OutputPtr   output,
            DisplayModePtr  pMode);

    /**
     * Callback for setting up a video mode after the DPLL update but before
     * the plane is enabled.
     */
    void
    (*post_set_mode)(xf86OutputPtr  output,
             DisplayModePtr pMode);

    /**
     * Probe for a connected output, and return detect_status.
     */
    enum detect_status
    (*detect)(xf86OutputPtr output);

    /**
     * Query the device for the modes it provides.
     *
     * This function may also update MonInfo, mm_width, and mm_height.
     *
     * \return singly-linked list of modes or NULL if no modes found.
     */
    DisplayModePtr
    (*get_modes)(xf86OutputPtr  output);

    /**
     * Clean up driver-specific bits of the output
     */
    void
    (*destroy) (xf86OutputPtr   output);
} xf86OutputFuncsRec, *xf86OutputFuncsPtr;
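
As a rough illustration of how driver-independent code might call through such a function table, here is a minimal sketch. The types are simplified stand-ins rather than the real server definitions, and `Mode`, `Output`, `OutputFuncs`, the `panel_*` functions, `probe_output` and the `OUTPUT_*` enum values are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the server types; not the real definitions. */
typedef struct Mode {
    int hdisplay, vdisplay;
    struct Mode *next;          /* singly-linked list, as get_modes() returns */
} Mode;

/* Hypothetical values; the listing above only names the enum. */
enum detect_status { OUTPUT_DISCONNECTED, OUTPUT_CONNECTED };

struct Output;

typedef struct OutputFuncs {
    enum detect_status (*detect)(struct Output *output);
    Mode *(*get_modes)(struct Output *output);
} OutputFuncs;

typedef struct Output {
    const char *name;
    const OutputFuncs *funcs;
    Mode *probed_modes;
    void *driver_private;
} Output;

/* Hypothetical driver side: a fixed panel that is always connected. */
static Mode panel_mode = { 1024, 768, NULL };

enum detect_status panel_detect(Output *output)
{
    (void)output;
    return OUTPUT_CONNECTED;
}

Mode *panel_get_modes(Output *output)
{
    (void)output;
    return &panel_mode;
}

const OutputFuncs panel_funcs = { panel_detect, panel_get_modes };

/* Driver-independent side: probe one output through its function table
 * and return the number of modes found. */
int probe_output(Output *output)
{
    int count = 0;

    assert(output != NULL && output->funcs != NULL);
    output->probed_modes = NULL;
    if (output->funcs->detect(output) == OUTPUT_CONNECTED)
        output->probed_modes = output->funcs->get_modes(output);
    for (Mode *m = output->probed_modes; m != NULL; m = m->next)
        count++;
    return count;
}
```

The point of the split is that `probe_output` never touches hardware itself; everything chip-specific sits behind the two function pointers.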

Right now, I’ve just hacked up the Intel driver internals using these new structures and have left the implementation a tangled mess with driver-specific, randr-specific and other code all tied together. Obviously this is not a long-term plan, but I want to change the data structures first, then split the code apart.

At this point, I’ve managed to get the Intel driver to compile with the new data structures, so I’ll first get it working then start cleaning up the implementation so that driver-independent code is clearly separated and performing as much of the work as possible.

Once driver-independent code is in place, I will add startup configuration to the driver-independent code. I’ll probably use the existing Radeon configuration file options as much as possible, and probably accept the Intel configuration file options as well. While those do not expose the full capabilities of the RandR 1.2 extension, I suspect they’re sufficient for most users.

It’s a long way around to get startup configuration for the new capabilities in the Intel driver, but I’m hoping it will allow us to create a common configuration language for all of the drivers and also remove the current dependence on RandR 1.2 from the Intel driver for much of this functionality.

Posted Sun Nov 26 17:28:51 2006
Autoconf in a RandR World

While busily rewriting the RandR extension and the Intel driver to match, we decided to tackle another major issue, what to do when the driver isn’t given any instruction in the config file about how to light up the screens. And, beyond that, how to make that configurable in reasonable ways while making it driver-independent.

The current scheme is a mish-mash of crufty ancient code and magic driver-specific hacks. The ancient driver-independent code doesn’t understand that a single video card can have multiple monitors, so the normal configuration mechanisms are mostly harmful. For the Intel driver, the mode that you specify in the configuration file is almost, but not entirely, ignored.

In the bad old days of BIOS-based mode selection, it would just use the specified mode to try and match some mode present in the BIOS. In the brave new world of native mode setting, we can at least use the mode provided directly, assuming the output is capable of using it. In either case, all screens were programmed with the same mode (more or less). Not very useful when you have a 1024x768 internal panel and a 1600x1200 external monitor.

Almost all of the i830 and later Intel chips have two “pipes”. Each pipe can be connected to a variety of “outputs” (where an output is effectively a connector, like a local LCD panel, or an external VGA connector). The “pipes” are important here because it takes a pipe to hold a specific mode, the pipe fetches data from the frame buffer and sends it to the outputs with the timing specified by the mode.

Now, the weird thing is you can sometimes connect multiple outputs to a single pipe. But, when you do that, each output gets exactly the same mode and sees exactly the same pixels out of the frame buffer. Plus, there are other restrictions, like you can’t share a pipe with the local LCD panel or the TV output on the 945. Whatever. We mostly ignore this at present because it’s not that useful, and it’s a pain to think about. Of course RandR supports it and will expose it to the user when it can work, but that’s not often.
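
The sharing restriction can be written as a small validity check. This is only a sketch: it assumes exactly two pipes, applies the "LCD panel and TV need a pipe to themselves" rule to every chip rather than just the 945, and uses made-up names (`OutputCfg`, `pipe_assignment_ok`, the `OUT_*` values) that do not come from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_PIPES 2

enum output_type { OUT_CRT, OUT_LVDS, OUT_SDVO, OUT_TV };

typedef struct {
    enum output_type type;
    int pipe;                   /* pipe index, or -1 when the output is off */
} OutputCfg;

/* Assumed rule: the local panel and the TV output cannot share a pipe. */
static bool needs_exclusive_pipe(enum output_type type)
{
    return type == OUT_LVDS || type == OUT_TV;
}

/* Returns true when no pipe is shared by an output that needs it alone. */
bool pipe_assignment_ok(const OutputCfg *outs, size_t n)
{
    int users[NUM_PIPES] = { 0 };
    bool exclusive[NUM_PIPES] = { false };

    assert(outs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int pipe = outs[i].pipe;

        if (pipe < 0)
            continue;           /* output is off */
        if (pipe >= NUM_PIPES)
            return false;       /* no such pipe */
        users[pipe]++;
        if (needs_exclusive_pipe(outs[i].type))
            exclusive[pipe] = true;
    }
    for (int p = 0; p < NUM_PIPES; p++)
        if (exclusive[p] && users[p] > 1)
            return false;
    return true;
}
```

With this rule, CRT and SDVO on pipe 0 pass as a clone pair, while LVDS and CRT on the same pipe are rejected.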

Ok, with that brief diversion into the oddities of the Intel graphics chip, let’s get back to configuration.

To allow the user to customize how modes were set, there were three parameters in the config file:

Option  "Clone"     "yes"
Option  "CloneRefresh"  "60"
Option  "MonitorLayout" "LFP,CRT"

The “Clone” option directed the driver to turn on two of the outputs. The “CloneRefresh” option specified the vertical sync rate for the “other” monitor. “MonitorLayout” gave the user precise control over which outputs are connected to which pipes.

The current BIOS-based driver on the master branch adds a bunch more:

Option  "MergeFB"   "yes"
Option  "MetaModes" "1024x768"
Option  "SecondHSync"   "80-130"
Option  "SecondVRefresh" "50-75"
Option  "SecondPosition" "RightOf"
Option  "MergedXinerama" "yes"

As you might guess, these all combine to let you place the second monitor somewhere other than right on top of the first monitor. Useful when you have two monitors on your desktop.

To confuse you further, the Radeon driver uses a different set of options to perform exactly the same function. Cool, huh? Even more fun is that these two drivers have completely different semantics of how to interpret the lack of these options. Makes building a configuration tool fairly challenging, and makes the chances that the user will get “random” results high.

So, the first thing to realize is that RandR 1.2 makes all of these things entirely configurable. But, not until you have the server running and can connect an X client. Bummer. With that, you’d have to put the desired configuration into a startup utility and you’d watch the screen flash a couple of times as you were logging in. Fun for some, but annoying for most.

Given that RandR has sufficient power to configure things after the server has started, and that this is expressed in a driver-independent fashion, it seems sensible to figure out how to use that information at startup time to make better choices and ease customization.

The first piece I’ve done is to replace the existing default mode selection logic with something a bit fancier and (I hope) more generally useful. After that, I’ll write up some replacement configuration options and use those to mutate this configuration.

For the initial default configuration (used when no options are present in the configuration file), I made some simplifying assumptions:

Given that I wanted to present the same data on every monitor, I started by picking a single monitor to control the size of the screen. For this, the code first looks for a monitor with a preferred mode; those are usually either the laptop LCD panel or an external DVI-connected LCD monitor. If no such monitor is present, the code picks a random monitor and selects a mode that will present data at about 96dpi. I think this makes more sense than picking the highest supported resolution as CRTs often advertise support for incredibly high resolutions that end up fuzzy and dim. Better to just pick a reasonable size by default and let the user change it after login.

Once the first mode is selected, all of the other monitors are set to modes that are close to that size.
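
The selection steps above might look roughly like the following sketch. It is not the driver code: the `Monitor` and `Mode` types and all function names are invented here, the "random monitor" is replaced by the first one so the result is repeatable, and "close to that size" is taken to mean the smallest sum of width and height differences, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int width, height; } Mode;

typedef struct {
    const Mode *modes;          /* modes the monitor advertises */
    size_t nmodes;
    const Mode *preferred;      /* NULL when no preferred mode is reported */
    int mm_width;               /* physical width in millimetres */
} Monitor;

static int iabs(int v) { return v < 0 ? -v : v; }

/* Mode whose horizontal density is closest to target_dpi. */
const Mode *mode_near_dpi(const Monitor *mon, double target_dpi)
{
    const Mode *best = NULL;
    double best_err = 0.0;

    assert(mon->mm_width > 0);
    for (size_t i = 0; i < mon->nmodes; i++) {
        double dpi = mon->modes[i].width * 25.4 / mon->mm_width;
        double err = dpi > target_dpi ? dpi - target_dpi : target_dpi - dpi;

        if (best == NULL || err < best_err) {
            best = &mon->modes[i];
            best_err = err;
        }
    }
    return best;
}

/* Mode closest in size to target (assumed metric: |dw| + |dh|). */
const Mode *mode_near_size(const Monitor *mon, const Mode *target)
{
    const Mode *best = NULL;
    int best_err = 0;

    for (size_t i = 0; i < mon->nmodes; i++) {
        int err = iabs(mon->modes[i].width - target->width) +
                  iabs(mon->modes[i].height - target->height);

        if (best == NULL || err < best_err) {
            best = &mon->modes[i];
            best_err = err;
        }
    }
    return best;
}

/* Mode that sets the screen size: a preferred mode if any monitor has
 * one, otherwise a mode near 96dpi on the first monitor. */
const Mode *pick_primary_mode(const Monitor *mons, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (mons[i].preferred != NULL)
            return mons[i].preferred;
    return n > 0 ? mode_near_dpi(&mons[0], 96.0) : NULL;
}
```

Once `pick_primary_mode` has chosen the controlling mode, every other monitor would get `mode_near_size(&monitor, primary)`.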

Finally, the list of monitors is used to compute the maximum screen size that should be permitted. Yes, we’re still stuck with allocating the frame buffer at server init time. This will, eventually, go away, but that’s related to rendering infrastructure which we’re ignoring this week. In any case, the maximum size is computed by figuring out how much space is needed to place all of the known monitors side-by-side. For outputs which don’t have any monitor connected, we just pretend that they’ll max out at 1600x1200. The end result is that there shouldn’t be any limits on which mode combinations can be used in clone or mergefb mode.
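
A minimal sketch of that size computation, assuming a purely horizontal side-by-side arrangement (widths add up, height is the tallest output). `OutputInfo` and `max_screen_size` are hypothetical names; only the 1600x1200 fallback for unconnected outputs comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool connected;
    int max_width, max_height;  /* largest mode of the attached monitor */
} OutputInfo;

/* Size assumed for outputs with no monitor connected. */
enum { FALLBACK_WIDTH = 1600, FALLBACK_HEIGHT = 1200 };

/* Space needed to place every output side-by-side, left to right. */
void max_screen_size(const OutputInfo *outs, size_t n, int *width, int *height)
{
    assert(width != NULL && height != NULL);
    *width = 0;
    *height = 0;
    for (size_t i = 0; i < n; i++) {
        int w = outs[i].connected ? outs[i].max_width : FALLBACK_WIDTH;
        int h = outs[i].connected ? outs[i].max_height : FALLBACK_HEIGHT;

        *width += w;            /* widths accumulate */
        if (h > *height)
            *height = h;        /* height is the tallest output */
    }
}
```

For a 1024x768 panel, a 1600x1200 monitor and one empty connector, this gives a maximum of 4224x1200.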

Right now, all of this code is down inside the Intel driver. I will pull the DIX-level functions up to the RandR code. What remains is to decide how to make the remaining code driver independent and (eventually) move it to the xf86 common layer where it can be shared across multiple drivers.

All of this work is available on the modesetting branch of the Intel driver when built against the randr-1.2-for-server-1.2 branch of the X server.

Posted Thu Nov 16 22:32:14 2006
Hacking 965 modesetting

What a pleasant weekend I’ve had; no nasty meetings or politics, just some good clean hacking fun.

I spent a few hours each day poking at the modesetting branch and getting it working on my shiny new 965-based desktop system. Eric had been working on the SDVO support and gotten that working, so I figured I’d at least get the CRT output working, which seemed like an easy enough task.

The BIOS-based modesetting code was already working, so I knew the hardware worked correctly. But, our existing CRT modesetting code was producing a nice black screen. I love modesetting code—the most common error indication is just ‘sadness’; the monitor remains black and indicates that there is no signal present on the wire.

I poked around looking at what the BIOS did and how that differed from what the modesetting driver did and made a small bit of progress thanks to an accident. I left the video clock programmed for 1600x1200 and then asked the server to display a 640x480 mode. One would expect this would leave the video mode running far too fast. But, to my surprise, the monitor happily locked onto this and reported that it was running at 85Hz. Weird. A bit of math and I discovered that somehow the clock was getting divided by 4 somewhere. Sure enough, simply multiplying the real clock by 4 left me with stable modes across a wide range of sizes. Unfortunately, not including the high-resolution modes loved by our users; those were now out of the reach of the programmable clock. But, it made the question of where the problem was a bit clearer—something was wrong with the clock register programming.

The next accidental discovery was that in pure clone mode, with both CRT and DVI connected to the same pipe, I also managed to see a working mode for the lower resolutions (640x480 and 800x600). Higher resolutions still failed. It’s important to note that the DVI connector is reached through the SDVO port, which must run at high frequencies. Low frequency modes are padded with junk and clocked faster to keep the bus stable. For 640x480 and 800x600 modes, the clock is multiplied by 4.
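
To make the multiplier idea concrete, here is a small sketch of choosing it from the mode's dot clock. The 100MHz and 50MHz thresholds are assumptions picked so the bus clock never drops below 100MHz; the text only establishes that 640x480 and 800x600 end up multiplied by 4. The function names are hypothetical.

```c
#include <assert.h>

/* Pick a multiplier so that dotclock * multiplier is at least 100MHz.
 * Thresholds are assumed, not taken from hardware documentation. */
int sdvo_pixel_multiplier(int dotclock_khz)
{
    assert(dotclock_khz > 0);
    if (dotclock_khz >= 100000)
        return 1;
    if (dotclock_khz >= 50000)
        return 2;
    return 4;
}

/* Clock actually run on the bus for a given mode dot clock. */
int sdvo_bus_clock_khz(int dotclock_khz)
{
    return dotclock_khz * sdvo_pixel_multiplier(dotclock_khz);
}
```

With these thresholds a 25.175MHz 640x480 mode runs the bus at 100.7MHz and a 40MHz 800x600 mode at 160MHz, both with a multiplier of 4.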

While I had looked at the register results for the BIOS mode setting, I hadn’t seen it in action. Fortunately, action shots were available.

Dave Airlie hacked up Matthew Garrett’s vbetest program and left it on here. This fine piece of work executes the video bios and monitors all of the video device register accesses it performs. Watching it live allowed me to see precisely which registers it thought were related to clock timing. I noticed that it set the DPLLAMD register when programming a pure CRT mode, something which seemed a bit odd to me as that register has rather vague documentation about UDI and SDVO outputs. But, one small sentence did mention CRT multipliers of some sort, so I figured I might as well give it a try. Stealing the same setting that it used, the CRT now locked nicely using the normal un-multiplied clock frequency, and worked across the whole range of modes.

Thanks Dave, thanks Matthew.

The final adventure for the weekend was to discover why my screen image was getting corrupted when I used a large frame buffer. The effect was quite mystic: contents written to one location in memory would be duplicated to many locations on the screen. I thought it might be FIFO size issues, but exploration with an application window and a window manager demonstrated that the corruption was not just on the screen, but actually visible to the GPU and CPU as well, and appeared to be caused by multiple GATT entries mapping to the same physical page. I eventually discovered that just skipping the first 256K of video memory and not using it fixed the problem; I haven’t looked into this in more detail, but it seems likely that those areas are actually mapped to the GATT table itself, and using them for other things caused the symptoms observed above. For now, I’ve just made the driver skip over that amount of memory; it fixes the problem I had.

I also spent a bunch of time shrinking the driver to eliminate a bunch of redundant state. We’re planning on moving all of the initial frame buffer configuration to common RandR code shared across drivers, so I went ahead and disabled the Intel-specific code in the driver. Yes, this means that there’s no way to configure screen layout when you start the server, but you can use the RandR extension afterwards to make it do whatever you like. It’s temporary; eventually we’ll get the common code working. This is probably the first time X has had this kind of state which can only be set through the protocol and not in the config file.

As Eric made a merge that broke things before he took off for the weekend (strong work, Eric), I’ve placed my work on a separate branch for merging this week sometime. Everything here is on the modesetting-keithp branch in the xf86-video-intel repository.

Posted Sun Nov 5 20:02:27 2006
Nokia 6131 Synchronization

While in Shanghai a few weeks ago, I picked up a Nokia 6131 telephone. Prices there are quite reasonable, and I was fed up with my Motorola Razr (the worst phone ever invented, as far as I can tell). Friends familiar with Nokia phones suggested I might prefer a Series 40 phone, which, while less feature-rich than the Symbian models, tends to run quite a bit faster. Shopping for telephones was quite easy; every model I’d ever heard of was available from multiple vendors. Someday maybe the US will rediscover the simple joy of providing what the customer wants.

In any case, once we had switched the 6131 from Chinese to English, I slipped my sim card into place and was happily conversing and taking pictures with the new toy.

Of course, one of the big goals in moving from the Razr to the Series 40 was to get synchronization between my Evolution contacts and calendar and the applications on the telephone. Not since my Treo 600 had I been able to see my schedule and access my whole phone book from my cell phone.

The 6131 putatively supports SyncML, but attempts to get the opensync SyncML plugin working ended in failure after much gnashing of teeth. The plugin would load, but the synchronization process would just hang without transferring a single phone number. Sigh. Clearly this phone doesn’t quite follow the same interpretation of the SyncML standard as the opensync plugin.

Finally, I started looking around for alternatives when I discovered that the Gnokii folks had created a gnokii opensync plugin using their Nokia-specific backend. Shockingly, there was no Debian package, so I downloaded the latest source and built it. Surprisingly enough, it appears to work fine.

Well, almost fine. I’ve got ‘a few’ contacts in my address book, and the simplistic gnokii code for locating a free address book slot was reading every address book entry looking for a free spot. Oddly, it spent a lot of time re-reading address book entries. That was easy to fix at least.
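
The text doesn’t say how the fix was made. One plausible way to avoid re-reading entries is to scan the address book once, remember which slots are occupied, and hand out free slots from that cached map. The sketch below uses invented names (`SlotCache`, `slot_reader`, `FakePhone`) and a fake phone in place of the real gnokii calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SLOTS 1024

/* Callback that asks the phone whether a slot holds an entry. */
typedef bool (*slot_reader)(int slot, void *ctx);

typedef struct {
    bool used[MAX_SLOTS];
    int nslots;
    int next;                   /* lowest slot that might still be free */
} SlotCache;

/* Read every slot exactly once and remember the result. */
void slot_cache_init(SlotCache *cache, int nslots,
                     slot_reader read_slot, void *ctx)
{
    assert(nslots >= 0 && nslots <= MAX_SLOTS);
    cache->nslots = nslots;
    cache->next = 0;
    for (int i = 0; i < nslots; i++)
        cache->used[i] = read_slot(i, ctx);
}

/* Return the next free slot without touching the phone, or -1 if full. */
int slot_cache_alloc(SlotCache *cache)
{
    while (cache->next < cache->nslots) {
        int slot = cache->next++;

        if (!cache->used[slot]) {
            cache->used[slot] = true;
            return slot;
        }
    }
    return -1;
}

/* Fake four-slot phone for illustration; counts how often it is read. */
typedef struct {
    bool occupied[4];
    int reads;
} FakePhone;

bool fake_read_slot(int slot, void *ctx)
{
    FakePhone *phone = ctx;

    phone->reads++;
    return phone->occupied[slot];
}
```

Adding any number of contacts then costs one pass over the address book instead of one pass per contact.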

Next, I discovered that the gnokii sync code wasn’t dealing with finite repeating events, events which repeat for a while and then stop. Every repeating event would go on forever. I use repeating events for conferences by setting them to repeat every day for the length of the conferences. I go to a few conferences each year, so with 10 years of conference history, I had several repeating events occurring ‘today’. I tried to make this work correctly; the phone appears to have a notion of ‘occurrences’, which I was guessing meant a count of repeat events. I didn’t manage to get this working, so I kludged around it and set these events to non-repeating, which at least removes them from view for the moment.

Of course, hacked versions of gnokii and gnokii-sync are available from my git repository.

Ah, life with a functioning telephone again. We’ll see how long this lasts.

Posted Tue Oct 31 00:54:45 2006