Ok, so I didn’t get a lot of time for coding last week. And, this week there’s OSCON, so coding time will be short again. I figured I should spend some time writing up a brief report about where X output is at today.
Output Hotplug
I think this stuff is fairly solid these days, although we don’t have much in the way of auto-detection of monitor connect/disconnect. There are two reasons here:
The hardware notifies the operating system via an interrupt. Given mode setting code in user space, dealing with interrupts is a huge pain and hence hasn’t been hooked up yet (see below).
Analog outputs (VGA, TV) do detection using impedance changes in the output signal path. This means we have to keep them active if we want to detect a connection. That takes a lot of power (about 1W to light up the VGA output without a monitor connected). What we could do is detect when a monitor was unplugged; that’s free.
There are a few other random improvements that are coming soon, like CEA additions to the EDID parsing code. These are additional data blocks that follow the standard EDID data and are used for ‘consumer electronics’ devices. Supporting these should make more HDMI monitors ‘just work’.
Initial Mode Selection
Detecting connected monitors is fine, but one thing we haven’t really solved is what to do when you have more than one connected when the server starts. My initial code would pick one ‘primary’ monitor, light that up at its preferred size and then pick modes for the other monitors which were as close as possible to the primary monitor size without being larger. Obviously, I liked that as it meant my laptop always came up looking correct on the LVDS and my external VGA would show most of the screen.
However, this was reported to confuse a lot of users. I can imagine that starting the X server with one of the outputs connected but not turned on would make for some ‘interesting’ support calls. So, now the X server picks a mode which all outputs can support and uses that everywhere. Sadly, this means that my laptop panel gets some random scaled mode (usually 1024x768) which looks quite awful.
I think we need something better than either of these choices, but I’m not quite sure how it should work.
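To make the two policies concrete, here is a minimal sketch of both. It is not the X server's code: the mode lists are invented, the function names are hypothetical, and it assumes each output lists its preferred mode first.

```python
from typing import Dict, List, Optional, Tuple

Mode = Tuple[int, int]  # (width, height) in pixels


def largest_common_mode(outputs: Dict[str, List[Mode]]) -> Optional[Mode]:
    """Current policy: the largest mode that every output lists, or None."""
    mode_sets = [set(modes) for modes in outputs.values()]
    if not mode_sets:
        return None
    common = set.intersection(*mode_sets)
    if not common:
        return None
    # Largest area wins; width breaks ties.
    return max(common, key=lambda m: (m[0] * m[1], m[0]))


def primary_then_fit(outputs: Dict[str, List[Mode]],
                     primary: str) -> Dict[str, Optional[Mode]]:
    """Earlier policy: the primary output gets its preferred mode (assumed
    to be listed first); every other output gets its largest mode that is
    no larger than the primary in either dimension."""
    pw, ph = outputs[primary][0]
    chosen: Dict[str, Optional[Mode]] = {primary: (pw, ph)}
    for name, modes in outputs.items():
        if name == primary:
            continue
        fitting = [m for m in modes if m[0] <= pw and m[1] <= ph]
        chosen[name] = max(fitting, key=lambda m: m[0] * m[1]) if fitting else None
    return chosen


# Invented example: a laptop panel plus an external VGA monitor.
example_outputs = {
    "LVDS": [(1400, 1050), (1024, 768), (800, 600)],
    "VGA": [(1280, 1024), (1024, 768), (800, 600)],
}
```

With these invented lists, the common-mode policy lands on 1024x768 for both outputs (the scaled-panel case described above), while the primary-first policy keeps the panel at 1400x1050 and gives the VGA output 1280x1024.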
Kernel Mode Setting
A bunch of people, including Jesse Barnes and Dave Airlie, have been hacking to move the output configuration code into the kernel. This will solve lots of little problems, like how to display kernel panic messages, and how to deal with interrupts for output hotplug.
This code is up and running fairly well these days, but depends on a kernel memory manager to deal with frame buffers. The integration of GEM into the kernel is blocking this work, but I’m hopeful that this will be sorted out in the next couple of weeks.
GEM — the Graphics Execution Manager
Work here was stalled for a few weeks while we sorted out memory channel interleaving issues. Now things are moving again, and we’re working on getting it stable enough to merge into master. That means fixing a few more critical bugs that the Intel QA team has identified.
One of these bugs is that our GL conformance tests weren’t working right; that turned out to be caused by tests reading back data from the frame buffer one pixel at a time. Our read-back path passed through the GEM memory domain code to pull objects back from GTT space to CPU space. That meant flushing the front, back and depth buffers from the CPU cache. With each of those at 16MB, reading a single pixel took long enough that the tests would time-out. Increasing the timeouts to ‘way too long’ is making them run, but tests which would complete in a few hours are now taking days.
We’ve got two different plans for fixing the read-back path:
Use pread to access precisely the data we need. This would involve flushing a single cache line for the tests above.
Mapping the back buffer through the GTT. This would eliminate the need to clflush anything as the GTT mappings are write combining and so reads bypass the cache.
Eric is working on the former, and I’m working on the latter. More news later (this week?) when we see which one wins.
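A bit of back-of-the-envelope arithmetic shows why the single-pixel read-back hurts so much. The three 16MB buffers come from the description above; the 64-byte cache line and the 4-byte pixel are assumptions, so treat the numbers as illustrative only.

```python
CACHE_LINE_BYTES = 64             # assumption: typical x86 cache line size
BUFFER_BYTES = 16 * 1024 * 1024   # each of front, back and depth (from the text)
BUFFER_COUNT = 3
PIXEL_BYTES = 4                   # assumption: 32-bit pixels


def cache_lines_covering(nbytes: int, line: int = CACHE_LINE_BYTES) -> int:
    """Number of cache lines needed to cover nbytes of line-aligned data."""
    return -(-nbytes // line)  # ceiling division


# Domain-based read-back: every line of all three buffers is flushed.
full_flush_lines = BUFFER_COUNT * cache_lines_covering(BUFFER_BYTES)

# pread-style read-back: only the line holding the requested pixel.
single_pixel_lines = cache_lines_covering(PIXEL_BYTES)

print(full_flush_lines, single_pixel_lines)  # 786432 1
```

Under those assumptions, each one-pixel read costs roughly 786432 cache line flushes on the current path against one on the pread path; the GTT mapping plan avoids the flush entirely.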
Composite Acceleration
With Owen Taylor’s change to the glyph management code in the server, Eric and Carl were able to change the driver to batch multiple glyph drawing operations into a command buffer. Once Carl had this working, we went from 13000 glyphs/sec to 103000 glyphs/sec. Obviously we’re hoping for even larger improvements as a pure software solution is well over 1 million glyphs/sec. Even still, 103000 glyphs/sec is enough to make my desktop vastly more usable, and using the software path means losing a lot of other useful acceleration.
DRI2 — Redirected Direct Rendering
Right now, direct rendered GL applications (which is the fastest way we can do GL at present) get drawn to a giant screen-sized back buffer and then copied from there to the screen at swap buffers time. Because everyone shares the same back buffer, you get to clip your drawing as if you were drawing directly to the screen. While this normally doesn’t matter much (aside from some performance costs associated with lots of clip rectangles), when you’re running a compositing manager (like compiz), the 3D applications end up ignoring the per-window offscreen pixmap and spam their output directly to the real frame buffer.
DRI2, written by Kristian Høgsberg, solves this by changing how direct rendering works and giving everyone a private back-buffer to draw to. Now, at buffer swap time, that private back-buffer can be copied to the window’s pixmap and compiz is happy.
This work has been around for a few months, but depends on a TTM-based memory manager. That dependency isn’t very strong, and krh has promised to fix it shortly. Once that’s done, getting the GEM driver to support DRI2 won’t take long, and we’ll have our fully composited desktop running. With luck, that’ll happen before September.
Final Words
As you can see, we’re nearing the end of our long X output rework saga, with most of the pieces falling into place in the next month or two.
Posted Mon Jul 21 18:25:43 2008
With our channel-interleaving mostly sorted out, Eric and I spent a short time figuring out what to do about it all. The important thing we learned is that the hardware has two modes: linear and tiled. The CPU always accesses memory in linear mode, while the GPU can either use linear or tiled mode. What is important here is that ‘linear’ mode uses the same interleaving, whether from the GPU or CPU. So, we can only get into trouble if we use tiled mode from the GPU and linear mode from the CPU.
We came up with a fairly simple plan to resolve this issue:
Figure out how to automatically detect the precise interleaving configuration of the system. There is a MCH BAR holding the data, and so far most (but not all) machines appear to report what we expect.
Add an IOCTL to propose tiling to the kernel, letting the kernel reject the proposal. This gives us a hook to pass back whatever channel interleaving is necessary. When the memory configuration cannot be supported in tiling mode, the call fails and user mode reverts to linear allocations. This will reduce performance, so we want to do it as little as possible.
A side benefit here is that we get a reliable way of saving tiling data for each buffer — until now, we’ve stored that in the sarea for front/back/depth, but hadn’t any general plan.
For now, fail tiling requests on hardware that uses bit 17 in linear mode, or hardware with an “L”-shaped memory configuration. Of course, it would be nice if we could find an “L”-shaped machine to test with.
Add return data to this ioctl to report what address bits to stir into the channel select (bit 6) value. Now user-space needn’t change when the hardware configuration does.
Provide a way to map buffers through the GTT using the fence registers to de-tile them. This will make tiled buffers appear linear to the CPU. This is required to let us tile X pixmaps as we would otherwise need to use wfb instead of fb for software rendering. We’ve wanted to do this for a while; it’s nice to have the option of using GTT mappings in any case — writes are less hassle as the WC mapping doesn’t require explicit cache flushing. Giving applications the option seems like the best way forward. Of course, when the GTT fills, these mappings will go away. When the application touches a page, it will fault the object back into the GTT.
Maybe someday provide enough information back to user space to deal with page-level interleaving information. For bit-17 configurations, we’d need to report back the physical bit-17 information for each page. For “L”-shaped configurations, we’d have to return back whether each page was interleaved at all. Attempting to make this work with paging seems really hard though — you’d have to create some kind of atomic section where user-space would read the interleave information, compute pixel addresses and access the data. Icky.
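The propose-and-reject part of the plan can be sketched as a small decision function. This is not the real ioctl: `propose_tiling` and its parameters are hypothetical, and the bit lists assume the GPU stirs bits 9 and 10 for X tiling and bit 9 for Y tiling, with bit 11 added when the CPU-facing memory controller uses it.

```python
from typing import Optional, Tuple


def propose_tiling(tiling: str,
                   cpu_stir_bit: Optional[int],
                   l_shaped: bool) -> Optional[Tuple[int, ...]]:
    """Hypothetical stand-in for the proposed ioctl.

    Returns the address bits user space should XOR into bit 6, or None
    when the kernel rejects tiling and the caller must allocate linear.
    """
    if tiling not in ("X", "Y"):
        raise ValueError("tiling must be 'X' or 'Y'")
    # Rejected for now: bit-17 swizzling and "L"-shaped configurations.
    if l_shaped or cpu_stir_bit == 17:
        return None
    bits: Tuple[int, ...] = (9, 10) if tiling == "X" else (9,)
    if cpu_stir_bit == 11:
        bits += (11,)
    return bits


def choose_layout(tiling: str, cpu_stir_bit: Optional[int], l_shaped: bool) -> str:
    """User-space side: fall back to linear when the proposal fails."""
    bits = propose_tiling(tiling, cpu_stir_bit, l_shaped)
    return "linear" if bits is None else f"{tiling}-tiled, stir bits {bits}"
```

The point of returning the bit list rather than a yes/no answer is that user space never hard-codes the swizzle; it applies whatever the kernel reports.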
On Wednesday and Thursday, Jesse Barnes, Zou Nan Hai and I got to attend the 2008 Intel Gfxcon. Put on by Intel GDG (no, I don’t know what that means, but it’s the Intel integrated graphics group), it was held in Folsom, CA. The trip down on the Intel shuttle was uneventful, but on arrival I found the 42° air full of smoke from wild fires. My plan to bicycle from the hotel to Intel suddenly seemed like it would be a lot less fun.
The conference was huge fun. As usual, meeting people in person is always better than via email or even on a teleconference. After working at Intel for nearly three years now, I’m starting to feel a bit more a part of the organization and less of an outsider. Linux continues to gain mindshare within the company as it gains visibility in our customers’ products.
I got to present our current GEM work to a fair-sized crowd, including several people from our Windows driver development team. I was interested to hear what they thought about the architecture and was pleased to learn that a lot of what we’re doing is similar to how the Vista driver works. While we can’t share source code, it is at least nice that we can share ideas about how best to drive the hardware at the lowest levels of the system.
That evening, Jesse, Nan Hai and I managed to find decent steak-and-potatoes, but our attempt to locate gelato ended in near-failure — Google led us to a mini-mall in a neighboring town. Upon failing to locate the expected restaurant, we enquired in an Italian place who first told us that the gelato place had closed five years earlier, and then convinced us that they could provide gelato. Served freezer-burnt ice cream, we quickly left for our respective lodgings.
The next evening we found acceptable California-style Mexican food. As usual, portions were large enough to push any thought of dessert from our minds. We lumbered back to Nan Hai’s hotel room and hacked for several hours, although the free wifi there left a lot to be desired. Then, we chatted about how to get rid of tearing in textured video.
For vblank synchronized textured video, I’m hoping we’ll be able to queue the update to the kernel and have it perform the necessary blts. This would mean interrupting the current command stream and switching over to a separate stream. It probably means using separate hardware contexts, which would be a good thing in any case as we could eliminate the per-batch-buffer configuration that the 3D driver currently performs. Work on this will need to start with multiple hardware context support, then move on to interrupting the ring and then figuring out how to manage the blts along with clip list changes etc.
Posted Mon Jul 14 22:19:41 2008
Eric and I have been busy hacking away at GEM, the Graphics Execution Manager. GEM is a memory and command ring manager for Intel integrated graphics. GEM itself is working out quite well; we haven’t found any terrible surprises in that area. I thought I’d take a few minutes and write up some of the things we’ve discovered and what we’re doing.
Kernel Patches
First off, I’ve published the GEM kernel patches so that anyone can give this code a try. The first patch was so trivial (exporting shmem_file_setup) that I didn’t bother, but the patch to export shmem_getpage is quite a bit longer as it also has to export an associated enum.
There’s also a patch for the agp driver which re-writes the GATT on resume. That one isn’t GEM-specific, but as we assume the GATT is preserved across VT switches, it’s necessary to make GEM survive suspend/resume. It has also been accepted into -mm and should land upstream sometime soon.
Writing data to the GPU
One of the central ideas in GEM is the recognition that cache management plays a huge role in moving data between the CPU and GPU. Because the CPU and GPU are not cache coherent, applications must either use uncached writes from the CPU or explicitly flush the CPU cache to get data transferred. There are several different ways of doing uncached writes:
Uncached page table entries. The requirement here is that all mappings to this page must be uncached, so you can’t simply create an uncached mapping when you want writes to be flushed out. The page must be flushed from all CPU caches at allocation time. Worse — the TLBs of all CPUs must be synchronized so that everyone agrees on the caching mode for every page. While flushing the CPU caches isn’t terrible as we can use clflush, the TLB flush requires an IPI, making the allocation of uncached pages very expensive.
Writing through a suitable MTRR-mapped address. This is how we’ve always done writes to the GPU in the past — the graphics aperture is covered by write-combining MTRR entries so that writes will be sent to memory. This requires that the destination pages be mapped through the GATT so that they appear under the graphics aperture, which (again) requires that the page contents be flushed from CPU caches. However, TLB entries needn’t be flushed as we aren’t changing any PTEs.
Non-temporal stores (movnti, movntq, etc). The kernel already uses these when copying data around to avoid filling the cache with useless data. However, non-temporal stores don’t actually guarantee that data won’t end up sitting in the cache. In particular, if the destination is already sitting in a cache line, then the store will not force that cache line to be flushed. So, while this avoids filling the cache with a lot of additional data, it doesn’t provide the necessary guarantee that data will be visible to the GPU.
Using clflush. This makes sure all CPU caches are flushed and the data written to memory. clflush isn’t cheap, but as it uses the cache coherence protocols, it need only run on one CPU. Combine this with non-temporal stores and you get a fairly cache-friendly mechanism without the cost of uncached page allocation.
If the kernel offered a cheap way of allocating uncached (or write combining) pages (presumably by constructing a pool of pages ready for uncached use), that might be interesting. However, uncached writes also means uncached reads, and sometimes we do read data back from the GPU. So, we’re ignoring this option at present.
Writing data through a write-combining MTRR has always worked reasonably well; there were older processors for which WC writes were slower than WB, but that (fortunately) is no longer true. The big drawback in using the MTRR is that we must allocate a portion of the limited GATT for objects that we want to access in this way. If we want to swap these pages, or need GATT space for other objects we would have to remove them from the GATT. The performance issue here (again) is that reads through a WC mapping are uncached, which means dramatically lower performance.
So, what does all of this mean in the GEM context?
First off, we want to try and treat CPU->GPU data transfers as I/O instead of memory-mapping. This means knowing precisely what data are being moved and when that happens. If we map objects with caching enabled to avoid GATT fun and improve read performance, then when those objects are moved back to the GPU, we must assume that they are entirely dirtied and flush the whole object. Treating this as I/O means having the kernel do all of the writes, which allows all kinds of flexibility on mapping.
Secondly, for operations that can’t easily be treated as I/O, it means making explicit choices about where to map objects. When reading an object from the CPU (as when using texture data), we certainly don’t want to use an uncached mapping — if that object isn’t written by the CPU, then we needn’t flush when switching back to the GPU either. However, for rendering targets as large as the frame buffer, mapping them cached means performing an enormous clflush sequence when moving back to the GPU, so we probably need to make the GATT-based WC mappings work. Currently, GEM doesn’t manage this — all objects are mapped cached, so software fallbacks end up doing a lot of cache flushing.
Using the I/O model to write data from user space into buffers for the GPU leaves us with some flexibility in the kernel implementation. We’ve tried two different mechanisms and I’m working on a third:
Use the existing pwrite code from the shmem file system. Follow that with calls to clflush when the object is mapped to the GPU. This works out very well when the batch buffers are full, but partially filled buffers end up causing unnecessary clflush calls. Also, the clflush requires an extra kmap_atomic/kunmap_atomic pair.
Map the object to the GATT and then use kmap_atomic_prot_pfn to map pages transiently into kernel space. This gives us WC write performance, eliminating any need to use clflush, but it abuses kmap_atomic_pfn, a function which is only really supposed to be given physical memory pages. On kernels with CONFIG_HIGHMEM set, it works out fine, but without that, you get a garbage PTE. Performance here is quite a bit better, eliminating flushing from profiles.
Hand-code the pwrite function to map the pages, copy the data and flush the cache all in one step. I’m hopeful that this will end up as fast as the GATT-based scheme, but avoid the abuse of the kmap_atomic_pfn function.
The first scheme exposed the flushing as an expensive operation; profiles for typical games would have flushing taking 5-7% of the CPU. The second scheme eliminated that, raising performance and lowering CPU usage. We’ll see if the third scheme is successful; if not, we’ll have to lobby the kernel developers to give us a supported way of transiently mapping I/O devices from kernel space.
Tiling and memory channels
Mapping graphical objects in a linear frame buffer where the data for each scanline is neatly arranged together in memory is an obvious representation for the data; it makes constructing scan-out hardware easy, and also makes writing software rendering code easier as well. Unfortunately, graphical objects generally span adjacent portions of many scanlines. Accessing memory in this order generally runs counter to memory architectures; a vertical line will end up writing one pixel in one cache line of one page. You end up spending a huge amount of time reading/writing cache lines and refilling TLB entries.
The usual solution to this is to have a single page hold pixels for multiple scanlines in the same region of the screen. Tiling the screen with these blocks of pixels provides dramatic performance improvements (we see about a 50% performance improvement from tiling the back buffer). Intel hardware supports two different tiling modes. With ‘X’ tiling, each page forms a rectangle that is 512 bytes wide by 8 scanlines high. ‘Y’ tiles are 128 bytes wide by 32 scanlines high.
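Here is a small sketch of what X tiling does to addresses. The tile dimensions come from the text; the row-major layout inside a tile, the row-major tile ordering and the 4096-byte pitch in the example are assumptions made for illustration.

```python
PAGE_BYTES = 4096
X_TILE_WIDTH_BYTES = 512
X_TILE_HEIGHT_ROWS = 8

# Both tile shapes described above fill exactly one 4KB page.
assert X_TILE_WIDTH_BYTES * X_TILE_HEIGHT_ROWS == PAGE_BYTES
assert 128 * 32 == PAGE_BYTES  # Y tile


def linear_offset(x_bytes: int, y: int, pitch_bytes: int) -> int:
    """Byte offset of (x, y) in a plain scanline-ordered buffer."""
    return y * pitch_bytes + x_bytes


def x_tiled_offset(x_bytes: int, y: int, pitch_bytes: int) -> int:
    """Byte offset of (x, y) in an X-tiled buffer (assumed layout)."""
    if pitch_bytes % X_TILE_WIDTH_BYTES:
        raise ValueError("pitch must be a multiple of the tile width")
    tiles_per_row = pitch_bytes // X_TILE_WIDTH_BYTES
    tile_col, in_x = divmod(x_bytes, X_TILE_WIDTH_BYTES)
    tile_row, in_y = divmod(y, X_TILE_HEIGHT_ROWS)
    tile_index = tile_row * tiles_per_row + tile_col
    return tile_index * PAGE_BYTES + in_y * X_TILE_WIDTH_BYTES + in_x


def pages_touched(offsets) -> int:
    """Count the distinct 4KB pages hit by a set of byte offsets."""
    return len({off // PAGE_BYTES for off in offsets})


# An 8-pixel-tall vertical line at x = 0 with a 4096-byte pitch.
PITCH = 4096
linear_pages = pages_touched(linear_offset(0, y, PITCH) for y in range(8))
tiled_pages = pages_touched(x_tiled_offset(0, y, PITCH) for y in range(8))
print(linear_pages, tiled_pages)  # 8 1
```

In the linear layout each scanline of the vertical line lands on a different page; in the tiled layout all eight stay in one.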
A separate, but related issue is dealing with multiple memory channels. To see maximum memory bandwidth, the system needs to interleave access between memory channels. The memory system is arranged so that successive cache lines come from alternate memory channels, which means that address bit 6 ends up being the ‘channel select’ bit. This is related, because tiled graphics breaks the assumption about sequential access — walk down a tiled buffer and you would hit the same memory channel each time.
To fix this, the hardware actually modifies address bit 6 using other portions of the address. For X tiling, it XORs in bits 9 and 10 of the address when computing bit 6; this means that vertically adjacent pixels are always in alternate memory channels. Y tiling uses only bit 9, but the pixels are already stirred around in that format enough that this one bit suffices.
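The bit-6 stirring described here is just an XOR, which a few lines can show. The helper names are made up for this sketch; only the bit numbers come from the text.

```python
def address_bit(addr: int, n: int) -> int:
    """Return bit n of addr as 0 or 1."""
    return (addr >> n) & 1


def swizzle_bit6(addr: int, stir_bits) -> int:
    """XOR the listed address bits into bit 6 and return the new address."""
    flip = 0
    for n in stir_bits:
        flip ^= address_bit(addr, n)
    return addr ^ (flip << 6)


X_TILE_STIR = (9, 10)  # X tiling stirs in bits 9 and 10
Y_TILE_STIR = (9,)     # Y tiling stirs in bit 9 only

print(hex(swizzle_bit6(0x200, X_TILE_STIR)))  # 0x240: bit 9 set, bit 6 flips
print(hex(swizzle_bit6(0x600, X_TILE_STIR)))  # 0x600: bits 9 and 10 cancel
print(hex(swizzle_bit6(0x400, Y_TILE_STIR)))  # 0x400: bit 9 clear, no change
```

Because none of the stirred bits is bit 6 itself, applying the same swizzle twice gives back the original address, so one function serves for both swizzling and unswizzling.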
The CPU doesn’t share in this particular adventure, so when it accesses these objects directly (not through the GATT), it sees things mixed around.
Of course, sometimes the hardware doesn’t bother swizzling bit 6 like this; if you have only a single memory channel, it doesn’t help. But, neither does it hurt, so some hardware will swizzle even in this case. We haven’t found any registers that tell us when the swizzling is going on.
Not to be left out of the bit 6 fun, the CPU-facing memory controller also improves interleaving by mixing bits up. It can either stir in bit 11 or bit (uh-oh) 17. At least this behavior is documented and visible in a register visible to the CPU. Bit 11 is workable; we just stir that into the mix when computing bit 6 to unswizzle before the memory controller re-swizzles and things work out fine. Bit 17 is problematic. It’s not a virtual address bit, it’s a physical address bit. Which means that the physical memory layout of data stored in RAM depends on where in memory a page is sitting. Move the page around so that bit-17 changes and the data will flip channels.
Of course, as the GPU does its own bit-6 swizzling for tiled objects, it doesn’t bother with the CPU memory swizzle. Which means that tiled data written by the CPU and read by the GPU will appear to flip around, depending on where in physical memory that data resides.
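A short sketch shows why bit 11 is manageable and bit 17 is not. The two physical page addresses are invented, and `cpu_channel_select` is a hypothetical model of the memory controller, not a register description.

```python
PAGE_MASK = 0xFFF  # offset within a 4KB page


def cpu_channel_select(phys_addr: int, stir_bit: int) -> int:
    """Channel chosen for phys_addr when the memory controller XORs
    stir_bit into bit 6 (simplified model)."""
    return ((phys_addr >> 6) & 1) ^ ((phys_addr >> stir_bit) & 1)


# The same byte of the same page, placed at two physical locations that
# differ only in bit 17 (addresses invented for the example).
OFFSET_IN_PAGE = 0x040
PAGE_A = 0x00100000  # bit 17 clear
PAGE_B = 0x00120000  # bit 17 set

for stir in (11, 17):
    a = cpu_channel_select(PAGE_A | OFFSET_IN_PAGE, stir)
    b = cpu_channel_select(PAGE_B | OFFSET_IN_PAGE, stir)
    print(f"stir bit {stir}: page A -> channel {a}, page B -> channel {b}")
```

Bit 11 lies inside the page offset, so the result depends only on where the byte sits within its page and can be predicted from a virtual address. Bit 17 belongs to the page frame number, so moving the page changes the channel.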
All of these bit-6 adventures are holding up GEM development at present; software fallbacks reading or writing to tiled objects are quite broken on many machines, and the way they’re broken depends on how the CPU and GPU memory controllers are set up.
We already have code that does the GPU swizzling and only need to add auto-detection to know when to use it. But the bit-17 CPU swizzling may cause some significant problems. First off, we’d have to make all CPU access to tiled objects go through the GATT, hurting read performance and complicating our mapping code — user mode doesn’t know anything about physical addresses and so couldn’t swizzle. Secondly, we would have to find some way to ensure that bit 17 of all tiled pages didn’t change across swap operations (as swapping will read and write through the CPU swizzle). That would either mean pinning tiled objects in memory (ouch), or hacking up the kernel memory manager to add a very strange constraint on page allocation, or swizzle pages before mapping them to the GATT.
Posted Fri Jul 4 13:25:03 2008
I mentioned in passing during my Linux.Conf.Au talk that we were looking at moving portions of the video drivers into the kernel. Others, including Alan Coopersmith, have started to chime in on this note and I thought I should write a bit more about what I was thinking.
At least a few of the goals are as follows:
- BIOS-free Suspend/Resume support
- Text-mode panic
- Flicker-free graphical boot
- Fancy animated user switching
A significant non-goal here is an in-kernel unified API for graphics acceleration. Our current high-level architecture for rendering is in good shape, with user-mode filling a ring buffer with hardware-specific commands and the kernel just dispatching buffers.
Ok, so as you can see from the list above, I’m just talking about device detection, configuration and video mode selection.
Why this sudden desire to move video mode selection into the kernel? Well, for years, we were told by several video card vendors that there was “no way” we could ever figure out how to do mode setting without using the BIOS. If you believe you must use the BIOS, it’s difficult to see how that can be done from within the kernel; executing the BIOS requires either an x86 emulator or vm86 support, neither of which belong in the kernel. For a long time, I believed the video chip vendors. Silly me.
Several people, including Luc Verhaegen, had been moving BIOS-based drivers towards native mode setting. While the BIOS was a nice crutch, real support requires us to follow other operating systems and program the hardware directly. It’s the only way to get at the full range of hardware capabilities, although it does require a lot more code. And, some machines will not quite work right until magic tweaks are added. On balance, the number of machines fixed is greater than the number of machines broken (by a huge margin, given the lack of BIOS support for the native panel mode in many laptops).
Now, if you have to have BIOS support for video mode selection, you have to wait for usermode to wake up before you can program the video card. There’s not a lot you can do early in the kernel, without some joyous adventure involving initrd devices that include a complete X server. Ick.
Once you embrace native mode setting, suddenly the range of options opens up and you can think about what might be possible if this code were moved down into the kernel. Of course, the first thought is how little you can move.
So, with that, we can start enumerating what capabilities must be in kernel mode to solve the list of problems above.
First, to get BIOS-free suspend/resume support working, we already require the ability to save and restore the entire graphics state, including synchronizing with applications performing graphical operations. For video systems involving external chips (most, these days), this also means saving and restoring the state of those chips.
While we may mock Windows and the BSOD, many people would be far happier with that than the current state of an Xorg system where kernel panics are not announced at all; instead, the screen simply freezes and the user has no idea what has happened. Even the ability to escape from this state is limited to those with the secret handshake knowledge. Getting back to a simple text mode and displaying the panic message requires the ability to program a fixed mode from within the kernel, and the ability to lock out other users of the graphics device (perhaps waiting for the hardware pipeline to drain, if necessary).
Eliminating the screen flashing during boot-up will require the ability to set the desired final graphical mode as soon as the kernel is running (or, even from the boot loader?). I suggest that the precise target mode can be computed during system installation time (and changed for next boot by the user session), so this only requires the ability to read the desired mode configuration from somewhere at boot time. Kristian Høgsberg has been working on an early switch to the final graphical mode, but hasn’t managed to eliminate the flashing.
I envision fancy user switching being implemented by a transition from one X server to another rather than trying to get two user sessions running inside the same X server. Right now, we switch users with a VT switch where the first session switches back to text mode and the second switches back to graphical mode. Flash, Flash. Having the kernel recognise that the two X servers were using the same mode will eliminate the flashing nicely; the only remaining issue is how to animate from one session to the next. I think that’s largely a matter of being able to allocate multiple front buffers and creating an animation client that uses two X server front buffer images to construct the intermediate representations.
Finally, building a kernel API for all of this has become possible because we’ve all come to recognise that there are commonalities in mode selection across video hardware. The Intel driver had some separation between CRTC and output when we started working on it. Luc’s Via driver has been moving this direction for some time. While the Radeon driver uses a different approach (it sets everything on the card every time you touch anything), it too recognises the distinction between CRTC and Output. Matthew Tippett suggested that even the closed source ATI driver worked this way as he, along with Kevin Martin, redesigned RandR 1.2 to work this way.
Because we now have a reasonably common abstraction across a wide range of drivers, it now seems tractable to produce a common API. We’ve already started this inside the X server itself; the hope is that working on the common layer there will inform choices about how the kernel API should look. This will include
- Mode setting primitives.
- GPU ring buffer management
- Memory management of some kind.
Yes, this makes modesetting just an addition to the existing DRM drivers; they’re responsible for the GPU and memory management stuff, and the modesetting driver needs both of those bits to perform the operations listed above, so it cannot be ‘underneath’ the DRM driver. We welcome our new DRM overlords.
Posted Mon Feb 5 22:17:36 2007
We’ve been working busily at RandR 1.2, both cleaning up the extension specification and trying to build an infrastructure that will help people get support into all of the video drivers. I thought I’d let people know where things stand and what remains to be done.
First off, things that are fairly stable now:
RandR 1.2 protocol specification. This hasn’t changed in over a month now and it’s looking like we’re finished. The XCB binding work cleaned up a few mistakes, but no semantic changes. The master branch now includes 1.2.
Xrandr library. Except for failing to copy a structure field from the GetCrtcInfo reply into the application structure, nothing has changed here in over a month either. Here as well, the master branch includes 1.2 support.
X server RandR 1.2 support. Two minor changes in the last month on the randr-1.2-for-server-1.2 branch. I’m not quite sure how we’ll want to release things, but I’m hoping to do a 1.2 ABI-compatible release shortly after 7.2 that includes RandR 1.2 support and can use old or new drivers (of course, old drivers won’t offer any snazzy 1.2 features).
One piece that needs a few new features is the xrandr application. I just spent a day cleaning things up a bit (it could use more cleanup, especially splitting the source across multiple files). I added the ability to position outputs relative to one another and also a global ‘--auto’ mode which turns on all connected outputs and turns off all disconnected ones. With that, I hooked up a new global hotkey in metacity to run ‘xrandr --auto’ so I can just plug in a new monitor, hit the key and expect it to light up. There are still a few more tasks to take care of here:
Handle the --extend option to place new outputs adjacent to existing outputs instead of at 0,0.
Receive and validate events for all of the changes; this would make testing the server event generation possible.
Control output properties and gamma ramps. Right now, there’s no way to test output property effects, and we’re hoping to use them for all kinds of stuff from backlight intensity to TV output settings.
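As a rough model of the auto behaviour and the planned extend placement, consider the sketch below. It is not the xrandr source: the `Output` record and `auto_configure` are hypothetical, the modes are invented, and placing every enabled output left to right is a simplification of "adjacent to existing outputs".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Output:
    name: str
    connected: bool
    preferred: Optional[Tuple[int, int]]  # (width, height), None if unknown
    enabled: bool = False
    pos: Tuple[int, int] = (0, 0)


def auto_configure(outputs: List[Output], extend: bool = False) -> List[Output]:
    """Turn on every connected output at its preferred mode and turn off
    the rest. With extend=True, lay enabled outputs out left to right
    instead of stacking them all at 0,0."""
    next_x = 0
    for out in outputs:
        if out.connected and out.preferred is not None:
            out.enabled = True
            out.pos = (next_x, 0) if extend else (0, 0)
            if extend:
                next_x += out.preferred[0]
        else:
            out.enabled = False
    return outputs


def example_outputs() -> List[Output]:
    """Invented setup: laptop panel, external VGA, nothing on TV."""
    return [
        Output("LVDS", True, (1024, 768)),
        Output("VGA", True, (1280, 1024)),
        Output("TV", False, None, enabled=True),
    ]


for out in auto_configure(example_outputs(), extend=True):
    print(out.name, out.enabled, out.pos)
```

With the invented setup, LVDS stays at 0,0, VGA lands at 1024,0 and the disconnected TV output is switched off.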
On the driver front, we’re doing development in an unusual fashion (and we may regret it if we’re not careful). As our goal is to produce a single driver binary that will run against both 7.2 and newer X servers, we cannot depend on having any new functions in the server binary.
To avoid server dependencies, we’re building a bunch of new driver-independent functionality as if it were part of the server binary and linking that into the Intel driver. As much of this code started life as driver-dependent explorations of how to make RandR 1.2 work, it isn’t quite as independent as we’d like. We’ve done some piecemeal attempts to make it look a little better, but the result is actually fairly ugly at present with a mish-mash of function naming schemes and remnants of driver-dependencies in odd places.
Dave Airlie has been copying this code into the nouveau driver and hooking up RandR 1.2 support there. That’s great as it gives us a chance to make sure these new interfaces are right for more than one driver. I’m hoping someone will also take a look at how this will work in the radeon driver; with that, we’ll have a reasonably broad experience with the new interfaces and should be able to avoid nasty surprises down the road. Of course, drivers will still be able to completely by-pass this layer within the server, so at least we won’t make fixing it impossible, only painful.
I’ve also recently started on some xorg.conf support for output configuration. Right now, this consists of the ability to associate a Monitor section in the config file with each output of the device. From the monitor section, you can add new mode lines, specify DPMS support, override sync ranges and set a preferred mode.
We’ve also started adding some monitor ‘quirks’ to the EDID detection code; I got a trio of monitors from France that all had various incorrect data in their EDID blocks, including one monitor which reported a preferred mode of 640x350. I’d like to keep adding more quirks as we find broken monitors; that lets everyone share the same fixes. Of course, with the xorg.conf support now available, you can override most of the EDID data and work around things at run-time, but if you do have to do this, please submit a bug report and attach the broken EDID data (xrandr --prop will print it out for you).
Aside from general infrastructure cleanup, we’ve still got some features missing from the implementation that we’d like to play with:
Per-CRTC gamma tables. Given the existing XFree86VidModeExtension support for per-screen gamma tables, we’ll need to think a bit about how our per-CRTC gamma tables might be affected by VidMode applications; probably the right answer is that VidMode updates will write all CRTC gamma tables. It looks like this is mostly glue work, though; the driver already exposes the needed functionality (albeit not per-CRTC yet).
Output properties. Right now, we’re reporting EDID data back to clients as an output property, but I’d like to see an example of using properties to control an output in some fashion. One obvious example is the backlight brightness value; that seems really useful and should be fairly easy to hook up. Beyond implementation, it would also be nice to start building a list of standard output properties and their interpretation and placing those in the RandR specification. I expect to update the RandR specification regularly with new output properties; don’t be shy if you have something you’d like to use here.
Finishing the xorg.conf support to include layout and detection override options. Right now, the driver autodetects available outputs, and sometimes it gets it wrong. In particular, mobile Intel chipsets appear to have no way to detect whether an LVDS output is hooked up, so we’re starting to special-case a long list of products using mobile chipsets without a local LCD panel (mac mini, aopen boards, etc).
Xinerama ordering. Right now, you get Xinerama screens ordered by internal structure within the driver. I’m thinking we want to use output properties to order things instead; perhaps something as simple as ‘XINERAMA_HEAD’ values in each output that tell the server how to order things. This should be done entirely within the DIX randr code, not the driver. Some applications use this information to place dialogs and other details.
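To make the ‘XINERAMA_HEAD’ idea concrete, here is a minimal sketch of ordering heads by a per-output integer value. Everything in it is an assumption for illustration: the `struct head` type, the `order_heads` function and the integer encoding of the property are hypothetical, and no such property is defined in the RandR specification as described here.

```c
#include <stdlib.h>

/* Hypothetical pairing of an output with its XINERAMA_HEAD property value. */
struct head {
    const char *output_name;
    int xinerama_head;   /* lower values are reported first */
};

/* qsort comparison callback: ascending by xinerama_head. */
static int compare_heads(const void *a, const void *b)
{
    const struct head *ha = a;
    const struct head *hb = b;

    return (ha->xinerama_head > hb->xinerama_head) -
           (ha->xinerama_head < hb->xinerama_head);
}

/* Sort heads into the order the server would report to Xinerama clients. */
void order_heads(struct head *heads, size_t n)
{
    qsort(heads, n, sizeof *heads, compare_heads);
}
```

Note that `qsort` is not guaranteed to be stable, so two outputs with the same value could come out in either order; a real implementation would need a tie-breaking rule.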
And, of course, it would be fun to see some applications starting to use this, in particular KDE and Gnome both have screen size setting applets which could see some significant enhancements now.
Posted Sun Dec 31 23:53:05 2006
Well, RandR 1.2 work is progressing apace; I can now reconfigure the X server in some fairly dramatic ways. I now regularly use the 1600x1200 monitor at home as extra desktop space for my laptop, growing the X root window to cover both monitors.
But, what I’ve lost in the process is any ability to configure the system at startup time. This is rather unusual; the system is now far more flexible through the RandR protocol than through the configuration file. Getting things set right at startup time seems important, as that will avoid flashing monitors as they change modes and other possible issues.
I started the process of allowing startup-time configuration by making the RandR 1.2 code permit object creation before the Screen objects were created. This allows the driver to create the necessary RandR 1.2 structures and use those to control the configuration process. This would work, except that I’d like the same driver to work in the absence of RandR 1.2 in the core server.
Back to the drawing board.
What I’m doing now is creating some new structures that map to the RandR 1.2 structures but which are hw/xfree86 specific and which don’t depend on RandR 1.2 in the core server. With the goal of eventually moving these into the hw/xfree86 portion of the server, these structures provide all of the RandR 1.2 semantics using smaller driver-specific methods. The code for this new work can then use these new data structures to configure the server at startup time, as well as on-the-fly at runtime using RandR 1.2.
The two primary data structures are the xf86CrtcRec and the xf86OutputRec:
struct _xf86Crtc {
/**
* Associated ScrnInfo
*/
ScrnInfoPtr scrn;
/**
* Active state of this CRTC
*
* Set when this CRTC is driving one or more outputs
*/
Bool enabled;
/**
* Position on screen
*
* Locates this CRTC within the frame buffer
*/
int x, y;
/** Track whether cursor is within CRTC range */
Bool cursorInRange;
/** Track state of cursor associated with this CRTC */
Bool cursorShown;
/**
* Active mode
*
* This reflects the mode as set in the CRTC currently
* It will be cleared when the VT is not active or
* during server startup
*/
DisplayModeRec curMode;
/**
* Desired mode
*
* This is set to the requested mode, independent of
* whether the VT is active. In particular, it receives
* the startup configured mode and saves the active mode
* on VT switch.
*/
DisplayModeRec desiredMode;
/** crtc-specific functions */
const xf86CrtcFuncsRec *funcs;
/**
* Driver private
*
* Holds driver-private information
*/
void *driver_private;
#ifdef RANDR_12_INTERFACE
/**
* RandR crtc
*
* When RandR 1.2 is available, this
* points at the associated crtc object
*/
RRCrtcPtr randr_crtc;
#else
void *randr_crtc;
#endif
};
struct _xf86Output {
/**
* Associated ScrnInfo
*/
ScrnInfoPtr scrn;
/**
* Currently connected crtc (if any)
*
* If this output is not in use, this field will be NULL.
*/
xf86CrtcPtr crtc;
/**
* List of available modes on this output.
*
* This should be the list from get_modes(), plus perhaps additional
* compatible modes added later.
*/
DisplayModePtr probed_modes;
/** EDID monitor information */
xf86MonPtr MonInfo;
/** Physical size of the currently attached output device. */
int mm_width, mm_height;
/** Output name */
char *name;
/** output-specific functions */
const xf86OutputFuncsRec *funcs;
/** driver private information */
void *driver_private;
#ifdef RANDR_12_INTERFACE
/**
* RandR 1.2 output structure.
*
* When RandR 1.2 is available, this points at the associated
* RandR output structure and is created when this output is created
*/
RROutputPtr randr_output;
#else
void *randr_output;
#endif
};
The hardware is manipulated through driver-specific functions contained in the xf86CrtcFuncsRec and xf86OutputFuncsRec:
typedef struct _xf86CrtcFuncs {
/**
* Turns the crtc on/off, or sets intermediate power levels if available.
*
* Unsupported intermediate modes drop to the lower power setting. If the
* mode is DPMSModeOff, the crtc must be disabled, as the DPLL may be
* disabled afterwards.
*/
void
(*dpms)(xf86CrtcPtr crtc,
int mode);
/**
* Saves the crtc's state for restoration on VT switch.
*/
void
(*save)(xf86CrtcPtr crtc);
/**
* Restores the crtc's state at VT switch.
*/
void
(*restore)(xf86CrtcPtr crtc);
/**
* Clean up driver-specific bits of the crtc
*/
void
(*destroy) (xf86CrtcPtr crtc);
} xf86CrtcFuncsRec, *xf86CrtcFuncsPtr;
typedef struct _xf86OutputFuncs {
/**
* Turns the output on/off, or sets intermediate power levels if available.
*
* Unsupported intermediate modes drop to the lower power setting. If the
* mode is DPMSModeOff, the output must be disabled, as the DPLL may be
* disabled afterwards.
*/
void
(*dpms)(xf86OutputPtr output,
int mode);
/**
* Saves the output's state for restoration on VT switch.
*/
void
(*save)(xf86OutputPtr output);
/**
* Restores the output's state at VT switch.
*/
void
(*restore)(xf86OutputPtr output);
/**
* Callback for testing a video mode for a given output.
*
* This function should only check for cases where a mode can't be supported
* on the pipe specifically, and not represent generic CRTC limitations.
*
* \return MODE_OK if the mode is valid, or another MODE_* otherwise.
*/
int
(*mode_valid)(xf86OutputPtr output,
DisplayModePtr pMode);
/**
* Callback for setting up a video mode before any crtc/dpll changes.
*
* \param pMode the mode that will be set, or NULL if the mode to be set is
* unknown (such as the restore path of VT switching).
*/
void
(*pre_set_mode)(xf86OutputPtr output,
DisplayModePtr pMode);
/**
* Callback for setting up a video mode after the DPLL update but before
* the plane is enabled.
*/
void
(*post_set_mode)(xf86OutputPtr output,
DisplayModePtr pMode);
/**
* Probe for a connected output, and return detect_status.
*/
enum detect_status
(*detect)(xf86OutputPtr output);
/**
* Query the device for the modes it provides.
*
* This function may also update MonInfo, mm_width, and mm_height.
*
* \return singly-linked list of modes or NULL if no modes found.
*/
DisplayModePtr
(*get_modes)(xf86OutputPtr output);
/**
* Clean up driver-specific bits of the output
*/
void
(*destroy) (xf86OutputPtr output);
} xf86OutputFuncsRec, *xf86OutputFuncsPtr;
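To show how driver-independent code might drive these hooks, here is a small self-contained sketch of a probe loop that calls `detect` and then `get_modes` on each output. The types are simplified stand-ins for the real structures (which need the X server headers), the `STATUS_*` enumerator names are assumptions since the values of `enum detect_status` are not shown above, and the `fake_*` hooks exist only so the example can run without hardware.

```c
#include <stddef.h>

/* Enumerator names are assumed for illustration. */
enum detect_status { STATUS_CONNECTED, STATUS_DISCONNECTED, STATUS_UNKNOWN };

/* Simplified stand-in for DisplayModeRec: a singly-linked list of modes. */
struct mode { int width, height; struct mode *next; };

struct output;

/* Cut-down analogue of xf86OutputFuncsRec with only the two hooks used here. */
struct output_funcs {
    enum detect_status (*detect)(struct output *output);
    struct mode *(*get_modes)(struct output *output);
};

/* Cut-down analogue of the xf86Output structure. */
struct output {
    const char *name;
    const struct output_funcs *funcs;
    struct mode *probed_modes;
};

/*
 * Probe every output through its driver hooks. Outputs that do not report
 * "connected" (including "unknown", a choice made for this sketch) get an
 * empty mode list. Returns the number of connected outputs.
 */
int probe_outputs(struct output **outputs, size_t n)
{
    int connected = 0;

    for (size_t i = 0; i < n; i++) {
        struct output *o = outputs[i];

        if (o->funcs->detect(o) == STATUS_CONNECTED) {
            o->probed_modes = o->funcs->get_modes(o);
            connected++;
        } else {
            o->probed_modes = NULL;
        }
    }
    return connected;
}

/* Fake driver hooks so the loop can be exercised without hardware. */
static struct mode fake_mode = { 1024, 768, NULL };

enum detect_status fake_detect_connected(struct output *o)
{
    (void)o;
    return STATUS_CONNECTED;
}

enum detect_status fake_detect_disconnected(struct output *o)
{
    (void)o;
    return STATUS_DISCONNECTED;
}

struct mode *fake_get_modes(struct output *o)
{
    (void)o;
    return &fake_mode;
}

const struct output_funcs fake_connected_funcs = {
    fake_detect_connected, fake_get_modes
};
const struct output_funcs fake_disconnected_funcs = {
    fake_detect_disconnected, fake_get_modes
};
```

The point of the indirection is that `probe_outputs` never touches hardware itself; all chip-specific work stays behind the function pointers.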
Right now, I’ve just hacked up the Intel driver internals using these new structures and have left the implementation a tangled mess with driver-specific, randr-specific and other code all tied together. Obviously this is not a long-term plan, but I want to change the data structures first, then split the code apart.
At this point, I’ve managed to get the Intel driver to compile with the new data structures, so I’ll first get it working then start cleaning up the implementation so that driver-independent code is clearly separated and performing as much of the work as possible.
Once driver-independent code is in place, I will add startup configuration to the driver-independent code. I’ll probably use the existing Radeon configuration file options as much as possible, and probably accept the Intel configuration file options as well. While those do not expose the full capabilities of the RandR 1.2 extension, I suspect they’re sufficient for most users.
It’s a long way around to get startup configuration for the new capabilities in the Intel driver, but I’m hoping it will allow us to create a common configuration language for all of the drivers and also remove the current dependence on RandR 1.2 from the Intel driver for much of this functionality.
Posted Sun Nov 26 17:28:51 2006
While busily rewriting the RandR extension and the Intel driver to match, we decided to tackle another major issue: what to do when the driver isn’t given any instruction in the config file about how to light up the screens. And, beyond that, how to make that configurable in reasonable ways while making it driver-independent.
The current scheme is a mish-mash of crufty ancient code and magic driver-specific hacks. The ancient driver-independent code doesn’t understand that a single video card can have multiple monitors, so the normal configuration mechanisms are mostly harmful. For the Intel driver, the mode that you specify in the configuration file is almost, but not entirely, ignored.
In the bad old days of BIOS-based mode selection, it would just use the specified mode to try and match some mode present in the BIOS. In the brave new world of native mode setting, we can at least use the mode provided directly, assuming the output is capable of using it. In either case, all screens were programmed with the same mode (more or less). Not very useful when you have a 1024x768 internal panel and a 1600x1200 external monitor.
Almost all of the i830 and later Intel chips have two “pipes”. Each pipe can be connected to a variety of “outputs” (where an output is effectively a connector, like a local LCD panel, or an external VGA connector). The “pipes” are important here because it takes a pipe to hold a specific mode; the pipe fetches data from the frame buffer and sends it to the outputs with the timing specified by the mode.
Now, the weird thing is you can sometimes connect multiple outputs to a single pipe. But, when you do that, each output gets exactly the same mode and sees exactly the same pixels out of the frame buffer. Plus, there are other restrictions, like you can’t share a pipe with the local LCD panel or the TV output on the 945. Whatever. We mostly ignore this at present because it’s not that useful, and it’s a pain to think about. Of course RandR supports it and will expose it to the user when it can work, but that’s not often.
Ok, with that brief diversion into the oddities of the Intel graphics chip, let’s get back to configuration.
To allow the user to customize how modes were set, there were three parameters in the config file:
Option "Clone" "yes"
Option "CloneRefresh" "60"
Option "MonitorLayout" "LFP,CRT"
The “Clone” option directed the driver to turn on two of the outputs. The “CloneRefresh” option specified the vertical sync rate for the “other” monitor. “MonitorLayout” gave the user precise control over which outputs are connected to which pipes.
The current BIOS-based driver on the master branch adds a bunch more:
Option "MergeFB" "yes"
Option "MetaModes" "1024x768"
Option "SecondHSync" "80-130"
Option "SecondVRefresh" "50-75"
Option "SecondPosition" "RightOf"
Option "MergedXinerama" "yes"
As you might guess, these all combine to let you place the second monitor somewhere other than right on top of the first monitor. Useful when you have two monitors on your desktop.
To confuse you further, the Radeon driver uses a different set of options to perform exactly the same function. Cool, huh? Even more fun is that these two drivers have completely different semantics of how to interpret the lack of these options. Makes building a configuration tool fairly challenging, and makes the chances that the user will get “random” results high.
So, the first thing to realize is that RandR 1.2 makes all of these things entirely configurable. But, not until you have the server running and can connect an X client. Bummer. With that, you’d have to put the desired configuration into a startup utility and you’d watch the screen flash a couple of times as you were logging in. Fun for some, but annoying for most.
Given that RandR has sufficient power to configure things after the server has started, and that this is expressed in a driver-independent fashion, it seems sensible to figure out how to use that information at startup time to make better choices and ease customization.
The first piece I’ve done is to replace the existing default mode selection logic with something a bit fancier and (I hope) more generally useful. After that, I’ll write up some replacement configuration options and use those to mutate this configuration.
For the initial default configuration (used when no options are present in the configuration file), I made some simplifying assumptions:
- Just start up in clone mode. Every display showing the same picture.
- Monitors that have preferred modes should use them if possible
- 96dpi is what Microsoft uses; pick a mode close to that if possible
- Give every monitor a similar size.
Given that I wanted to present the same data on every monitor, I started by picking a single monitor to control the size of the screen. For this, the code first looks for a monitor with a preferred mode; those are usually either the laptop LCD panel or an external DVI-connected LCD monitor. If no such monitor is present, the code picks a random monitor and selects a mode that will present data at about 96dpi. I think this makes more sense than picking the highest supported resolution as CRTs often advertise support for incredibly high resolutions that end up fuzzy and dim. Better to just pick a reasonable size by default and let the user change it after login.
Once the first mode is selected, all of the other monitors are set to modes that are close to that size.
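The selection steps above can be sketched roughly as follows. This is not the Intel driver code: the types and function names are hypothetical, the sketch takes the first monitor with any modes where the text says a random one is picked, and the "closest size" metric (sum of width and height differences) is an assumption, since the text only says the other monitors get modes close to the chosen size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified types; not the driver's structures. */
struct mode {
    int width, height;
    bool preferred;        /* flagged as preferred by the monitor */
};

struct monitor {
    int mm_width;          /* physical width in millimetres, 0 if unknown */
    const struct mode *modes;
    size_t n_modes;
};

/* Horizontal dots per inch of a mode on a monitor of known width. */
double mode_dpi(const struct mode *m, int mm_width)
{
    return m->width * 25.4 / mm_width;
}

/* Pick the mode that controls the screen size, or NULL if there are no modes. */
const struct mode *pick_primary_mode(const struct monitor *mons, size_t n)
{
    /* First choice: a preferred mode on any monitor. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < mons[i].n_modes; j++)
            if (mons[i].modes[j].preferred)
                return &mons[i].modes[j];

    /* Fallback: on the first monitor with modes, get as close to 96dpi as possible. */
    for (size_t i = 0; i < n; i++) {
        const struct monitor *mon = &mons[i];
        const struct mode *best;
        double best_err;

        if (mon->n_modes == 0)
            continue;
        best = &mon->modes[0];
        if (mon->mm_width <= 0)
            return best;               /* no physical size, nothing to compare */
        best_err = mode_dpi(best, mon->mm_width) - 96.0;
        if (best_err < 0)
            best_err = -best_err;
        for (size_t j = 1; j < mon->n_modes; j++) {
            double err = mode_dpi(&mon->modes[j], mon->mm_width) - 96.0;

            if (err < 0)
                err = -err;
            if (err < best_err) {
                best = &mon->modes[j];
                best_err = err;
            }
        }
        return best;
    }
    return NULL;
}

/* For another monitor, pick the mode closest in size to the primary mode. */
const struct mode *closest_mode(const struct monitor *mon, const struct mode *target)
{
    const struct mode *best = NULL;
    int best_dist = 0;

    for (size_t j = 0; j < mon->n_modes; j++) {
        int dist = abs(mon->modes[j].width - target->width) +
                   abs(mon->modes[j].height - target->height);

        if (best == NULL || dist < best_dist) {
            best = &mon->modes[j];
            best_dist = dist;
        }
    }
    return best;
}
```

For example, a CRT 320mm wide offering 640x480, 1024x768 and 1600x1200 comes out at roughly 51, 81 and 127dpi respectively, so the fallback picks 1024x768 rather than the highest resolution.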
Finally, the list of monitors is used to compute the maximum screen size that should be permitted. Yes, we’re still stuck with allocating the frame buffer at server init time. This will, eventually, go away, but that’s related to rendering infrastructure which we’re ignoring this week. In any case, the maximum size is computed by figuring out how much space is needed to place all of the known monitors side-by-side. For outputs which don’t have any monitor connected, we just pretend that they’ll max out at 1600x1200. The end result is that there shouldn’t be any limits on which mode combinations can be used in clone or mergefb mode.
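A minimal sketch of that maximum-size computation, assuming "side-by-side" means a single horizontal row (widths add up, the tallest output sets the height). The type and function names are hypothetical; only the 1600x1200 fallback for outputs with no monitor connected comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Largest mode each output could use; hypothetical type for illustration. */
struct output_limit {
    bool connected;
    int max_width, max_height;   /* ignored when not connected */
};

struct screen_size { int width, height; };

/* Assumed limits for outputs with no monitor connected. */
#define FALLBACK_WIDTH  1600
#define FALLBACK_HEIGHT 1200

/*
 * Compute the frame buffer size needed to lay every output out in one
 * horizontal row: widths are summed, the height is the maximum.
 */
struct screen_size max_screen_size(const struct output_limit *outs, size_t n)
{
    struct screen_size size = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        int w = outs[i].connected ? outs[i].max_width : FALLBACK_WIDTH;
        int h = outs[i].connected ? outs[i].max_height : FALLBACK_HEIGHT;

        size.width += w;
        if (h > size.height)
            size.height = h;
    }
    return size;
}
```

With a 1024x768 panel, a 1600x1200 monitor and one unconnected output, this yields 4224x1200.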
Right now, all of this code is down inside the Intel driver. I will pull the DIX-level functions up to the RandR code. What remains is to decide how to make the remaining code driver independent and (eventually) move it to the xf86 common layer where it can be shared across multiple drivers.
All of this work is available on the modesetting branch of the Intel driver when built against the randr-1.2-for-server-1.2 branch of the X server.
Posted Thu Nov 16 22:32:14 2006
What a pleasant weekend I’ve had; no nasty meetings or politics, just some good clean hacking fun.
I spent a few hours each day poking at the modesetting branch and getting it working on my shiny new 965-based desktop system. Eric had been working on the SDVO support and gotten that working, so I figured I’d at least get the CRT output working, which seemed like an easy enough task.
The BIOS-based modesetting code was already working, so I knew the hardware worked correctly. But, our existing CRT modesetting code was producing a nice black screen. I love modesetting code—the most common error indication is just ‘sadness’; the monitor remains black and indicates that there is no signal present on the wire.
I poked around looking at what the BIOS did and how that differed from what the modesetting driver did and made a small bit of progress thanks to an accident. I left the video clock programmed for 1600x1200 and then asked the server to display a 640x480 mode. One would expect this would leave the video mode running far too fast. But, to my surprise, the monitor happily locked onto this and reported that it was running at 85Hz. Weird. A bit of math and I discovered that somehow the clock was getting divided by 4 somewhere. Sure enough, simply multiplying the real clock by 4 left me with stable modes across a wide range of sizes. Unfortunately, not including the high-resolution modes loved by our users; those were now out of the reach of the programmable clock. But, it made the question of where the problem was a bit clearer—something was wrong with the clock register programming.
The next accidental discovery was that in pure clone mode, with both CRT and DVI connected to the same pipe, I also managed to see a working mode for the lower resolutions (640x480 and 800x600). Higher resolutions still failed. It’s important to note that the DVI connector is reached through the SDVO port, which must run at high frequencies. Low frequency modes are padded with junk and clocked faster to keep the bus stable. For 640x480 and 800x600 modes, the clock is multiplied by 4.
While I had looked at the register results for the BIOS mode setting, I hadn’t seen it in action. Fortunately, action shots were available.
Dave Airlie hacked up Matthew Garrett’s vbetest program and left it on here. This fine piece of work executes the video bios and monitors all of the video device register accesses it performs. Watching it live allowed me to see precisely which registers it thought were related to clock timing. I noticed that it set the DPLLAMD register when programming a pure CRT mode, something which seemed a bit odd to me as that register has rather vague documentation about UDI and SDVO outputs. But, one small sentence did mention CRT multipliers of some sort, so I figured I might as well give it a try. Stealing the same setting that it used, the CRT now locked nicely using the normal un-multiplied clock frequency, and worked across the whole range of modes.
Thanks Dave, thanks Matthew.
The final adventure for the weekend was to discover why my screen image was getting corrupted when I used a large frame buffer. The effect was quite mystic: contents written to one location in memory would be duplicated to many locations on the screen. I thought it might be FIFO size issues, but exploration with an application window and a window manager demonstrated that the corruption was not just on the screen, but actually visible to the GPU and CPU as well, and appeared to be caused by multiple GATT entries mapping to the same physical page. I eventually discovered that just skipping the first 256K of video memory and not using it fixed the problem; I haven’t looked into this in more detail, but it seems likely that those areas are actually mapped to the GATT table itself, and using them for other things caused the symptoms observed above. For now, I’ve just made the driver skip over that amount of memory; it fixes the problem I had.
I also spent a bunch of time shrinking the driver to eliminate a bunch of redundant state. We’re planning on moving all of the initial frame buffer configuration to common RandR code shared across drivers, so I went ahead and disabled the Intel-specific code in the driver. Yes, this means that there’s no way to configure screen layout when you start the server, but you can use the RandR extension afterwards to make it do whatever you like. It’s temporary; eventually we’ll get the common code working. This is probably the first time X has had this kind of state which can only be set through the protocol and not in the config file.
As Eric made a merge that broke things before he took off for the weekend (strong work, Eric), I’ve placed my work on a separate branch for merging this week sometime. Everything here is on the modesetting-keithp branch in the xf86-video-intel repository.
Posted Sun Nov 5 20:02:27 2006
While in Shanghai a few weeks ago, I picked up a Nokia 6131 telephone. Prices there are quite reasonable, and I was fed up with my Motorola Razr (the worst phone ever invented, as far as I can tell). Friends familiar with Nokia phones suggested I might prefer a Series 40 phone, which while less feature-rich than the Symbian models tend to run quite a bit faster. Shopping for telephones was quite easy; every model I’d ever heard of was available from multiple vendors. Someday maybe the US will rediscover the simple joy of providing what the customer wants.
In any case, once we had switched the 6131 from Chinese to English, I slipped my sim card into place and was happily conversing and taking pictures with the new toy.
Of course, one of the big goals in moving from the Razr to the Series 40 was to get synchronization between my Evolution contacts and calendar and the applications on the telephone. Not since my Treo 600 had I been able to see my schedule and access my whole phone book from my cell phone.
The 6131 putatively supports SyncML, but attempts to get the opensync SyncML plugin working ended in failure after much gnashing of teeth. The plugin would load, but the synchronization process would just hang without transferring a single phone number. Sigh. Clearly this phone doesn’t quite follow the same interpretation of the syncml standard as the opensync plugin.
Finally, I started looking around for alternatives when I discovered that the Gnokii folks had created a gnokii opensync plugin using their Nokia-specific backend. Shockingly, there was no Debian package, so I downloaded the latest source and built it. Surprisingly enough, it appears to work fine.
Well, almost fine. I’ve got ‘a few’ contacts in my address book, and the simplistic gnokii code for locating a free address book slot was reading every address book entry looking for a free spot. Oddly, it spent a lot of time re-reading address book entries. That was easy to fix at least.
Next, I discovered that the gnokii sync code wasn’t dealing with finite repeating events, events which repeat for a while and then stop. Every repeating event would go on forever. I use repeating events for conferences by setting them to repeat every day for the length of the conferences. I go to a few conferences each year, so with 10 years of conference history, I had several repeating events occurring ‘today’. I tried to make this work correctly; the phone appears to have a notion of ‘occurrences’, which I was guessing meant a count of repeat events. I didn’t manage to get this working, so I kludged around it and set these events to non-repeating, which at least removes them from view for the moment.
Of course, hacked versions of gnokii and gnokii-sync are available from my git repository.
Ah, life with a functioning telephone again. We’ll see how long this lasts.
Posted Tue Oct 31 00:54:45 2006
Ok, so one thing I haven’t blogged about is how the X.org SCM selection was made. Some of you may know the story, but it may surprise others to learn that there was no democratic process involved. Usually, X.org leans heavily on consensus or at least voting when making global choices about the direction of the project; I’ve been subjected to some of them and accept the consequences when things don’t go my way.
However, when selecting an SCM, I decided (already the tyrant) that it really couldn’t involve even a substantial minority of the project developers. Learning enough about the available SCMs takes a lot of time; I spent about a year looking at options and trying things out. During that time, I downloaded SCM source code, built repositories, converted bits of the X.org tree and looked at the results.
Finally, last January, I was fortunate to be at LCA along with key developers for Bzr, Mercurial and Git. Taking advantage of the situation, I sat down with each of them and talked about their system architecture and overall goals. After the week was over, it was clear to me that the right choice for X.org was Git. I’d say the chances of getting a dozen key X.org developers to spend that kind of time doing this research are slim; I happened to enjoy significant latitude in my activities between OLS and LCA in 2005 as I moved from HP to Intel.
Any choice which involves forcing dozens or hundreds of people to study abstruse details about a system in which they have no fundamental interest is doomed to failure; most of the group will just not bother, and will end up choosing essentially randomly, with a slight bias to whatever is most familiar, assuming that familiar choices will be less likely to be really bad.
Hence Subversion; it sounds safe because it’s advertised as ‘just like CVS, except fixed a bit’. After only a few months, it was clear that SVN would be a really poor fit for X.org, and X.org developers were already rumbling about moving to SVN as the obvious upgrade from CVS. Clearly democracy was going to fail here.
So, I took matters in my own hands and pre-emptively switched a significant, if fairly stable, piece of the X.org infrastructure from CVS to Git. Perhaps not handled in the most politic fashion, the result was a reasonably animated discussion about the results. Suddenly the discussion ended; people discovered that Git wasn’t that frightening and that I was reasonably serious about keeping at least the pieces I owned under Git control. Discussions about how to complete the migration from CVS to Git ensued shortly thereafter with a complete migration plan created that spanned a few months.
Yes, the developers were forced to come up to speed on a new SCM if they wanted to contribute. But, making the switch without a lot of sturm und drang meant we could focus more resources on helping those users figure out how to use the new system and fixing the various problems uncovered by the migration. I’m happy to say that now the transition is complete and development is proceeding apace. And, we’ve even gained contributions from some users who were already used to Git (thanks, Greg K-H) but refused to use CVS.
Posted Sun Oct 22 20:44:23 2006