
In case you’ve been hiding under a rock for the last several months, I’d like to remind you that the Linux Plumbers Conference is currently soliciting submissions for the following tracks:

  1. Audio: Lennart Poettering
  2. Boot and Init: Dave Jones
  3. Embedded Systems: Greg Kroah-Hartman and David Woodhouse
  4. Energy Efficiency, Performance, and Power Management
  5. Inter-Distributor Cooperation: James Bottomley
  6. Kernel/Userspace/User Interfaces: Jim Gettys
  7. Networking: Steve Hemminger
  8. Security: James Morris and Paul Moore
  9. Storage: Matthew Wilcox
  10. Video Input Infrastructure
  11. X Window System: Keith Packard

I’m particularly interested in submissions around the changes in the Linux desktop, past, present and future. We’re seeing all kinds of new Linux-based user interfaces around these days, and I’d like to hear about where things are going, from both the hardware and software perspective. It’s Plumbers, so sessions which will generate active discussion among the participants are the best kind.

As appears to be the tradition with Linux conferences, we’ve received numerous requests for “just a bit more time” in the submission process, and so the deadline has been extended from today until next Monday, June 22nd. Please head on over to the submission page and make sure we know you’re interested in contributing.

Posted Mon Jun 15 19:06:27 2009

To build code for TeleMetrum, we’re using SDCC, the Small Device C Compiler, as the CPU inside the cc1111 is an 8051 clone, an 8-bit microprocessor for which SDCC has excellent support (more about the flight software later).

SDCC version 2.9.0 was recently uploaded to Debian unstable, and when I built our flight software with the new version, I discovered a bug in the display of strings formatted by printf. First assuming that the bug was in my source code, I tried to figure out what I’d done wrong, but then I eventually looked at the 8051 assembly output (ick) and discovered that the compiler was generating the wrong code for pointers when passed to a varargs function. A bit of hacking and I soon had a short test case that demonstrated the bug:

extern void f(char *x, ...);

void
func(__xdata char *s)
{
    f("hi", s);
}

I filed a brief bug report and attached the test case, then went to download the current source code to see if I couldn’t uncover the source of the bug. I have to say that reading through the SDCC source code was reasonably pleasant; a competent compiler in very little code that was easy to grasp. I eventually located the bug, and discovered that it was from a change made last December as part of a pointer-related optimization, and I posted a patch that I found would fix the specific problem I had found.

The nicest part came next: once I’d posted the patch, a reasonably lively discussion between Maarten Brock, Borut Ražem and Raphael Neider came to a quick consensus about what the desired behavior in this case would be.

Then, Maarten Brock applied my patch to the project and, much to my amazement, he included a regression test that verified the desired behaviour in both the case that I had uncovered and several other cases as well.

I just want to applaud these developers for building a great compiler and running a great project.

Posted Fri May 1 13:40:50 2009

This week, we finished up our 2009 Q1 release of the Intel driver. Most of the effort for this quarter has been to stabilize the recent work, focusing on serious bugs and testing as many combinations as we could manage.

For the last year or so, we’ve been busy rebuilding the driver, adding new ways of managing memory, setting modes and communicating between user space and the kernel. Because all of these changes cross multiple projects (X/Mesa/Linux), we’ve tried to make sure we supported all of the possible combinations. Let’s see what options we’ve got:

Mode Setting

  1. User mode. The entire output side of the driver stack is in user mode; all of the output detection, monitor detection, EDID parsing etc. This has some significant limitations, the worst of which is that the kernel has no idea what’s going on, so you cannot show any kernel messages unless the X server relinquishes control of the display. In particular, panic messages are lost to the user. If the X server crashes, the user gets to reboot the machine. A more subtle limitation is that the driver couldn’t handle interrupts, so there wasn’t any hot-plug monitor support. That’s becoming increasingly important as people want hot-plug projector support, and as systems start including DisplayPort, which requires driver intervention when the video cable gets kicked out of the machine.

  2. Kernel mode. All of that code moves into the kernel, where it is exposed both as a part of the DRI interface and also through the frame buffer APIs for use by the fb console or any other frame buffer applications. Lots of benefits here, but the development environment is entirely different from user mode, and so porting the code is a fair bit of work. Dave Airlie has some pie-in-the-sky ideas about making the kernel mode setting code run in user mode by recompiling it with suitable user-mode emulation of the necessary kernel APIs.

Direct Rendering

  1. None. In this mode, the system doesn’t support direct rendering at all, so all rendering must go through the X server. GL calls are generally implemented as a software rasterizer inside the X server.

  2. DRI1. In this mode, applications share access to a single set of front/back/depth/stencil buffers. They must carefully ensure that all of their drawing operations are clipped to the subset of each buffer that their window occupies. Each application performs their own buffer swapping by copying contents from suitable regions of each buffer. Synchronization with the X server is done through signals and a shared memory buffer. While any application is executing, all other applications are locked out of the hardware, even if they wouldn’t be conflicting.

  3. DRI2. This gives each application private back/depth/stencil buffers; they draw without taking any locks as the kernel mediates access to each object. The real front buffer for each window is owned by the window system, and so requests to draw into it are directed through the window system API, using the DRI2 extension in the case of the X window system. When applications ask to draw to the ‘front’ buffer, they get a fake buffer allocated, which operates almost exactly like the back buffer, except that copies to the ‘real’ front buffer are automatically performed at suitable synchronization points.

Memory Management

  1. X server + Old-style DRI. In this mode, the X server asks for a fixed amount of memory which is then permanently bound to the graphics aperture and treated like the memory on a discrete graphics card. The X server allocates pixmaps from this fixed pool; when that runs out, it uses regular virtual memory. It may move objects back and forth by copying them between the aperture and virtual memory.

    This mode also supports direct rendering. While the direct rendering application holds the DRI1 lock (remember that from above?), it has exclusive access to an area within the aperture which is granted to it by the X server. Pages for this area are statically allocated (by the X server). Whenever the application loses the DRI1 lock, any or all of the data stored in those pages may be kicked out, so the application must be prepared to lose the data without notice. For data like textures and vertex lists, which are generated by the application and not (generally) written by the GPU, this works fairly well; the application has a copy of the data already and can re-upload it should it disappear.

  2. GEM. Here, no pages are statically allocated for exclusive use by the graphics system. Instead, individual objects (“buffer objects”, or “bo’s”) are allocated as chunks of pages from regular virtual memory. When not in use, these objects can be paged out. Furthermore, applications aren’t limited to the graphics aperture space; they allocate from the system pool of virtual memory instead. As objects are used by applications, the kernel dynamically maps them into the graphics aperture.

2D acceleration

  1. None. The X server has a complete software 2D rendering system (fb), and if the driver doesn’t provide any accelerated drawing mechanism, the X server can use that software stack to provide all of the necessary drawing operations. While it seems like this would be terribly slow, in reality, it’s not that bad as long as the target rendering surface is present in cached memory, and not accessed through a device in write-combining or uncached modes.

  2. XAA. This is the old XFree86 rendering architecture, and heavily focuses on ‘classic’ X drawing operations, including zero-width lines, core text and even core wide lines and arcs. It does not support accelerated drawing to anything other than the screen or pixmaps which precisely match the screen pixel format. Pixmaps are allocated from a subset of the frame buffer, as if they were actually on the screen. This causes huge problems for chips which have limited 2D addressing abilities (like Intel 8xx-945 and older ATI chips) as they cannot use any memory beyond a single 2D allocation.

  3. EXA. This code was lifted from kdrive, where it was designed as a minimal graphics acceleration architecture for embedded X servers. In that original design, all pixmaps were allocated in system memory as the target systems had essentially no off-screen memory available to the graphics accelerator. In addition, with the goal of bringing up an X server quickly on simple hardware, the only accelerated operations were 2D solid fills and 2D blits. However, one key feature of that code was that it provided a uniform API for drawing with arbitrary pixel formats. As that code was moved into the core X server, it was changed so that pixmaps could be allocated from graphics memory. In addition, acceleration for the Render extension was added so that modern applications could get reasonable performance for anti-aliased text and composited images. However, it can only accelerate rendering to objects stored in graphics memory, and that memory must be pre-allocated by the X server (see the Memory Management section above). Once you run out of that memory, the X server is stuck trying to figure out what to do. This single issue has been the focus of EXA development for the last couple of years — when to move data between virtual memory and graphics memory. Objects in graphics memory are drawn fastest with the GPU and objects in virtual memory can only be drawn with the CPU. The key problem here is that reading data from graphics memory is horribly expensive, so the cost of moving an object from graphics memory to virtual memory is high. When everything is in the right memory space, EXA runs fast. When you start thrashing things around, EXA runs slow.

  4. UXA. Assume your GPU can draw to arbitrary memory. Now assume that EXA’s basic drawing operations are sound, and do a reasonable job of supporting 2D applications (as long as they fit within graphics memory). UXA comes from the combination of these two assumptions — GEM provides the first and the EXA drawing code provides the second. UXA doesn’t need any of the (ugly) pixmap ‘migration’ code because pixmaps never move — they stay in their own little set of pages and the GEM code maps them in and out of the aperture as needed. So, UXA and EXA are not far apart in style or substance, UXA simply skips the parts of EXA which are not necessary in a GEM world.

Pick One From Each Column

Now, many of the above choices can be made independently — you can use User mode setting with DRI1, classic memory management and XAA. Or you can select Kernel mode setting with DRI1, GEM and EXA. With 2 × 3 × 2 × 4 = 48 combinations, you can imagine that:

  • Some of them can’t work together
  • Some of them haven’t been tested
  • Some of them haven’t been tuned for performance
  • Some work well on i915, and poorly on 965GM
  • Others work well on 965GM and poorly on 855
  • None of them (yet) work perfectly well everywhere
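The count of 48 is easy to check by brute force. Here is a minimal C sketch that enumerates one pick from each column; the option strings come from the lists above, while the array names and `count_combinations` are made up for this illustration and don't exist in any driver.

```c
#include <stddef.h>

/* Option names taken from the four lists above. The array and function
 * names are illustrative only and are not part of any driver. */
static const char *const mode_setting[]      = { "user", "kernel" };
static const char *const direct_rendering[]  = { "none", "DRI1", "DRI2" };
static const char *const memory_management[] = { "classic", "GEM" };
static const char *const accel_2d[]          = { "none", "XAA", "EXA", "UXA" };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Count every way of picking exactly one option from each column. */
static size_t count_combinations(void)
{
    size_t n = 0;

    for (size_t m = 0; m < COUNT(mode_setting); m++)
        for (size_t d = 0; d < COUNT(direct_rendering); d++)
            for (size_t g = 0; g < COUNT(memory_management); g++)
                for (size_t a = 0; a < COUNT(accel_2d); a++)
                    n++;
    return n;
}
```

This counts every combination, including the ones that can't actually work together; it makes no attempt to model which pairings are valid.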

Two years ago, you had a lot fewer choices: only user mode setting, none or DRI1 direct rendering, only X server memory management and only none, XAA or EXA acceleration (1 × 2 × 1 × 3 = 6 choices). Even then, choosing between XAA and EXA was quite contentious: EXA would thrash memory badly, while XAA would effectively disable acceleration for pixmaps as soon as it ran out of its (tiny) off-screen space.

In moving towards our eventual goal of a KMS/GEM/DRI2 world, we’ve felt obligated to avoid removing options until that goal worked best for as many people as possible. So, instead of forcing people to switch to brand new code that hasn’t been entirely stable or fast, we’ve tried to make sure that each release of the driver has at least continued to work with the older options.

However, some of the changes we’ve made have caused performance regressions in these older options, which doesn’t exactly make people happy — the old code runs slow, and the new code isn’t quite ready for prime time in all situations. One option here would be to stop shipping code and sit around working on the ‘perfect’ driver, to be released soon after the heat-death of the universe.

Instead, we decided (without much discussion, I’ll have to admit) to keep shipping stuff, make it work as well as we knew how, and engage the community in helping us make this fairly significant transition to our new world order. We did, however, make a very conscious choice to push out new code quickly — getting exposure to real users is often the best way to make sure you’re not making terrible mistakes in the design. The thinking was that users could always switch back to the ‘old’ code if the new code caused problems. Of course, sometimes that ‘old’ code saw fairly significant changes while the new code was integrated…

You can imagine that our internal testing people haven’t been entirely happy with this plan either — our count of bugs has been far too high for far too long, and while we spent the last three months doing nothing but fixing things, it’s still a lot higher than I’d like to see.

Performance Differences

Only a few things in the above lists have obvious performance implications — choose XAA and your performance for modern applications will suffer as it offers no acceleration for the Render extension. So, why does switching from EXA to UXA change the performance characteristics of the X server so much? The simple answer is that UXA, GEM and KMS haven’t been tweaked on every platform yet.

For example, hardware rendering performance is affected by how memory is accessed by the drawing engine. There are two ways of mapping pixels, “linear” and “tiled”. In linear mode, pixels are stored in sequential addresses all the way across each scanline; subsequent scanlines are at ever higher addresses. A simple plan, and all of the software rendering code in the X server assumes this model. In tiled mode, rectangular chunks of the screen are stored in adjacent areas in memory; a block of 128x8 pixels forms an ‘X tile’ in the Intel hardware. Drawing to vertically adjacent pixels in this mode means touching the same page, reducing PTE thrashing compared with linear mode. For systems with a limited number of PTEs and limited caches inside the graphics hardware, tiled mode offers tremendous performance improvements. However, getting everything lined up to hit tiled mode is a pain, and on some hardware, in some configurations, it doesn’t happen, so you see a huge drop in performance.
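To make the page-touching argument concrete, here is a rough C sketch of the two layouts. It is a simplification built on assumptions of mine: 32 bits per pixel (so a 128x8 tile is exactly one 4096-byte page), tiles stored one after another in row-major order, a surface width that is a multiple of the tile width, and no hardware address swizzling. The function names are hypothetical.

```c
#include <stdint.h>

#define BYTES_PER_PIXEL 4u    /* assumption: 32 bits per pixel */
#define TILE_WIDTH      128u  /* pixels, per the X tile described above */
#define TILE_HEIGHT     8u    /* scanlines per tile */
#define TILE_BYTES      (TILE_WIDTH * TILE_HEIGHT * BYTES_PER_PIXEL) /* 4096 */
#define PAGE_SIZE_BYTES 4096u

/* Linear layout: each scanline follows the previous one in memory. */
static uint32_t linear_offset(uint32_t x, uint32_t y, uint32_t width)
{
    return (y * width + x) * BYTES_PER_PIXEL;
}

/* Simplified tiled layout: whole tiles are stored back to back, one row
 * of tiles after another. width must be a multiple of TILE_WIDTH. */
static uint32_t tiled_offset(uint32_t x, uint32_t y, uint32_t width)
{
    uint32_t tiles_per_row = width / TILE_WIDTH;
    uint32_t tile_index = (y / TILE_HEIGHT) * tiles_per_row + x / TILE_WIDTH;
    uint32_t within = (y % TILE_HEIGHT) * TILE_WIDTH + x % TILE_WIDTH;

    return tile_index * TILE_BYTES + within * BYTES_PER_PIXEL;
}

/* Count how many times the page changes while walking down a vertical
 * line of `height` pixels at column x. */
static uint32_t pages_touched(uint32_t (*offset)(uint32_t, uint32_t, uint32_t),
                              uint32_t x, uint32_t height, uint32_t width)
{
    uint32_t pages = 0;
    uint32_t last_page = UINT32_MAX;

    for (uint32_t y = 0; y < height; y++) {
        uint32_t page = offset(x, y, width) / PAGE_SIZE_BYTES;

        if (page != last_page) {
            pages++;
            last_page = page;
        }
    }
    return pages;
}
```

With a 1024-pixel-wide surface, every scanline in the linear layout is exactly one page long, so a 64-pixel vertical line lands on a new page for every pixel. In the tiled layout the same line only changes page once every eight scanlines.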

Similarly, mapping pages in and out of the GTT sometimes requires that the contents be flushed from CPU or GPU caches. Now, GPU cache flushing isn’t cheap, but we end up doing it all the time as that’s how rendering contents are guaranteed to become visible on the screen. CPU cache flushing, on the other hand, is something you’re never “supposed” to do, as all I/O operations over PCI and communication between CPU cores is cache-coherent. Except for the GPU. So, we end up using some fairly dire slow-paths in the CPU whenever we end up doing this. UXA isn’t supposed to hit cache flushing paths while drawing, but sometimes it still happens. So, you get UXA performance loss sometimes. On the other hand, failing to dynamically map objects into the GTT means that some objects don’t fit, and so EXA spends a huge amount of time copying data around, in which case EXA suffers.

The difference between DRI1 and DRI2 is due in part to the context switch necessary to get buffer swap commands from the DRI2 application to the X server which owns the ‘real’ front buffer. For an application like glxgears which draws almost nothing, and spends most of its time clearing and swapping, the impact can be significant (note, glxgears is not a benchmark, this is just one of many reasons). On the other hand, having private back buffers means that partially obscured applications will draw faster, not having to loop over clip rectangles in the main rendering loop.

The obvious result here is that we’re at a point where application performance goes all over the map, depending on the hardware platform and particular set of configuration options selected.

Light at Tunnel’s End

The good news is that our redesign is now complete, and we have the architecture we want in place throughout the system — global graphics memory management, kernel mode setting and per-window 3D buffers. This means that the rate of new code additions to the driver has dropped dramatically; almost to zero. Going forward, users should expect this ‘perfect’ combination to work more reliably, faster and better as time goes by.

Right now, we continue to spend all of our time stabilizing the code and fixing bugs. A minor but important piece of this work is to get UXA running without GEM so that we have EXA-like performance on older kernels. That should be fairly straightforward as UXA shares all of the same basic EXA acceleration code, and the EXA pixmap migration stuff works best when it works in the most simplistic fashion possible (move to GPU when drawing, move out only under memory pressure), something which we can provide in the GEM emulation layer already present under UXA.

Our overall plan is to focus our efforts on the ‘one true configuration’. The best way to do that is to work on reducing the number of supported configurations until we get to just that one. First on the block are XAA and EXA. XAA because no-one should have to use that anymore, and EXA because it’s just UXA with some pixmap management stuff we don’t need. There’s no reason UXA should be slower than EXA, once the various hidden performance bugs are fixed.

At the same time, DRI1 support will be removed. We cannot support compositing managers under DRI1, nor can we support frame buffer resize and a host of other new features. You’ll still get a desktop without DRI1, you just won’t get accelerated OpenGL. With the necessary infrastructure in the kernel and X server already released, this seems like the right time to switch off a huge pile of code.

Initial measurements from this work show that we’ll be shrinking our codebase by about 10%.

Moving beyond this next quarterly release, the remaining ‘legacy’ piece is the user mode setting code. Something like 50% of the code in the 2D driver relates to this, so removing it will rather significantly reduce our code base. You can only imagine how excited we are about this prospect.

The goal is to take the driver we’ve got and produce a leaner, faster, more stable driver in the next few releases to come.

Posted Fri Apr 24 17:12:45 2009

Now that I’ve got USB working on the cc1111, I’m feeling like it’s time to start thinking about actually building the flight software. I’ve written up some very rough ideas as a starting point based on what I know about the hardware we’ve got.

Posted Sat Feb 21 22:52:22 2009

My good friend Bdale Garbee and I are working on a new rocket flight computer called TeleMetrum. It’s using a TI CC1111 microcontroller, which contains a digital RF transceiver along with a tiny microprocessor based on the universally loved Intel 8051.

The CC1111 has the usual array of microcontroller I/O ports: A to D converters, GPIO pins, SPI, I2C, I2S and regular serial ports. It also has a USB device controller, which is why we selected it over the otherwise identical CC1110.

I hadn’t ever written a USB device controller before, and my USB experience had been limited to writing a debug interface driver using libusb for this project, which should be the subject of another blog posting someday. USB appears to have been designed by mean people; just getting a simple two-way bytestream involves a huge pile of code.

Starting with FreeRTOS, I found a compatible USB stack written for the LPC2148 processor, lpcusb. Fortunately, that stack is fairly cleanly written, with a narrow interface between the stack and the device. I figured I could replace the LPC2148 bits with CC1111 bits and have it running in short order. Of course, nothing is as simple as it should be.

After about three weeks, I managed to get packets flowing from host to device and started to debug the USB setup stuff. All of my difficulties here relate to the slightly brain damaged way USB signals the end of Setup data flowing from device to host (IN data). This is done by sending a packet which is strictly less than the maximum size advertised by the device, in my case 32 bytes (that’s all the CC1111 can handle). If the data to be sent is a multiple of this max size, you send a zero-length packet afterwards.
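The end-of-data rule is easier to see as code. This is a small C sketch of the rule exactly as described above, not the code from my USB stack; `in_packet_sizes` and `MAX_PACKET` are names invented for the example, and it deliberately ignores how many bytes the host actually asked for.

```c
#include <stddef.h>

#define MAX_PACKET 32u  /* maximum packet size mentioned above for the CC1111 */

/* Hypothetical helper: store the size of each IN packet for a reply of
 * `total` bytes into sizes[] (at most `cap` entries are written) and
 * return the number of packets the reply needs.
 *
 * The last packet is always shorter than MAX_PACKET, so a reply whose
 * length is a multiple of MAX_PACKET (including an empty reply) ends
 * with a zero-length packet. */
static size_t in_packet_sizes(size_t total, size_t *sizes, size_t cap)
{
    size_t n = 0;

    for (;;) {
        size_t len = total < MAX_PACKET ? total : MAX_PACKET;

        if (n < cap)
            sizes[n] = len;
        n++;
        total -= len;
        if (len < MAX_PACKET)
            break;  /* short packet: this is what signals the end */
    }
    return n;
}
```

An 18-byte reply goes out as a single 18-byte packet, a 40-byte reply as 32 + 8, and a 64-byte reply as 32 + 32 + 0. The key point for the bug described next is that the zero-length packet is sent at most once, at the end of a reply.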

The first bug was that my code simply delivered a zero length packet every time it was done sending data, in response to the ‘a packet has been delivered’ interrupt. It should have been obvious, but this ended up flooding the USB link with zero length packets. Once that was fixed, I had the first few setup packets working correctly. Next, I had to fix the code that would chunk up larger setup replies into multiple packets. With that done, I had the initial setup working correctly and the device appeared in the lsusb output.

While debugging this, I had noticed that my ISR was getting called ‘a lot’, and I found out that none of the USB interrupt status bits were on. I guessed that this meant the master USB interrupt bit was stuck on. Which confused me, as the other interrupt bits I’d played with on the CC1111 all automatically cleared themselves when the ISR was invoked. Hurray for inconsistent hardware, but it turns out that this is not true for all of the interrupts, only some of them. With the interrupts turned back off, I’ve now got a device which correctly responds to the USB setup and then sits idle:

idVendor           0xfffe 
idProduct          0x000a 
bcdDevice            1.00
iManufacturer           1 altusmetrum.org
iProduct                2 TeleMetrum
iSerial                 3 tele-0

The next task is to figure out how to send NAK packets back when the host asks for data and I have none to send. That may make it possible to send data back and forth, at which point I can write a simple command interpreter for the CC1111 so we can poke at it via USB.

All of this code is in my freertos git repository; I’ll see if the freertos or lpcusb people are interested in the code once it’s working.

Posted Wed Jan 28 23:49:00 2009

My daughter bought me a BeeLine TX radio direction finding beacon for Christmas. The plan is to mount it inside the payload bay of various rockets so that I can find them after launch. This uses a PIC 16F688 processor and a CC1050 transmitter and sends FM-encoded beeps and Morse code ident strings. It’s tiny and runs for a long time off a Li-Po battery.

The BeeLine TX came pre-programmed to transmit at 433.920MHz and ident as ‘KD7SQG’. I wanted to move it to 440.700MHz and ident as ‘KD7SQG ROCKET’, but the configuration utility provided was Windows-only.

Fortunately, Greg Clark, the person behind Big Red Bee, released the source code for the firmware under the GPLv2 and provided full schematics as well.

I was able to read through the code and construct a simplistic programming utility, also released under the GPLv2 and available via git as beelinetx. It doesn’t do much yet, just allows the configuration of the frequency and transmitted message string.

I’d like to thank Greg for building the BeeLine TX, making the sources and schematics available and also for answering questions over email about some subtle aspects of the frequency calibration.

Now to wait for the weather to clear and go take it flying.

Posted Fri Dec 26 16:43:40 2008

What’s up this week

The last week has certainly been entertaining. We’re quickly merging a pile of new code into the driver and trying to get everything building in one place so that people can play with stuff before we release.

Getting 2D on top of GEM

One of the big missing pieces last week was getting the 2D driver working with Pixmaps as GEM objects. This is critical as we move towards unified kernel memory management for rendering resources to allow us to use objects across multiple APIs. The most pressing need here is to enable the GLX_EXT_texture_from_pixmap extension in an efficient fashion.

So, what’s the plan then? Fairly simple; allocate GEM objects for every pixmap and then use GEM relocations to manage access to them. No need for the 2D driver to even know what’s bound to the GTT; it can treat every Pixmap exactly alike and let the kernel manage the low-level hardware details. Our experience with the 3D driver has been quite good; GEM is easy to use and reasonably efficient.

The initial thought was that we’d use EXA’s ability to forward pixmap creation back to the driver and have our driver call-back create the GEM object. However, in looking at that, it turns out to have a terrible (and incomplete) API. The driver has no say in the pixmap layout; it must use the EXA-enforced pixel organization. In a land of tiled pixmaps, that’s not OK. Further enquiry showed a wealth of other code which is useless in our uniform Pixmap environment. Damage tracking and enforced hardware synchronization are wasteful, performance-robbing activities.

Ok, so if EXA isn’t what we want, then what is? Well, I like the basic EXA acceleration plan — accelerate solid fills, copy area and the composite operation and leave everything else to software. In fact, the whole EXA drawing API is just fine, it’s just the wasteful EXA code that isn’t necessary.

UXA — the UMA Acceleration Architecture

Ok, so instead of hacking up EXA and trying to make it work for the GEM driver and existing drivers, I decided to just make it work for GEM on UMA hardware and see what it looked like. The hope is that we’ll find some way to either patch EXA or at least find a way to share the low-level rendering code between UXA and EXA. For now, UXA lives in the intel driver itself; once we figure out how we want the X server rendering infrastructure to work, we’ll merge whatever results back into the core server.

I started UXA by just copying the existing EXA code and running an edit script to change all of the names. Then, I went through the code and removed everything dealing with pixmap migration, damage computation or explicit global hardware synchronization. The only synchronization primitive left is the prepare_access/finish_access pair which signals the start and end of software drawing. The hardware driver is expected to deal with all other synchronization issues itself.
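As a rough illustration of how that one remaining primitive gets used, here is a C sketch of a software fallback bracketed by the prepare_access/finish_access pair. Only those two names come from the description above; the `struct pixmap` and `struct driver` types, `sw_fill` and the stub hooks are hypothetical stand-ins and are not the real UXA structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pixmap type, for illustration only. */
struct pixmap {
    uint32_t *pixels;       /* safe for the CPU only between prepare/finish */
    size_t width, height;
    int cpu_access_depth;   /* bookkeeping used by the stub hooks below */
};

/* Hypothetical driver hooks, named after the pair described above. */
struct driver {
    bool (*prepare_access)(struct pixmap *p);  /* wait for the GPU, map for the CPU */
    void (*finish_access)(struct pixmap *p);   /* hand the pixmap back to the GPU */
};

/* Software fallback: all CPU drawing sits between prepare and finish. */
static bool sw_fill(const struct driver *drv, struct pixmap *p, uint32_t color)
{
    if (!drv->prepare_access(p))
        return false;
    for (size_t i = 0; i < p->width * p->height; i++)
        p->pixels[i] = color;
    drv->finish_access(p);
    return true;
}

/* Stub hooks that only track nesting; a real driver would synchronize
 * with the hardware here. */
static bool stub_prepare(struct pixmap *p)
{
    p->cpu_access_depth++;
    return true;
}

static void stub_finish(struct pixmap *p)
{
    p->cpu_access_depth--;
}
```

The point of the shape is that the fallback never asks for a global idle; it only asks the driver to make this one pixmap safe to touch, and tells it when it is done.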

Oddly, GEM does rendering synchronization automatically when rendering with the hardware, and provides simple primitives to provide for software fallbacks. The key here is that we never need to idle the whole chip, we only need to wait for it to finish working on whatever objects are currently being drawn with. The goal is to avoid artificial serialization.

The result is less than 5000 lines of code, as compared to EXA which has about 7500 lines.

Yeah, but does it work?

The short answer is “Yes, it works”. The longer answer is “Yes, with limitations”. The biggest limitation right now is that GEM objects can only be mapped directly by the CPU. For lots of operations, this is exactly what you want; a fully cached view into the objects as it offers full performance for CPU-bound rendering operations.

However, it has one performance problem and one functional limitation.

The performance problem is that using the CPU cache with these objects means flushing the CPU cache whenever switching between CPU and GPU rendering. CPU cache flushing is horribly expensive, enough so that it’s often far better to take the huge performance penalty of using un-cached reads if the number of reads is small.

Yes, we could create write-combining PTEs for this direct mapping, but constructing write-combining PTEs is also really expensive as that involves flushing those PTEs from every CPU TLB, which requires an inter-processor interrupt. Of course, you can’t just create a write-combining PTE, you have to make sure that the page it maps is not in any CPU cache, so you have to perform a CPU cache flush as well.

Someday maybe this won’t be true; there are plans afoot within the Linux kernel to make this reasonably efficient. Perhaps this will happen before we get our flying cars.

So, it’s a performance problem; we can deal with that.

Tiled Surfaces

What we can’t deal with is how tiled surfaces work under a CPU map. A normal surface has an entire scanline mapped to a linear section of memory. This places vertically adjacent pixels a fair distance apart in memory. Drawing a vertical line means touching two different cache lines and two different pages. Even a large cache and TLB will not help much if you draw tall objects. Tiled surfaces arrange for nearby screen pixels to be nearby in memory, usually by constructing the surface from a set of rectangular page-sized tiles. Vertically adjacent pixels will then be in the same page, and can even be in the same cache line in some cases.

The performance benefits for tiled surfaces are obvious; fewer cache and TLB misses. The cost to the hardware is fairly small; just some gates to stir addresses around when fetching and storing pixels. However, the cost to software is fairly large; computing the address of a pixel now involves some fairly ugly computation.

We already managed to make Mesa deal with tiled surfaces. That was fairly easy as Mesa has a single span-based pixel fetch and store architecture. Write new span accessing functions and the rest of the sw rendering code just works.

Fixing the X server 2D software rendering code is another matter entirely — there’s a lot of it, and it all wants to touch memory in a linear fashion. Aaron Plattner from nVidia actually did go and whack fb to make it work; every pixel fetch or store goes through a function call which is passed the nominal linear address of the pixel. These accessor/setter functions then munge that address into the actual tiled address. However, that’s yet another huge performance impact for software rendering.

Hardware De-Tiling

A better solution is to just use the hardware. When a tiled surface is bound to the GTT, it is visible to everyone using linear addresses; those addresses are swizzled in the hardware and head out to memory in tiled form. There’s no performance benefit for the CPU, as its TLBs and caches all see the linear address, but it doesn’t have to deal in a non-linear space.

The second benefit of the GTT map is that it lives under a write-combining MTRR, so all accesses to memory are write-combining and not write-back. This eliminates all of the CPU cache coherence issues and leaves us back with the old performance that we know and love — fast writes and really slow reads, but no penalty for switching rapidly between GPU and CPU.

What’s Next?

So, the basic Pixmaps-in-GEM code is up and running in the gem-pixmap branch of my driver repository, git://people.freedesktop.org/~keithp/xf86-video-intel. The next step will be to integrate Carl Worth’s 965 render changes which place all of the temporary data that it uses into GEM objects as well. That will finish the DRI2 enabling work and allow us to provide zero-copy texture-from-pixmap support.

However, before that can really go mainstream, we need to get the GTT object mapping to fix tiled surface support and get back some performance lost to the CPU cache flushing. We’ll see if Kristian is ready with DRI2 tomorrow; if not, I’ll probably spend the day figuring out enough additional parts of the Linux MM code to get my GTT maps working.

Posted Tue Aug 5 23:28:32 2008

Ok, so I didn’t get a lot of time for coding last week. And, this week there’s OSCON, so coding time will be short again. I figured I should spend some time writing up a brief report about where X output is at today.

Output Hotplug

I think this stuff is fairly solid these days, although we don’t have much in the way of auto-detection of monitor connect/disconnect. There are two reasons here:

  1. The hardware notifies the operating system via an interrupt. Given mode setting code in user space, dealing with interrupts is a huge pain and hence hasn’t been hooked up yet (see below).

  2. Analog outputs (VGA, TV) do detection using impedance changes in the output signal path. This means we have to keep them active if we want to detect a connection. That takes a lot of power (about 1W to light up the VGA output without a monitor connected). What we could do is detect when a monitor was unplugged; that’s free.

There are a few other random improvements that are coming soon, like CEA additions to the EDID parsing code. These are additional data blocks that follow the standard EDID data and are used for ‘consumer electronics’ devices. Supporting these should make more HDMI monitors ‘just work’.
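As a small illustration of where those extra blocks live, the sketch below counts CEA extension blocks in a raw EDID blob. It relies on the standard layout: 128-byte blocks, an extension count in byte 126 of the base block, and a tag byte of 0x02 at the start of each CEA extension. The function names are made up for this example, and it does not verify checksums or parse the data blocks themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EDID_BLOCK_SIZE     128u
#define EDID_EXT_COUNT_BYTE 126u  /* number of extension blocks that follow */
#define EDID_EXT_TAG_CEA    0x02u /* first byte of a CEA extension block */

/* Count CEA extension blocks in an EDID blob; truncated blobs are tolerated. */
unsigned edid_count_cea_blocks(const uint8_t *edid, size_t len)
{
    unsigned count = 0;
    size_t i, ext;

    assert(edid != NULL);
    if (len < EDID_BLOCK_SIZE)
        return 0;
    ext = edid[EDID_EXT_COUNT_BYTE];
    for (i = 1; i <= ext && (i + 1) * EDID_BLOCK_SIZE <= len; i++)
        if (edid[i * EDID_BLOCK_SIZE] == EDID_EXT_TAG_CEA)
            count++;
    return count;
}

/* Self-check: a base block followed by one CEA-tagged extension block. */
unsigned edid_demo(void)
{
    uint8_t blob[2 * EDID_BLOCK_SIZE] = { 0 };

    blob[EDID_EXT_COUNT_BYTE] = 1;
    blob[EDID_BLOCK_SIZE] = EDID_EXT_TAG_CEA;
    return edid_count_cea_blocks(blob, sizeof blob);
}
```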

Initial Mode Selection

Detecting connected monitors is fine, but one thing we haven’t really solved is what to do when you have more than one connected when the server starts. My initial code would pick one ‘primary’ monitor, light that up at its preferred size and then pick modes for the other monitors which were as close as possible to the primary monitor size without being larger. Obviously, I liked that as it meant my laptop always came up looking correct on the LVDS and my external VGA would show most of the screen.

However, this was reported to confuse a lot of users. I can imagine that starting the X server with one of the outputs connected but not turned on would make for some ‘interesting’ support calls. So, now the X server picks a mode which all outputs can support and uses that everywhere. Sadly, this means that my laptop panel gets some random scaled mode (usually 1024x768) which looks quite awful.
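The "mode which all outputs can support" rule can be sketched roughly as follows. This is a simplified illustration, not the X server's code: the struct and function names are hypothetical, modes are compared by size only (refresh rates are ignored), and "best" is assumed to mean the largest pixel count.

```c
#include <assert.h>
#include <stddef.h>

struct mode { int width, height; };

struct output {
    const struct mode *modes;
    size_t num_modes;
};

static int output_has_mode(const struct output *o, struct mode m)
{
    size_t i;

    for (i = 0; i < o->num_modes; i++)
        if (o->modes[i].width == m.width && o->modes[i].height == m.height)
            return 1;
    return 0;
}

/* Returns 1 and stores the largest mode every output supports, or 0 if none. */
int pick_common_mode(const struct output *outs, size_t n, struct mode *best)
{
    size_t i, j;
    int found = 0;

    assert(n > 0);
    for (i = 0; i < outs[0].num_modes; i++) {
        struct mode m = outs[0].modes[i];

        for (j = 1; j < n; j++)
            if (!output_has_mode(&outs[j], m))
                break;
        if (j < n)
            continue;  /* some output lacks this mode */
        if (!found ||
            (long)m.width * m.height > (long)best->width * best->height) {
            *best = m;
            found = 1;
        }
    }
    return found;
}

/* Self-check: a 1280x800 laptop panel plus an external VGA monitor. */
int common_mode_demo_width(void)
{
    static const struct mode lvds[] = { {1280, 800}, {1024, 768}, {800, 600} };
    static const struct mode vga[] = { {1600, 1200}, {1024, 768}, {800, 600} };
    const struct output outs[] = { { lvds, 3 }, { vga, 3 } };
    struct mode best = { 0, 0 };

    return pick_common_mode(outs, 2, &best) ? best.width : -1;
}
```

In the demo, neither monitor's preferred size is shared, so both end up at 1024x768, which is the scaled-panel result complained about above.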

I think we need something better than either of these choices, but I’m not quite sure how it should work.

Kernel Mode Setting

A bunch of people, including Jesse Barnes and Dave Airlie, have been hacking to move the output configuration code into the kernel. This will solve lots of little problems, like how to display kernel panic messages, and how to deal with interrupts for output hotplug.

This code is up and running fairly well these days, but depends on a kernel memory manager to deal with frame buffers. The integration of GEM into the kernel is blocking this work, but I’m hopeful that this will be sorted out in the next couple of weeks.

GEM — the Graphics Execution Manager

Work here was stalled for a few weeks while we sorted out memory channel interleaving issues. Now things are moving again, and we’re working on getting it stable enough to merge into master. That means fixing a few more critical bugs that the Intel QA team has identified.

One of these bugs is that our GL conformance tests weren’t working right; that turned out to be caused by tests reading back data from the frame buffer one pixel at a time. Our read-back path passed through the GEM memory domain code to pull objects back from GTT space to CPU space. That meant flushing the front, back and depth buffers from the CPU cache. With each of those at 16MB, reading a single pixel took long enough that the tests would time out. Increasing the timeouts to ‘way too long’ is making them run, but tests which would complete in a few hours are now taking days.

We’ve got two different plans for fixing the read-back path:

  1. Use pread to access precisely the data we need. This would involve flushing a single cache line for the tests above.

  2. Map the back buffer through the GTT. This would eliminate the need to clflush anything as the GTT mappings are write combining and so reads bypass the cache.

Eric is working on the former, and I’m working on the latter. More news later (this week?) when we see which one wins.
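To put a number on why single-pixel reads hurt so much, here is a back-of-the-envelope sketch that counts cache lines. It assumes 64-byte cache lines and the three 16MB buffers mentioned above; the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* assumed cache line size */

/* Number of cache lines overlapped by the byte range [offset, offset + len). */
size_t cache_lines_spanned(size_t offset, size_t len)
{
    size_t first, last;

    if (len == 0)
        return 0;
    first = offset / CACHE_LINE_BYTES;
    last = (offset + len - 1) / CACHE_LINE_BYTES;
    return last - first + 1;
}

/* Lines flushed when a whole-object flush hits three buffers of this size. */
size_t full_flush_lines(size_t buffer_bytes)
{
    assert(buffer_bytes % CACHE_LINE_BYTES == 0);
    return 3 * cache_lines_spanned(0, buffer_bytes);
}
```

Under those assumptions a full flush walks 786432 cache lines to read one four-byte pixel, where a precise pread-style flush would touch a single line.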

Composite Acceleration

With Owen Taylor’s change to the glyph management code in the server, Eric and Carl were able to change the driver to batch multiple glyph drawing operations into a command buffer. Once Carl had this working, we went from 13000 glyphs/sec to 103000 glyphs/sec. Obviously we’re hoping for even larger improvements as a pure software solution is well over 1 million glyphs/sec. Even so, 103000 glyphs/sec is enough to make my desktop vastly more usable, and using the software path means losing a lot of other useful acceleration.

DRI2 — Redirected Direct Rendering

Right now, direct rendered GL applications (which is the fastest way we can do GL at present) get drawn to a giant screen-sized back buffer and then copied from there to the screen at swap buffers time. Because everyone shares the same back buffer, you get to clip your drawing as if you were drawing directly to the screen. While this normally doesn’t matter much (aside from some performance costs associated with lots of clip rectangles), when you’re running a compositing manager (like compiz), the 3D applications end up ignoring the per-window offscreen pixmap and spam their output directly to the real frame buffer.

DRI2, written by Kristian Høgsberg, solves this by changing how direct rendering works and giving everyone a private back-buffer to draw to. Now, at buffer swap time, that private back-buffer can be copied to the window’s pixmap and compiz is happy.

This work has been around for a few months, but depends on a TTM-based memory manager. That dependency isn’t very strong, and krh has promised to fix it shortly. Once that’s done, getting the GEM driver to support DRI2 won’t take long, and we’ll have our fully composited desktop running. With luck, that’ll happen before September.

Final Words

As you can see, we’re nearing the end of our long X output rework saga, with most of the pieces falling into place in the next month or two.

Posted Mon Jul 21 18:25:43 2008

With our channel-interleaving mostly sorted out, Eric and I spent a short time figuring out what to do about it all. The important thing we learned is that the hardware has two modes: linear and tiled. The CPU always accesses memory in linear mode, while the GPU can either use linear or tiled mode. What is important here is that ‘linear’ mode uses the same interleaving, whether from the GPU or CPU. So, we can only get into trouble if we use tiled mode from the GPU and linear mode from the CPU.

We came up with a fairly simple plan to resolve this issue:

  • Figure out how to automatically detect the precise interleaving configuration of the system. There is an MCH BAR holding the data, and so far most (but not all) machines appear to report what we expect.

  • Add an ioctl to propose tiling to the kernel, letting the kernel reject the proposal. This gives us a hook to pass back whatever channel interleaving is necessary. When the memory configuration cannot be supported in tiling mode, the call fails and user mode reverts to linear allocations. This will reduce performance, so we want to do it as little as possible.

    A side benefit here is that we get a reliable way of saving tiling data for each buffer — until now, we’ve stored that in the sarea for front/back/depth, but hadn’t any general plan.

  • For now, fail tiling requests on hardware that uses bit 17 in linear mode, or hardware with an “L”-shaped memory configuration. Of course, it would be nice if we could find an “L”-shaped machine to test with.

  • Add return data to this ioctl to report what address bits to stir into the channel select (bit 6) value. Now user-space needn’t change when the hardware configuration does.

  • Provide a way to map buffers through the GTT using the fence registers to de-tile them. This will make tiled buffers appear linear to the CPU. This is required to let us tile X pixmaps as we would otherwise need to use wfb instead of fb for software rendering. We’ve wanted to do this for a while; it’s nice to have the option of using GTT mappings in any case — writes are less hassle as the WC mapping doesn’t require explicit cache flushing. Giving applications the option seems like the best way forward. Of course, when the GTT fills, these mappings will go away. When the application touches a page, it will fault the object back into the GTT.

  • Maybe someday provide enough information back to user space to deal with page-level interleaving information. For bit-17 configurations, we’d need to report back the physical bit-17 information for each page. For “L”-shaped configurations, we’d have to return back whether each page was interleaved at all. Attempting to make this work with paging seems really hard though — you’d have to create some kind of atomic section where user-space would read the interleave information, compute pixel addresses and access the data. Icky.
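The propose-and-fall-back part of this plan might look something like the sketch below. Everything here is hypothetical: the struct layouts, the names, and the in-process `propose_tiling` function standing in for the kernel side are all invented for illustration, and the way the swizzle bits are combined is only one reading of the plan (the GPU's X and Y tiling bits, plus bit 11 when the CPU-side controller stirs that in).

```c
#include <assert.h>
#include <stdint.h>

enum tiling_mode { TILING_NONE, TILING_X, TILING_Y };
enum cpu_swizzle { CPU_SWIZZLE_NONE, CPU_SWIZZLE_BIT11, CPU_SWIZZLE_BIT17 };

/* Hypothetical summary of what the kernel detected about the memory setup. */
struct mem_config {
    int gpu_swizzles;              /* GPU stirs address bits into bit 6 */
    enum cpu_swizzle cpu_swizzle;  /* what the CPU-side controller stirs in */
    int l_shaped;                  /* only part of memory is interleaved */
};

/* Hypothetical ioctl argument: tiling mode in, swizzle bits out. */
struct tiling_request {
    enum tiling_mode mode;   /* proposed by user space */
    uint32_t bit6_xor_mask;  /* returned: address bits to xor into bit 6 */
};

/* Stand-in for the kernel side: 0 on success, -1 if tiling is refused. */
int propose_tiling(const struct mem_config *cfg, struct tiling_request *req)
{
    assert(cfg && req);
    req->bit6_xor_mask = 0;
    if (req->mode == TILING_NONE)
        return 0;
    if (cfg->cpu_swizzle == CPU_SWIZZLE_BIT17 || cfg->l_shaped)
        return -1;  /* unsupported for now */
    if (cfg->gpu_swizzles)
        req->bit6_xor_mask |= (req->mode == TILING_X)
                            ? (1u << 9) | (1u << 10)
                            : (1u << 9);
    if (cfg->cpu_swizzle == CPU_SWIZZLE_BIT11)
        req->bit6_xor_mask |= 1u << 11;
    return 0;
}

/* User-space side: ask for tiling, fall back to linear if refused. */
enum tiling_mode choose_tiling(const struct mem_config *cfg,
                               enum tiling_mode want, uint32_t *mask)
{
    struct tiling_request req = { want, 0 };

    if (propose_tiling(cfg, &req) != 0) {
        req.mode = TILING_NONE;
        req.bit6_xor_mask = 0;
    }
    *mask = req.bit6_xor_mask;
    return req.mode;
}
```

The point of returning the mask is that user space just applies whatever bits it is handed, so it needn't change when the hardware configuration does.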

On Wednesday and Thursday, Jesse Barnes, Zou Nan Hai and I got to attend the 2008 Intel Gfxcon. Put on by Intel GDG (no, I don’t know what that means, but it’s the Intel integrated graphics group), it was held in Folsom, CA. The trip down on the Intel shuttle was uneventful, but on arrival I found the 42° air full of smoke from wild fires. My plan to bicycle from the hotel to Intel suddenly seemed like it would be a lot less fun.

The conference was huge fun. As usual, meeting people in person is always better than via email or even on a teleconference. After working at Intel for nearly three years now, I’m starting to feel a bit more a part of the organization and less of an outsider. Linux continues to gain mindshare within the company as it gains visibility in our customers’ products.

I got to present our current GEM work to a fair-sized crowd, including several people from our Windows driver development team. I was interested to hear what they thought about the architecture and was pleased to learn that a lot of what we’re doing is similar to how the Vista driver works. While we can’t share source code, it is at least nice that we can share ideas about how best to drive the hardware at the lowest levels of the system.

That evening, Jesse, Nan Hai and I managed to find decent steak-and-potatoes, but our attempt to locate gelato ended in near-failure: Google led us to a mini-mall in a neighboring town. Upon failing to locate the expected restaurant, we enquired at an Italian place, where they first told us that the gelato place had closed five years earlier, and then convinced us that they could provide gelato. Served freezer-burnt ice cream, we quickly left for our respective lodgings.

The next evening we found acceptable California-style Mexican food. As usual, portions were large enough to push any thought of dessert from our minds. We lumbered back to Nan Hai’s hotel room and hacked for several hours, although the free wifi there left a lot to be desired. Then, we chatted about how to get rid of tearing in textured video.

For vblank synchronized textured video, I’m hoping we’ll be able to queue the update to the kernel and have it perform the necessary blts. This would mean interrupting the current command stream and switching over to a separate stream. It probably means using separate hardware contexts, which would be a good thing in any case as we could eliminate the per-batch-buffer configuration that the 3D driver currently performs. Work on this will need to start with multiple hardware context support, then move on to interrupting the ring and then figuring out how to manage the blts along with clip list changes etc.

Posted Mon Jul 14 22:19:41 2008

Eric and I have been busy hacking away at GEM, the Graphics Execution Manager. GEM is a memory and command ring manager for Intel integrated graphics. GEM itself is working out quite well; we haven’t found any terrible surprises in that area. I thought I’d take a few minutes and write up some of the things we’ve discovered and what we’re doing.

Kernel Patches

First off, I’ve published the GEM kernel patches so that anyone can give this code a try. The first patch was so trivial (exporting shmem_file_setup) that I didn’t bother, but the patch to export shmem_getpage is quite a bit longer as it also has to export an associated enum.

There’s also a patch for the agp driver which re-writes the GATT on resume. That one isn’t GEM-specific, but as we assume the GATT is preserved across VT switches, it’s necessary to make GEM survive suspend/resume. It has also been accepted into -mm and should land upstream sometime soon.

Writing data to the GPU

One of the central ideas in GEM is the recognition that cache management plays a huge role in moving data between the CPU and GPU. Because the CPU and GPU are not cache coherent, applications must either use uncached writes from the CPU or explicitly flush the CPU cache to get data transferred. There are several different ways of doing uncached writes:

  1. Uncached page table entries. The requirement here is that all mappings to this page must be uncached, so you can’t simply create an uncached mapping when you want writes to be flushed out. The page must be flushed from all CPU caches at allocation time. Worse — the TLBs of all CPUs must be synchronized so that everyone agrees on the caching mode for every page. While flushing the CPU caches isn’t terrible as we can use clflush, the TLB flush requires an IPI, making the allocation of uncached pages very expensive.

  2. Writing through a suitable MTRR-mapped address. This is how we’ve always done writes to the GPU in the past — the graphics aperture is covered by write-combining MTRR entries so that writes will be sent to memory. This requires that the destination pages be mapped through the GATT so that they appear under the graphics aperture, which (again) requires that the page contents be flushed from CPU caches. However, TLB entries needn’t be flushed as we aren’t changing any PTEs.

  3. Non-temporal stores (movnti, movntq, etc). The kernel already uses these when copying data around to avoid filling the cache with useless data. However, non-temporal stores don’t actually guarantee that data won’t end up sitting in the cache. In particular, if the destination is already sitting in a cache line, then the store will not force that cache line to be flushed. So, while this avoids filling the cache with a lot of additional data, it doesn’t provide the necessary guarantee that data will be visible to the GPU.

  4. Using clflush. This makes sure all CPU caches are flushed and the data written to memory. clflush isn’t cheap, but as it uses the cache coherence protocols, it need only run on one CPU. Combine this with non-temporal stores and you get a fairly cache-friendly mechanism without the cost of uncached page allocation.
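For the clflush approach, a user-space sketch of "write, then flush just the lines you touched" could look like this. It is not the kernel code: it uses the SSE2 intrinsics `_mm_clflush` and `_mm_mfence` from `<emmintrin.h>` where available (and compiles to a plain copy elsewhere), uses an ordinary `memcpy` rather than non-temporal stores, and assumes 64-byte cache lines. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

#define CACHE_LINE_BYTES 64u  /* assumed cache line size */

/* Flush every cache line overlapping [p, p + len); returns lines visited. */
size_t flush_range(const void *p, size_t len)
{
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t end = (uintptr_t)p + len;
    size_t lines = 0;

    if (len == 0)
        return 0;
    for (; addr < end; addr += CACHE_LINE_BYTES, lines++) {
#if HAVE_CLFLUSH
        _mm_clflush((const void *)addr);
#endif
    }
#if HAVE_CLFLUSH
    _mm_mfence();  /* order the flushes before whatever comes next */
#endif
    return lines;
}

/* Copy into a destination and flush only the lines that were written. */
size_t write_and_flush(void *dst, const void *src, size_t len)
{
    assert(dst && src);
    memcpy(dst, src, len);
    return flush_range(dst, len);
}
```

The flush count scales with the bytes actually written rather than with the size of the object, which is the property that matters for partially filled buffers.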

If the kernel offered a cheap way of allocating uncached (or write combining) pages (presumably by constructing a pool of pages ready for uncached use), that might be interesting. However, uncached writes also mean uncached reads, and sometimes we do read data back from the GPU. So, we’re ignoring this option at present.

Writing data through a write-combining MTRR has always worked reasonably well; there were older processors for which WC writes were slower than WB, but that (fortunately) is no longer true. The big drawback in using the MTRR is that we must allocate a portion of the limited GATT for objects that we want to access in this way. If we want to swap these pages, or need GATT space for other objects we would have to remove them from the GATT. The performance issue here (again) is that reads through a WC mapping are uncached, which means dramatically lower performance.

So, what does all of this mean in the GEM context?

First off, we want to try and treat CPU->GPU data transfers as I/O instead of memory-mapping. This means knowing precisely what data are being moved and when that happens. If we map objects with caching enabled to avoid GATT fun and improve read performance, then when those objects are moved back to the GPU, we must assume that they are entirely dirtied and flush the whole object. Treating this as I/O means having the kernel do all of the writes, which allows all kinds of flexibility on mapping.

Secondly, for operations that can’t easily be treated as I/O, it means making explicit choices about where to map objects. When reading an object from the CPU (as when using texture data), we certainly don’t want to use an uncached mapping — if that object isn’t written by the CPU, then we needn’t flush when switching back to the GPU either. However, for rendering targets as large as the frame buffer, mapping them cached means performing an enormous clflush sequence when moving back to the GPU, so we probably need to make the GATT-based WC mappings work. Currently, GEM doesn’t manage this — all objects are mapped cached, so software fallbacks end up doing a lot of cache flushing.

Using the I/O model to write data from user space into buffers for the GPU leaves us with some flexibility in the kernel implementation. We’ve tried two different mechanisms and I’m working on a third:

  1. Use the existing pwrite code from the shmem file system. Follow that with calls to clflush when the object is mapped to the GPU. This works out very well when the batch buffers are full, but partially filled buffers end up causing unnecessary clflush calls. Also, the clflush requires an extra kmap_atomic/kunmap_atomic pair.

  2. Map the object to the GATT and then use kmap_atomic_prot_pfn to map pages transiently into kernel space. This gives us WC write performance, eliminating any need to use clflush. However, it abuses kmap_atomic_pfn, a function which is only really supposed to be given physical memory pages. On kernels with CONFIG_HIGHMEM set, it works out fine, but without that, you get a garbage PTE. Performance here is quite a bit better, eliminating flushing from profiles.

  3. Hand-code the pwrite function to map the pages, copy the data and flush the cache all in one step. I’m hopeful that this will end up as fast as the GATT-based scheme, but avoid the abuse of the kmap_atomic_pfn function.

The first scheme exposed the flushing as an expensive operation; profiles for typical games would have flushing taking 5-7% of the CPU. The second scheme eliminated that, raising performance and lowering CPU usage. We’ll see if the third scheme is successful; if not, we’ll have to lobby the kernel developers to give us a supported way of transiently mapping I/O devices from kernel space.

Tiling and memory channels

Mapping graphical objects in a linear frame buffer where the data for each scanline is neatly arranged together in memory is an obvious representation for the data; it makes constructing scan-out hardware easy, and makes writing software rendering code easier as well. Unfortunately, graphical objects generally span adjacent portions of many scanlines. Accessing memory in this order generally runs counter to memory architectures; a vertical line will end up writing just one pixel per cache line, and often just one pixel per page. You end up spending a huge amount of time reading/writing cache lines and refilling TLB entries.

The usual solution to this is to have a single page hold pixels for multiple scanlines in the same region of the screen. Tiling the screen with these blocks of pixels provides dramatic performance improvements (we see about a 50% performance improvement from tiling the back buffer). Intel hardware supports two different tiling modes. With ‘X’ tiling, each page forms a rectangle that is 512 bytes wide by 8 scanlines high. ‘Y’ tiles are 128 bytes wide by 32 scanlines high.
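A quick way to see the TLB side of this is to count how many distinct pages a one-pixel-wide vertical line touches. The sketch below assumes 4096-byte pages, a column that stays inside one tile column, and the tile heights given above (8 rows for X, 32 for Y); the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u

/* Distinct pages touched by a column of h rows at byte column xb, linear layout. */
size_t pages_touched_linear(size_t xb, size_t y0, size_t h, size_t pitch)
{
    size_t y, pages = 0, prev = (size_t)-1;

    for (y = y0; y < y0 + h; y++) {
        size_t page = (y * pitch + xb) / PAGE_BYTES;

        if (page != prev) {  /* offsets only grow, so a change means a new page */
            pages++;
            prev = page;
        }
    }
    return pages;
}

/* Same column in a tiled layout whose page-sized tiles are tile_h rows tall. */
size_t pages_touched_tiled(size_t y0, size_t h, size_t tile_h)
{
    assert(tile_h > 0);
    if (h == 0)
        return 0;
    return (y0 + h - 1) / tile_h - y0 / tile_h + 1;
}
```

For a 64-row line on a surface with an 8192-byte pitch, that works out to 64 pages when linear, 8 with X tiles and 2 with Y tiles.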

A separate, but related issue is dealing with multiple memory channels. To see maximum memory bandwidth, the system needs to interleave access between memory channels. The memory system is arranged so that successive cache lines come from alternate memory channels, which means that address bit 6 ends up being the ‘channel select’ bit. This is related, because tiled graphics breaks the assumption about sequential access — walk down a tiled buffer and you would hit the same memory channel each time.

To fix this, the hardware actually modifies address bit 6 using other portions of the address. For X tiling, it xors in bits 9 and 10 of the address when computing bit 6; this means that vertically adjacent pixels are always in alternate memory channels. Y tiling uses only bit 9, but the pixels are already stirred around in that format enough that this one bit suffices.
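In code, that bit-6 stirring is just a couple of shifts and xors. This is a minimal sketch of the operation as described in the text, with made-up function names; it operates on a byte offset within the tiled object.

```c
#include <assert.h>
#include <stdint.h>

/* Extract address bit n as 0 or 1. */
static uint32_t bit(uint32_t addr, unsigned n)
{
    assert(n < 32);
    return (addr >> n) & 1u;
}

/* X tiling: xor bits 9 and 10 into bit 6. */
uint32_t swizzle_x(uint32_t addr)
{
    return addr ^ ((bit(addr, 9) ^ bit(addr, 10)) << 6);
}

/* Y tiling: xor bit 9 into bit 6. */
uint32_t swizzle_y(uint32_t addr)
{
    return addr ^ (bit(addr, 9) << 6);
}
```

Because bits 9 and 10 are left alone, applying the same swizzle twice gets you back to the original address, so one function serves for both swizzling and unswizzling.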

The CPU doesn’t share in this particular adventure, so when it accesses these objects directly (not through the GATT), it sees things mixed around.

Of course, sometimes the hardware doesn’t bother swizzling bit 6 like this; if you have only a single memory channel, it doesn’t help. But, neither does it hurt, so some hardware will swizzle even in this case. We haven’t found any registers that tell us when the swizzling is going on.

Not to be left out of the bit 6 fun, the CPU-facing memory controller also improves interleaving by mixing bits up. It can either stir in bit 11 or bit (uh-oh) 17. At least this behavior is documented and exposed in a register the CPU can read. Bit 11 is workable; we just stir that into the mix when computing bit 6 to unswizzle before the memory controller re-swizzles and things work out fine. Bit 17 is problematic. It’s not a virtual address bit, it’s a physical address bit. Which means that the physical memory layout of data stored in RAM depends on where in memory a page is sitting. Move the page around so that bit-17 changes and the data will flip channels.
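The difference between the bit-11 and bit-17 cases can be seen in a few lines. This sketch models the channel select as bit 6 xor'd with the stirred-in bit, as described above; the enum and function names are hypothetical, and pages are assumed to be 4096 bytes.

```c
#include <assert.h>
#include <stdint.h>

enum cpu_swizzle { CPU_SWIZZLE_NONE, CPU_SWIZZLE_BIT11, CPU_SWIZZLE_BIT17 };

/* Memory channel (0 or 1) the controller picks for a physical address. */
unsigned cpu_channel(uint64_t phys, enum cpu_swizzle mode)
{
    unsigned ch = (unsigned)(phys >> 6) & 1u;

    switch (mode) {
    case CPU_SWIZZLE_BIT11:
        ch ^= (unsigned)(phys >> 11) & 1u;
        break;
    case CPU_SWIZZLE_BIT17:
        ch ^= (unsigned)(phys >> 17) & 1u;
        break;
    case CPU_SWIZZLE_NONE:
        break;
    }
    return ch;
}

/* 1 if relocating a 4096-byte page changes the channel of its cache lines. */
int move_flips_channel(uint64_t old_page, uint64_t new_page,
                       enum cpu_swizzle mode)
{
    assert((old_page & 0xfff) == 0 && (new_page & 0xfff) == 0);
    return cpu_channel(old_page, mode) != cpu_channel(new_page, mode);
}
```

Bit 11 lies inside the page offset, so it travels with the data when a page moves; bit 17 belongs to the page frame, so the same data can land in the other channel after a move.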

Of course, as the GPU does its own bit-6 swizzling for tiled objects, it doesn’t bother with the CPU memory swizzle. Which means that tiled data written by the CPU and read by the GPU will appear to flip around, depending on where in physical memory that data resides.

All of these bit-6 adventures are holding up GEM development at present; software fallbacks reading or writing to tiled objects are quite broken on many machines, and the way they’re broken depends on how the CPU and GPU memory controllers are set up.

We already have code that does the GPU swizzling and only need to add auto-detection to know when to use it. But the bit-17 CPU swizzling may cause some significant problems. First off, we’d have to make all CPU access to tiled objects go through the GATT, hurting read performance and complicating our mapping code, since user mode doesn’t know anything about physical addresses and so couldn’t swizzle. Secondly, we would have to find some way to ensure that bit 17 of all tiled pages didn’t change across swap operations (as swapping will read and write through the CPU swizzle). That would mean either pinning tiled objects in memory (ouch), hacking up the kernel memory manager to add a very strange constraint on page allocation, or swizzling pages before mapping them to the GATT.

Posted Fri Jul 4 13:25:03 2008
