Core Rendering with Glamor

I’ve hacked up the intel driver to bypass all of the UXA paths when Glamor is enabled, so I’m currently running an X server that uses only Glamor for all rendering. There are still too many fallbacks, and performance for some operations is not what I’d like, but it’s entirely usable. It supports DRI3, so I even have GL applications running.

Core Rendering Status

I’ve continued to focus on getting the core X protocol rendering operations complete and correct; those remain a critical part of many X applications and are a poor match for GL. At this point, I’ve got accelerated versions of the basic spans functions, filled rectangles, text and copies.

GL and Scrolling

OpenGL has been on a many-year vendetta against one of the most common 2D accelerated operations — copying data within the same object, even when that operation overlaps itself. This used to be the most performance-critical operation in X; it was used for scrolling your terminal windows and when moving windows around on the screen.

Reviewing the OpenGL 3.x spec, Eric and I both read the glCopyPixels specification as clearly requiring correct semantics for overlapping copy operations — it says that the operation must be equivalent to reading the pixels and then writing the pixels. My CopyArea acceleration thus uses this path for the self-copy case. However, the ARB decided that having a well defined blt operation was too nice to the users, so the current 4.4 specification adds explicit language to assert that this is not well defined anymore (even in the face of the existing language which is pretty darn unambiguous).

I suspect we’ll end up creating an extension that offers what we need here; applications are unlikely to stop scrolling stuff around, and GPUs (at least Intel) will continue to do what we want. This is the kind of thing that makes GL maddening for 2D graphics — the GPU does what we want, and the GL doesn’t let us get at it.

For implementations not capable of supporting the required semantic, someone will presumably need to write code that creates a temporary copy of the data.
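As a sketch of what such a fallback might look like (names hypothetical, not glamor code): snapshot the source region into a temporary buffer before writing, so an overlapping copy behaves as if every read happened before any write:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: overlap-safe copy of 'count' pixels within one
 * buffer. The source region is read entirely into a temporary buffer
 * first, then written back, matching the read-then-write semantics the
 * old glCopyPixels language required even when src and dst overlap. */
static int
copy_within_buffer(uint32_t *pixels, size_t dst, size_t src, size_t count)
{
    uint32_t *tmp = malloc(count * sizeof *tmp);
    if (!tmp)
        return -1;
    memcpy(tmp, pixels + src, count * sizeof *tmp);   /* read phase */
    memcpy(pixels + dst, tmp, count * sizeof *tmp);   /* write phase */
    free(tmp);
    return 0;
}
```

A GL implementation would do the equivalent with a temporary texture or buffer object rather than CPU memory, but the ordering argument is the same.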

PBOs for fallbacks

For operations which Glamor can’t manage, we need to fall back to using a software solution. Direct-to-hardware acceleration architectures do this by simply mapping the underlying GPU object to the CPU. GL doesn’t provide this access, and it’s probably a good thing as such access needs to be carefully synchronized with GPU access, and attempting to access tiled GPU objects with the CPU requires either piles of CPU code to ‘de-tile’ accesses (à la wfb), or special hardware detilers (like the Intel GTT).

However, GL does provide a fairly nice abstraction called pixel buffer objects (PBOs) which work to speed access to GPU data from the CPU.

The fallback code allocates a PBO for each relevant X drawable, asks GL to copy pixels in, and then calls fb, with the drawable now referencing the temporary buffer. On the way back out, any potentially modified pixels are copied back through GL and the PBOs are freed.

This turns out to be dramatically faster than malloc’ing temporary buffers as it allows the GL to allocate memory that it likes, and for it to manage the data upload and buffer destruction asynchronously.

Because X pixmaps can contain many X windows (the root pixmap being the most obvious example), they are often significantly larger than the actual rendering target area. As an optimization, the code only copies data from the relevant area of the pixmap, saving considerable time as a result. There’s even an interface which further restricts that to a subset of the target drawable which the Composite function uses.
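The area restriction is just a rectangle intersection; a simplified sketch using a stand-in for the server’s BoxRec (names hypothetical):

```c
#include <stdbool.h>

/* Simplified stand-in for the server's BoxRec: x1/y1 inclusive,
 * x2/y2 exclusive. */
typedef struct { int x1, y1, x2, y2; } box_t;

/* Clip the PBO download area to the intersection of the pixmap bounds
 * and the target drawable's extent; returns false when nothing at all
 * needs to be copied back and forth. */
static bool
intersect_box(const box_t *a, const box_t *b, box_t *out)
{
    out->x1 = a->x1 > b->x1 ? a->x1 : b->x1;
    out->y1 = a->y1 > b->y1 ? a->y1 : b->y1;
    out->x2 = a->x2 < b->x2 ? a->x2 : b->x2;
    out->y2 = a->y2 < b->y2 ? a->y2 : b->y2;
    return out->x1 < out->x2 && out->y1 < out->y2;
}
```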

Using Scissoring for Clipping

The GL scissor operation provides a single clipping rectangle. X provides a list of rectangles to clip to. There are two obvious ways to perform clipping here — either perform all clipping in software, or hand each X clipping rectangle in turn to GL and re-execute the entire rendering operation for each rectangle.

You’d think that the former plan would be the obvious choice; clearly re-executing the entire rendering operation potentially many times is going to take a lot of time in the GPU.

However, the reality is that most X drawing occurs under a single clipping rectangle. Accelerating this common case by using the hardware clipper provides enough benefit that we definitely want to use it when it works. We could duplicate all of the rendering paths and perform CPU-based clipping when the number of rectangles was above some threshold, but the additional code complexity isn’t obviously worth the effort, given how infrequently it will be used. So I haven’t bothered. Most operations look like this:

Allocate VBO space for data

Fill VBO with X primitives

loop over clip rects {
    set the scissor to the clip rect
    draw the primitives
}

This obviously outsources as much of the problem as possible to the GL library, reducing the CPU time spent in glamor to a minimum.

A Peek at Some Code

With all of these changes in place, drawing something like a list of rectangles becomes a fairly simple piece of code:

First, make sure the program we want to use is available and can be used with our GC configuration:

prog = glamor_use_program_fill(pixmap, gc,
                               &glamor_priv->poly_fill_rect_program,
                               &glamor_facet_polyfillrect);

if (!prog)
    goto bail_ctx;

Next, allocate the VBO space and copy all of the X data into it. Note that the data transfer is simply ‘memcpy’ here — that’s because we break the X objects apart in the vertex shader using instancing, avoiding the CPU overhead of computing four corner coordinates.

/* Set up the vertex buffers for the points */

v = glamor_get_vbo_space(drawable->pScreen,
                         nrect * (4 * sizeof (GLshort)),
                         &vbo_offset);

glEnableVertexAttribArray(GLAMOR_VERTEX_POS);
glVertexAttribDivisor(GLAMOR_VERTEX_POS, 1);
glVertexAttribPointer(GLAMOR_VERTEX_POS, 4, GL_SHORT, GL_FALSE,
                      4 * sizeof (GLshort), vbo_offset);

memcpy(v, prect, nrect * sizeof (xRectangle));

glamor_put_vbo_space(screen);

Finally, loop over the pixmap tile fragments, and then over the clip list, selecting the drawing target and painting the rectangles:


glamor_pixmap_loop(pixmap_priv, box_x, box_y) {
    int nbox = RegionNumRects(gc->pCompositeClip);
    BoxPtr box = RegionRects(gc->pCompositeClip);

    glamor_set_destination_drawable(drawable, box_x, box_y, TRUE, FALSE,
                                    prog->matrix_uniform, &off_x, &off_y);

    while (nbox--) {
        glScissor(box->x1 + off_x,
                  box->y1 + off_y,
                  box->x2 - box->x1,
                  box->y2 - box->y1);
        box++;
        glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, nrect);
    }
}

GL texture size limits

X pixmaps use 16-bit dimensions for width and height, allowing them to be up to 65535 x 65535 pixels. Because the X coordinate space is signed, only a quarter of this space is actually useful, which makes the useful size of X pixmaps only 32767 x 32767. This is still larger than most GL implementations offer as a maximum texture size, though, and while it would be nice to just say ‘we don’t allow pixmaps larger than GL textures’, the reality is that many applications expect to be able to allocate such pixmaps today, largely to hold the ever-increasing size of digital photographs.

Glamor has always supported large X pixmaps; it does this by splitting them up into tiles, each of which is no larger than the largest texture supported by the driver. What I’ve added to Glamor is some simple macros that walk over the array of tiles, making it easy for the rendering code to support large pixmaps without needing any special case code.
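The macros amount to a nested walk over fixed-size tiles. A rough sketch of that arithmetic (not the actual glamor macros, whose names and signatures differ):

```c
/* Walk the tiles of a large pixmap, where each tile is at most
 * 'tile_size' pixels on a side; visit() sees the clamped extent of
 * each tile, so edge tiles are smaller. Returns the tile count. */
static int
walk_tiles(int width, int height, int tile_size,
           void (*visit)(int x, int y, int w, int h, void *closure),
           void *closure)
{
    int count = 0;
    for (int y = 0; y < height; y += tile_size)
        for (int x = 0; x < width; x += tile_size) {
            int w = (width - x) < tile_size ? (width - x) : tile_size;
            int h = (height - y) < tile_size ? (height - y) : tile_size;
            visit(x, y, w, h, closure);
            count++;
        }
    return count;
}
```

The rendering code then only has to work in tile-relative coordinates inside `visit`; the walk itself is shared.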

Glamor also had some simple testing support — you can compile the code to ignore the system-provided maximum texture size and supply your own value. This code had gone stale, and couldn’t work as there were parts of the code for which tiling support just doesn’t make sense, like the glyph cache, or the X scanout buffer. I fixed things so that you could leave those special cases as solitary large tiles while breaking up all other pixmaps into tiles no larger than 32 pixels square.

I hope to remove the single-tile case and leave the code supporting only the multiple-tile case; we have to have the latter code anyway, so keeping the single-tile code around simply increases our code size for no obvious benefit.

Getting accelerated copies between tiled pixmaps added a new coordinate system to the mix and took a few hours of fussing until it was working.

Rebasing Many (many) Times

I’m sure most of us remember the days before git; changes were often monolithic, and the notion of changing how the changes were made for the sake of clarity never occurred to anyone. It used to be that the final code was the only interesting artifact; how you got there didn’t matter to anyone. Things are different today; I probably spend a third of my development time communicating how the code should change with other developers by changing the sequence of patches that are to be applied.

In the case of Glamor, I’ve now got a set of 28 patches. The first few are fixes outside of the glamor tree that make the rest of the server work better. Then there are a few general glamor infrastructure additions. After that, each core operation is replaced, one at a time. Finally, a bit of stale code is removed. By sequencing things in a logical fashion, I hope to make review of the code easier, which should mean that people will spend less time trying to figure out what I did and more time figuring out whether what I did is reasonable and correct.

Supporting Older Versions of GL

All of the new code uses vertex instancing to move coordinate computation from the CPU to the GPU. I’m also pulling textures apart using integer operations. Right now, we should correctly fall back to software for older hardware, but it would probably be nicer to just fall back to simpler GL instead. Unless everyone decides to go buy hardware with new enough GL driver support, someone is going to need to write simplified code paths for glamor.

If you’ve got such hardware, and are interested in making it work well, please take this as an opportunity to help yourself and others.

Near-term Glamor Goals

I’m pretty sure we’ll have the code in good enough shape to merge before the window closes for X server 1.16. Eric is in charge of the glamor tree, so it’s up to him when stuff is pulled in. He and Markus Wick have also been generating code and reviewing stuff, but we could always use additional testing and review to make the code as good as possible before the merge window closes.

Markus has expressed an interest in working on Glamor as a part of the summer of code this year; there’s clearly plenty of work to do here, Eric and I haven’t touched the render acceleration stuff at all, and that code could definitely use some updating to use more modern GL features.

If that works as well as the core rendering code changes, then we can look forward to a Glamor which offers GPU-limited performance for classic X applications, without requiring GPU-specific drivers for every generation of every chip.

Posted Sat Mar 22 01:20:34 2014 Tags: fdo

Brief Glamor Hacks

Eric Anholt started writing Glamor a few years ago. The goal was to provide credible 2D acceleration based solely on the OpenGL API, in particular, to implement the X drawing primitives, both core and Render extension, without any GPU-specific code. When he started, the thinking was that fixed-function devices were still relevant, so that original code didn’t insist upon “modern” OpenGL features like GLSL shaders. That made the code less efficient and hard to write.

Glamor used to be a side-project within the X world; seen as something that really wasn’t very useful; something that any credible 2D driver would replace with custom highly-optimized GPU-specific code. Eric and I both hoped that Glamor would turn into something credible and that we’d be able to eliminate all of the horror-show GPU-specific code in every driver for drawing X text, rectangles and composited images. That hadn’t happened though, until now.

Fast forward to the last six months. Eric has spent a bunch of time cleaning up Glamor internals, and in fact he’s had it merged into the core X server for version 1.16 which will be coming up this July. Within the Glamor code base, he’s been cleaning some internal structures up and making life more tolerable for Glamor developers.

Using libepoxy

A big part of the cleanup was transitioning all of the extension function calls to his other new project, libepoxy, which provides a sane, consistent and performant API to OpenGL extensions for Linux, Mac OS and Windows. That library is awesome, and you should use it for everything you do with OpenGL because not using it is like pounding nails into your head. Or eating non-tasty pie.

Using VBOs in Glamor

One thing he recently cleaned up was how to deal with VBOs during X operations. VBOs are absolutely essential to modern OpenGL applications; they’re really the only way to efficiently pass vertex data from application to the GPU. And, every other mechanism is deprecated by the ARB as not a part of the blessed ‘core context’.

Glamor provides a simple way of getting some VBO space, dumping data into it, and then using it through two wrapping calls which you use along with glVertexAttribPointer as follows:

pointer = glamor_get_vbo_space(screen, size, &offset);
glVertexAttribPointer(attribute_location, count, type,
                      GL_FALSE, stride, offset);
memcpy(pointer, data, size);
glamor_put_vbo_space(screen);

glamor_get_vbo_space allocates the specified amount of VBO space and returns a pointer to that along with an ‘offset’, which is suitable to pass to glVertexAttribPointer. You dump your data into the returned pointer, call glamor_put_vbo_space and you’re all done.

Actually Drawing Stuff

At the same time, Eric has been optimizing some of the existing rendering code. But, all of it is still frankly terrible. Our dream of credible 2D graphics through OpenGL just wasn’t being realized at all.

On Monday, I decided that I should go play in Glamor for a few days, both to hack up some simple rendering paths and to familiarize myself with the insides of Glamor as I’m getting asked to review piles of patches for it, and not understanding a code base is a great way to help introduce serious bugs during review.

I started with the core text operations. Not because they’re particularly relevant these days as most applications draw text with the Render extension to provide better looking results, but instead because they’re often one of the hardest things to do efficiently with a heavy weight GPU interface, and OpenGL can be amazingly heavy weight if you let it.

Eric spent a bunch of time optimizing the existing text code to try and make it faster, but at the bottom, it actually draws each lit pixel as a tiny GL_POINT object by sending a separate x/y vertex value to the GPU (using the above VBO interface). This code walks the array of bits in the font, checking each one to see if it is lit, then checking whether the lit pixel is within the clip region, and only then adding the coordinates of the lit pixel to the VBO. The amazing thing is that even with all of this CPU and GPU work, the venerable 6x13 font is drawn at an astonishing 3.2 million glyphs per second. Of course, pure software draws text at 9.3 million glyphs per second.

I suspected that a more efficient implementation might be able to draw text a bit faster, so I decided to just start from scratch with a new GL-based core X text drawing function. The plan was pretty simple:

  1. Dump all glyphs in the font into a texture. Store them in 1bpp format to minimize memory consumption.

  2. Place raw (integer) glyph coordinates into the VBO. Place four coordinates for each and draw a GL_QUAD for each glyph.

  3. Transform the glyph coordinates into the usual GL range (-1..1) in the vertex shader.

  4. Fetch a suitable byte from the glyph texture, extract a single bit and then either draw a solid color or discard the fragment.
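Steps 3 and 4 are simple arithmetic. A CPU-side sketch for clarity (in the real path these live in the vertex and fragment shaders; the LSB-first bit order here is an assumption):

```c
/* Step 3: map a pixel coordinate into GL's normalized -1..1 range,
 * as the vertex shader does for each glyph corner. */
static float
to_ndc(int coord, int extent)
{
    return 2.0f * (float) coord / (float) extent - 1.0f;
}

/* Step 4: fetch one bit of a 1bpp glyph row, as the fragment shader
 * does before deciding to draw a solid color or discard the fragment.
 * LSB-first bit order within each byte is assumed here. */
static int
glyph_bit(const unsigned char *row, int x)
{
    return (row[x >> 3] >> (x & 7)) & 1;
}
```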

This makes the X server code surprisingly simple; it computes integer coordinates for the glyph destination and glyph image source and writes those to the VBO. When all of the glyphs are in the VBO, it just calls glDrawArrays(GL_QUADS, 0, 4 * count). The results were “encouraging”:

1: fb-text.perf
2: glamor-text.perf
3: keith-text.perf

       1                 2                           3                 Operation
------------   -------------------------   -------------------------   -------------------------
   9300000.0      3160000.0 (     0.340)     18000000.0 (     1.935)   Char in 80-char line (6x13) 
   8700000.0      2850000.0 (     0.328)     16500000.0 (     1.897)   Char in 70-char line (8x13) 
   6560000.0      2380000.0 (     0.363)     11900000.0 (     1.814)   Char in 60-char line (9x15) 
   2150000.0       700000.0 (     0.326)      7710000.0 (     3.586)   Char16 in 40-char line (k14) 
    894000.0       283000.0 (     0.317)      4500000.0 (     5.034)   Char16 in 23-char line (k24) 
   9170000.0      4400000.0 (     0.480)     17300000.0 (     1.887)   Char in 80-char line (TR 10) 
   3080000.0      1090000.0 (     0.354)      7810000.0 (     2.536)   Char in 30-char line (TR 24) 
   6690000.0      2640000.0 (     0.395)      5180000.0 (     0.774)   Char in 20/40/20 line (6x13, TR 10) 
   1160000.0       351000.0 (     0.303)      2080000.0 (     1.793)   Char16 in 7/14/7 line (k14, k24) 
   8310000.0      2880000.0 (     0.347)     15600000.0 (     1.877)   Char in 80-char image line (6x13) 
   7510000.0      2550000.0 (     0.340)     12900000.0 (     1.718)   Char in 70-char image line (8x13) 
   5650000.0      2090000.0 (     0.370)     11400000.0 (     2.018)   Char in 60-char image line (9x15) 
   2000000.0       670000.0 (     0.335)      7780000.0 (     3.890)   Char16 in 40-char image line (k14) 
    823000.0       270000.0 (     0.328)      4470000.0 (     5.431)   Char16 in 23-char image line (k24) 
   8500000.0      3710000.0 (     0.436)      8250000.0 (     0.971)   Char in 80-char image line (TR 10) 
   2620000.0       983000.0 (     0.375)      3650000.0 (     1.393)   Char in 30-char image line (TR 24)

This is our old friend x11perfcomp, but slightly adjusted for a modern reality where you really do end up drawing billions of objects (hence the wider columns). This table lists the performance for drawing a range of different fonts in both poly text and image text variants. The first column is for Xephyr using software (fb) rendering, the second is for the existing Glamor GL_POINT based code and the third is the latest GL_QUAD based code.

As you can see, drawing points for every lit pixel in a glyph is surprisingly fast, but only about 1/3 the speed of software for essentially any size glyph. By minimizing the use of the CPU and pushing piles of work into the GPU, we manage to increase the speed of most of the operations, with larger glyphs improving significantly more than smaller glyphs.

Now, you ask how much code this involved. And, I can truthfully say that it was a very small amount to write:

                    |    2 
 glamor.c           |    5 
 glamor_core.c      |    8 
 glamor_font.c      |  181 ++++++++++++++++++++
 glamor_font.h      |   50 +++++
 glamor_priv.h      |   26 ++
 glamor_text.c      |  472 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.c |    2 
 8 files changed, 741 insertions(+), 5 deletions(-)

Let’s Start At The Very Beginning

The results of optimizing text encouraged me to start at the top of x11perf and see what progress I could make. In particular, looking at the current Glamor code, I noticed that it did all of the vertex transformation with the CPU. That makes no sense at all for any GPU built in the last decade; they’ve got massive numbers of transistors dedicated to performing precisely this kind of operation. So, I decided to see what I could do with PolyPoint.

PolyPoint is absolutely brutal on any GPU; you have to pass it two coordinates for each pixel, and so the very best you can do is send it 32 bits, or precisely the same amount of data needed to actually draw a pixel on the frame buffer. With this in mind, one expects that about the best you can do compared with software is tie. Of course, the CPU version is actually computing an address and clipping, but those are all easily buried in the cost of actually storing a pixel.

In any case, the results of this little exercise are pretty close to a tie — the CPU draws 190,000,000 dots per second and the GPU draws 189,000,000 dots per second. Looking at the vertex and fragment shaders generated by the compiler, it’s clear that there’s room for improvement.

The fragment shader is simply pulling the constant pixel color from a uniform and assigning it to the fragment color in this the simplest of all possible shaders:

uniform vec4 color;
void main()
{
        gl_FragColor = color;
}

This generates five instructions:

Native code for point fragment shader 7 (SIMD8 dispatch):
   FB write target 0
0x00000000: mov(8)          g113<1>F        g2<0,1,0>F                      { align1 WE_normal 1Q };
0x00000010: mov(8)          g114<1>F        g2.1<0,1,0>F                    { align1 WE_normal 1Q };
0x00000020: mov(8)          g115<1>F        g2.2<0,1,0>F                    { align1 WE_normal 1Q };
0x00000030: mov(8)          g116<1>F        g2.3<0,1,0>F                    { align1 WE_normal 1Q };
0x00000040: sendc(8)        null            g113<8,8,1>F
                render ( RT write, 0, 4, 12) mlen 4 rlen 0      { align1 WE_normal 1Q EOT };
   END B0

As this pattern is actually pretty common, it turns out there’s a single instruction that can replace all four of the moves. That should actually make a significant difference in the run time of this shader, and this shader runs once for every single pixel.

The vertex shader has some similar optimization opportunities, but it only runs once for every 8 pixels — with the SIMD format flipped around, the vertex shader can compute 8 vertices in parallel, so it ends up executing 8 times less often. It’s got some redundant moves, which could be optimized by improving the copy propagation analysis code in the compiler.

Of course, improving the compiler to make these cases run faster will probably make a lot of other applications run faster too, so it’s probably worth doing at some point.

Again, the amount of code necessary to add this path was tiny:

                    |    1 
 glamor.c           |    2 
 glamor_polyops.c   |  116 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 glamor_priv.h      |    8 +++
 glamor_transform.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.h |   51 ++++++++++++++++++++++
 6 files changed, 292 insertions(+), 4 deletions(-)

Discussion of Results

These two cases, text and points, are probably the hardest operations to accelerate with a GPU and yet a small amount of OpenGL code was able to meet or beat software easily. The advantage of this work over traditional GPU 2D acceleration implementations should be pretty clear — this same code should work well on any GPU which offers a reasonable OpenGL implementation. That means everyone shares the benefits of this code, and everyone can contribute to making 2D operations faster.

All of these measurements were actually done using Xephyr, which offers a testing environment unlike any I’ve ever had — build and test hardware acceleration code within a nested X server, debugging it in a windowed environment on a single machine. Here’s how I’m running it:

$ ./Xephyr  -glamor :1 -schedMax 2000 -screen 1024x768 -retro

The one bit of magic here is the ‘-schedMax 2000’ flag, which causes Xephyr to update the screen less often when applications are very busy and serves to reduce the overhead of screen updates while running x11perf.

Future Work

Having managed to accelerate 17 of the 392 operations in x11perf, it’s pretty clear that I could spend a bunch of time just stepping through each of the remaining ones and working on them. Before doing that, we want to try and work out some general principles about how to handle core X fill styles. Moving all of the stipple and tile computation to the GPU will help reduce the amount of code necessary to fill rectangles and spans, along with improving performance, assuming the above exercise generalizes to other primitives.

Getting and Testing the Code

Most of the changes here are from Eric’s glamor-server branch:

git:// glamor-server

The two patches shown above, along with a pair of simple clean up patches that I’ve written this week are available here:

git:// glamor-server

Of course, as this now uses libepoxy, you’ll need to fetch, build and install that before trying to compile this X server.

Because you can try all of this out in Xephyr, it’s easy to download and build this X server and then run it right on top of your current driver stack inside of X. I’d really like to hear from people with Radeon or nVidia hardware to know whether the code works, and how it compares with fb on the same machine, which you get when you elide the ‘-glamor’ argument from the example Xephyr command line above.

Posted Fri Mar 7 01:12:28 2014 Tags: fdo

Missing FOSDEM

I’m afraid Eric and I won’t be at FOSDEM this weekend; our flight got canceled, and the backup they offered would have gotten us there late Saturday night. It seemed crazy to fly to Brussels for one day of FOSDEM, so we decided to just stay home and get some work done.

Sorry to be missing all of the fabulous FOSDEM adventures and getting to see all of the fun people who attend one of the best conferences around. Hope everyone has a great time, and finds only the best chocolates.

Posted Fri Jan 31 18:22:54 2014 Tags: fdo

X bitmaps vs OpenGL

Of course, you all know that X started life as a monochrome window system for the VS100. Back then, bitmaps and rasterops were cool; you could do all kinds of things with simple bit operations. Things changed, and eventually X bitmaps became useful only off-screen for clip masks, text and stipples. These days, you’ll rarely see anyone using a bitmap — everything we used to use bitmaps for has gone all alpha-values on us.

In OpenGL, there aren’t any bitmaps. About the most ‘bitmap-like’ object you’ll find is an A8 texture, holding 8 bits of alpha value for each pixel. There’s no way to draw to or texture from anything where each pixel is represented as a single bit.

So, as Eric goes about improving Glamor, he got a bit stuck with bitmaps. We could either:

  • Support them only on the CPU, uploading copies as A8 textures when used as a source in conjunction with GPU objects.

  • Support them as 1bpp on the CPU and A8 on the GPU, doing fancy tracking between the two objects when rendering occurred.

  • Fix the CPU code to deal with bitmaps stored 8 bits per pixel.

I thought the last choice would be the best plan — directly share the same object between CPU and GPU rendering, avoiding all reformatting as things move around in the server.

Why is this non-trivial?

Usually, you can flip formats around with reckless abandon in X, it has separate bits-per-pixel and depth values everywhere. That’s how we do things like 32 bits-per-pixel RGB surfaces; we just report them as depth 24 and everyone is happy.

Bitmaps are special though. The X protocol has separate (and overly complicated) image formats for single bit images, and those have to be packed 1 bit per pixel. Within the server, bitmaps are used for drawing core text, stippling and creating clip masks. They’re the ‘lingua franca’ of image formats, allowing you to translate between depths by pulling a single “plane” out of a source image and painting it into a destination of arbitrary depth.
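For example, pulling one plane out of an 8-bit-deep row and packing it into a bitmap looks something like this (a sketch, not the server’s fb code; LSB-first bit order assumed):

```c
#include <stdint.h>

/* Pull a single plane out of an 8-bit-deep source row, packing the
 * selected bit of each pixel into a 1bpp destination, LSB first.
 * 'width' is in pixels; callers ensure dst has (width + 7) / 8 bytes. */
static void
extract_plane_row(const uint8_t *src, uint8_t *dst, int width, int plane)
{
    for (int i = 0; i < (width + 7) / 8; i++)
        dst[i] = 0;
    for (int x = 0; x < width; x++)
        if ((src[x] >> plane) & 1)
            dst[x >> 3] |= 1 << (x & 7);
}
```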

As such, the frame buffer rendering code in the server assumes that bitmaps are always 1 bit per pixel. Given that it must deal with 1bpp images on the wire, and given the history of X, it certainly made sense at the time to simplify the code with this assumption.

A blast from the past

I’d actually looked into doing this before. As part of the DECstation line, DEC built the MX monochrome frame buffer board, and to save money, they actually created it by populating a single bit in each byte rather than packed into 8 pixels per byte. I have this vague memory that they were able to use only 4 memory chips this way.

The original X driver for this exposed a depth-8 static color format because of the assumptions made by the (then current) CFB code about bitmap formats.

Jim Gettys wandered over to MIT while the MX frame buffer was in design and asked how hard it would be to support it as a monochrome device instead of the depth-8 static color format. At the time, fixing CFB would have been a huge effort, and there wasn’t any infrastructure for separating the wire image format from the internal pixmap format. So, we gave up and left things looking weird to applications.

Hacking FB

These days, the amount of frame buffer code in the X server is dramatically less; CFB and MFB have been replaced with the smaller (and more general) FB code. It turns out that the number of places which need to deal with individual bits in a bitmap are now limited to a few stippling and CopyPlane functions. And, in those functions, the number of individual read operations from the bitmap are few in number. Each of those fetches looked like:

bits = READ(src++)

All I needed to do was make this operation return 32 bits by pulling one bit from each of 8 separate 32-bit chunks and merge them together. The first thing to do was to pad the pixmap out to a 32 byte boundary, rather than a 32 bit boundary. This ensured that I would always be able to fetch data from the bitmap in 8 32-bit chunks. Next, I simply replaced the READ macro call with:

    bits = fb_stip_read(src, srcBpp);
    src += srcBpp;

The new fb_stip_read function checks srcBpp and packs things together for 8bpp images:

/*
 * Given a depth 1, 8bpp stipple, pull out
 * a full FbStip worth of packed bits
 */
static inline FbStip
fb_pack_stip_8_1(FbStip *bits)
{
    FbStip      r = 0;
    int         i;

    for (i = 0; i < 8; i++) {
        FbStip  b;
        FbStip  p;

        b = FB_READ(bits++);
#if BITMAP_BIT_ORDER == LSBFirst
        p = (b & 1) | ((b >> 7) & 2) | ((b >> 14) & 4) | ((b >> 21) & 8);
        r |= p << (i << 2);
#else
        p = (b & 0x80000000) | ((b << 7) & 0x40000000) |
            ((b << 14) & 0x20000000) | ((b << 21) & 0x10000000);
        r |= p >> (i << 2);
#endif
    }
    return r;
}

/*
 * Return packed stipple bits from src
 */
static inline FbStip
fb_stip_read(FbStip *bits, int bpp)
{
    switch (bpp) {
    default:
        return FB_READ(bits);
    case 8:
        return fb_pack_stip_8_1(bits);
    }
}
It turns into a fairly hefty amount of code, but the number of places this ends up being used is pretty small, so it shouldn’t increase the size of the server by much. Of course, I’ve only tested the LSBFirst case, but I think the MSBFirst code is correct.
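As a quick sanity check of the LSBFirst packing logic, here is a self-contained restatement (types and the FB_READ macro stubbed out so it compiles on its own):

```c
#include <stdint.h>

typedef uint32_t FbStip;
#define FB_READ(p) (*(p))

/* Restated LSBFirst case from the text: each 32-bit chunk holds four
 * pixels at one bit per byte; gather the low bit of each byte across
 * 8 chunks into one packed 32-bit stipple. */
static inline FbStip
pack_stip_8_1(FbStip *bits)
{
    FbStip r = 0;
    for (int i = 0; i < 8; i++) {
        FbStip b = FB_READ(bits++);
        FbStip p = (b & 1) | ((b >> 7) & 2) | ((b >> 14) & 4) | ((b >> 21) & 8);
        r |= p << (i << 2);
    }
    return r;
}
```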

I’ve sent the patches to do this to the xorg-devel mailing list, and they’re also on the ‘depth1’ branch in my repository



Eric also hacked up the test suite to be runnable by piglit, and I’ve run it in that mode against these changes. I had made a few mistakes, and the test suite caught them nicely. Let’s hope this adventure helps Eric out as he continues to improve Glamor.

Posted Mon Jan 13 18:52:30 2014 Tags: fdo

Pixmap ID Lifetimes under Present Redirection (Part Deux)

I recently wrote about pixmap content and ID lifetimes. I think the pixmap content lifetime stuff was useful, but the pixmap ID stuff was not quite right. I’ve just copied the section from the previous post and will update it.

PresentRedirect pixmap ID lifetimes (reprise)

A compositing manager will want to know the lifetime of the pixmaps delivered in PresentRedirectNotify events to clean up whatever local data it has associated with it. For instance, GL compositing managers will construct textures for each pixmap that need to be destroyed when the pixmap disappears.

Present encourages pixmap destruction

The PresentPixmap request says:

PresentRegion holds a reference to ‘pixmap’ until the presentation occurs, so ‘pixmap’ may be immediately freed after the request executes, even if that is before the presentation occurs.

Breaking this when doing Present redirection seems like a bad idea; it’s a very common pattern for existing 2D applications.

New pixmap IDs for everyone

Because pixmap IDs for present contents are owned by the source application (and may be re-used immediately when freed), Present can’t use that ID in the PresentRedirectNotify event. Instead, it must allocate a new ID and send that. The server has its own XID space to use, and so it can allocate one of those and bind it to the same underlying pixmap; the pixmap will not actually be destroyed until both the application XID and the server XID are freed.
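The two-ID arrangement is effectively a reference count on the underlying pixmap. A sketch of the intended server-side behavior (Python stand-in with invented names, not actual X server internals):

```python
class ServerPixmap:
    """Underlying pixmap storage, kept alive while any XID references it."""
    def __init__(self):
        self.xids = set()
        self.destroyed = False

    def bind(self, xid):
        self.xids.add(xid)

    def free(self, xid):
        # FreePixmap on either the application XID or the server-allocated XID.
        self.xids.discard(xid)
        if not self.xids:
            self.destroyed = True        # storage goes away with the last ID
        return self.destroyed

pix = ServerPixmap()
pix.bind("app-xid")                      # application's original ID
pix.bind("server-xid")                   # server-allocated alias for the compositor
pix.free("app-xid")                      # application frees; pixmap survives
pix.free("server-xid")                   # compositor frees; now it is gone
```

Either side can free in either order; only the last free actually destroys the storage.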

The compositing manager will receive this new XID, and is responsible for freeing it when it doesn’t need it any longer. Of course, if the compositing manager exits, all of the XIDs passed to it will automatically be freed.

I considered allocating client-specific XIDs for this purpose; the X server already has a mechanism for allocating internal IDs for various things that are to be destroyed at client shut-down time. Those XIDs have bit 30 set, and are technically invalid XIDs as X specifies that the top three bits of every XID will be zero. However, the cost of using a server ID instead (which is a valid XID) is small, and it’s always nice to not intentionally break the X protocol (even as we continue to take advantage of “accidental” breakages).

(Reserving the top three bits of XIDs and insisting that they always be zero was a nod to the common practice of weakly typed languages like Lisp. In these languages, 32-bit object references were typed by using a few tag bits (2-3) within the value. Most of these were just pointers to data, but small integers, which fit in the remaining bits (29-30), could be constructed without allocating any memory. By making XIDs only 29 bits, these languages could be assured that all XIDs would be small integers.)
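As a toy illustration of that fixnum scheme (my own, not taken from any particular Lisp implementation): with a 3-bit tag there are 29 payload bits, so any XID whose top three bits are zero boxes without allocation.

```python
TAG_BITS = 3
FIXNUM_TAG = 0b000               # hypothetical tag value for small integers

def box_fixnum(value):
    """Pack a 29-bit value and a 3-bit type tag into one 32-bit word."""
    assert 0 <= value < (1 << (32 - TAG_BITS))
    return (value << TAG_BITS) | FIXNUM_TAG

def unbox_fixnum(word):
    return word >> TAG_BITS

xid = 0x1ffffff0                 # top three bits zero, as X guarantees
assert unbox_fixnum(box_fixnum(xid)) == xid
```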

Pixmap Destroy Notify

The compositing manager needs to know when the application is done with the pixmap so that it can clean up when it is also done: destroying the extra pixmap ID it was given and freeing any other local resources. When the application sends a FreePixmap request, that XID will become invalid, but of course the pixmap itself won’t be freed until the compositing manager sends a FreePixmap request with the extra XID it was given.

Because the pixmap ID known by the compositing manager is going to be different from the original application XID, we need an event delivered to the compositing manager with the new XID, which makes this event rather Present-specific. We don’t need to select for this event; the compositing manager must handle it correctly, so we’ll just send it whenever the compositing manager has received that custom XID.


    type: CARD8         XGE event type (35)
    extension: CARD8        Present extension opcode
    sequence-number: CARD16
    length: CARD32          0
    evtype: CARD16          Present_PixmapDestroyNotify
    event-window: WINDOW
    pixmap: pixmap

This event is delivered to the clients selecting for SubredirectNotify for pixmaps which were delivered in a PresentRedirectNotify event and for which the originating application Pixmap has been destroyed. Note that the pixmap may still be referenced by other objects within the X server, by a pending PresentPixmap request, as a window background, GC tile or stipple or Picture drawable (among others), but this event serves to notify the selecting client that the application is no longer able to access the underlying pixmap with its original Pixmap ID.

Posted Wed Jan 1 20:20:40 2014 Tags: fdo

Object Lifetimes under Present Redirection

Present extension redirection passes responsibility for showing application contents from the X server to the compositing manager. This eliminates an extra copy to the composite extension window buffer and also allows the application contents to be shown in the right frame.

(Currently, the copy from application buffer to window buffer is synchronized to the application-specified frame, and then a Damage event is delivered to the compositing manager which constructs the screen image using the window buffer and presents that to the screen at least one frame later, which generally adds another frame of delay for the application.)

The redirection operation itself is simple — just wrap the PresentPixmap request up into an event and send it to the compositing manager. However, the pixmap to be presented is allocated by the application, and hence may disappear at any time. We need to carefully control the lifetime of the pixmap ID and the specific frame contents of the pixmap so that the compositing manager can reliably construct complete frames.

We’ll separately discuss the lifetime of the specific frame contents from that of the pixmap itself. By the “frame contents”, I mean the image, as a set of pixel values in the pixmap, to be presented for a specific frame.

Present Pixmap contents lifetime

After the application is finished constructing a particular frame in a pixmap, it passes the pixmap to the X server with PresentPixmap. With non-redirected Present, the X server is responsible for generating a PresentIdleNotify event once the server is finished using the contents. There are three different cases that the server handles, matching the three different PresentCompleteModes:

  1. Copy. The pixmap contents are not needed after the copy operation has been performed. Hence, the PresentIdleNotify event is generated when the target vblank time has been reached, right after the X server calls CopyArea.

  2. Flip. The pixmap is being used for scanout, and so the X server won’t be done using it until some other scanout buffer is selected. This can happen as a result of window reconfiguration which makes the displayed window no longer full-screen, but the usual case is when the application presents a subsequent frame for display, and the new frame replaces the old. Thus, the PresentIdleNotify event generally occurs when the target vblank time for the subsequent frame has been reached, right after the subsequent frame’s pixmap has been selected for scanout.

  3. Skip. The pixmap contents will never be used, and the X server figures this out when a subsequent frame is delivered with a matching target vblank time. This happens when the subsequent Present operation is queued by the X server.

In the Redirect case, the X server cannot tell when the compositing manager is finished with the pixmap. The same three cases as above apply here, but the results are slightly different:

  1. Composite. The pixmap is being used as a part of the screen image and must be composited with other window pixmaps. In this case, the compositing manager will need to hold onto the pixmap until a subsequent pixmap is provided by the application. Thus, the pixmap will remain needed by the compositing manager until it receives a subsequent PresentRedirectNotify for the same window.

  2. Flip. The compositing manager is free to take the application pixmap and use it directly in a subsequent PresentPixmap operation and let the X server ‘flip’ to it; this provides a simple way of avoiding an extra copy while not needing to fuss around with ‘unredirecting’ windows. In this case, the X server will need the pixmap contents until a new scanout pixmap is provided, and the compositing manager will also need the pixmap in case the contents are needed to help construct a subsequent frame.

  3. Skip. In this case, the compositing manager notices that the window’s pixmap has been replaced before it was ever used.

In case 2, the X server and the compositing manager will need to agree on when the PresentIdleNotify event should be delivered. In the other two cases, the compositing manager itself will be in charge of that.

To let the compositing manager control when the event is delivered, the X server will count the number of PresentPixmap redirection events sent, and the compositing manager will deliver matching PresentIdle requests.


    PresentIdle

    pixmap: PIXMAP
Errors: Pixmap, Match

Informs the server that the Pixmap passed in a PresentRedirectNotify event is no longer needed by the client. Each PresentRedirectNotify event must be matched by a PresentIdle request for the originating client to receive a PresentIdleNotify event.
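The bookkeeping this implies can be sketched as follows (Python with invented names; the real mechanism is X requests and events, and PresentIdle here is the proposed request, not shipped protocol): the server counts redirect events per pixmap and only reports idle to the application once every one has been matched.

```python
class RedirectTracker:
    """Model of the proposed PresentIdle accounting (names are illustrative)."""
    def __init__(self):
        self.outstanding = {}            # pixmap id -> unmatched redirect events

    def send_redirect_notify(self, pixmap):
        # Server wraps a PresentPixmap into a PresentRedirectNotify event.
        self.outstanding[pixmap] = self.outstanding.get(pixmap, 0) + 1

    def present_idle(self, pixmap):
        # Compositing manager says it is done with one redirected presentation.
        if self.outstanding.get(pixmap, 0) == 0:
            raise ValueError("Match error: no outstanding redirect event")
        self.outstanding[pixmap] -= 1
        # PresentIdleNotify reaches the application only when fully matched.
        return self.outstanding[pixmap] == 0
```

An unmatched PresentIdle corresponds to the Match error in the request description above.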

PresentRedirect pixmap ID lifetimes

A compositing manager will want to know the lifetime of the pixmaps delivered in PresentRedirectNotify events to clean up whatever local data it has associated with them. For instance, GL compositing managers will construct textures for each pixmap that need to be destroyed when the pixmap disappears.

Some kind of PixmapDestroyNotify event is necessary for this; the alternative is for the compositing manager to constantly query the X server to see if the pixmap IDs it is using are still valid, and even that isn’t reliable as the application may re-use pixmap IDs for a new pixmap.

It seems like this PixmapDestroyNotify event belongs in the XFixes extension—it’s a general mechanism that doesn’t directly relate to Present. XFixes doesn’t currently have any Generic Events associated, but adding that should be fairly straightforward. And then, have the Present extension automatically select for PixmapDestroyNotify events when it delivers the pixmap in a PresentRedirectNotify event so that the client is ensured of receiving the associated PixmapDestroyNotify event.

One remaining question I have is whether this is sufficient, or if the compositing manager needs a stable pixmap ID which lives beyond the life of the application pixmap ID. If so, the solution would be to have the X server allocate an internal ID for the pixmap and pass that to the client somehow; presumably in addition to the original pixmap ID.



    Defines a unique event delivery target for Present
    events. Multiple event IDs can be allocated to provide
    multiple distinct event delivery contexts.

PIXMAPEVENTS { XFixesPixmapDestroyNotify }

    eventid: XFIXESEVENTID
    pixmap: PIXMAP
    events: SETofPIXMAPEVENTS
Errors: Pixmap, Value

Changes the set of events to be delivered for the target pixmap. A Value error is sent if ‘events’ contains invalid event selections.


    type: CARD8         XGE event type (35)
    extension: CARD8        XFixes extension request number
    sequence-number: CARD16
    length: CARD32          0
    evtype: CARD16          XFixes_PixmapDestroyNotify
    pixmap: pixmap

This event is delivered to all clients selecting for it on ‘pixmap’ when the pixmap ID is destroyed by a client. Note that the pixmap may still be referenced by other objects within the X server, as a window background, GC tile or stipple or Picture drawable (among others), but this event serves to notify the selecting client that the ID is no longer associated with the underlying pixmap.

Posted Sat Dec 28 13:38:45 2013 Tags: fdo

Cleaning up X server warnings

So I was sitting in the Narita airport with a couple of other free software developers merging X server patches. One of the developers was looking over my shoulder while the X server was building and casually commented on the number of warnings generated by the compiler.

I felt like I had invited someone into my house without cleaning for months — embarrassed and ashamed that we’d let the code devolve into this state.

Of course, we’ve got excuses — the X server code base is one of the oldest pieces of regularly used free software in existence. It was started before ANSI-C was codified. No function prototypes, no ‘const’, no ‘void *’, no enums or stdint.h. There may be a few developers out there who remember those days (fondly, of course), but I think most of us are glad that our favorite systems language has gained a lot of compile-time checking in the last 25 years.

We’ve spent time in the past adding function prototypes and cleaning up other warnings, but there was never a point at which the X server compiled without any warnings. More recently, we’ve added a pile of new warning flags when compiling the X server which only served to increase the number of warnings dramatically.

The current situation

With the master branch of the X server and released versions of the dependencies, we generate 1047 warnings in the default build.

-Wcast-qual considered chatty

The GCC flag, -Wcast-qual, complains when you cast a pointer and change the ‘const’ qualifier status. A very common thing for the X server to do is declare pointers as ‘const’ to mark them as immutable once assigned. Often, the relevant data is actually constructed once at startup in allocated memory and stored to the data structure. During server reset, that needs to be freed, but free doesn’t take a const pointer, so we cast to (void *), which -Wcast-qual then complains about. Loudly.

Of the 1047 warnings, 380 of them are generated by this one warning flag. I’ve gone ahead and just disabled it in util/macros for now.

String constants are a pain

The X server uses string constants to initialize defaults for font paths, configuration options, font names along with a host of other things. These end up getting stored in variables that can also take allocated storage. I’ve gone ahead and declared the relevant objects as const and then fixed the code to suit.

I don’t have a count of the number of warnings these changes fixed; they were scattered across dozens of X server directories, and I was fixing one directory at a time, but probably more than half of the remaining warnings were of this form.

And a host of other warnings

Fixing the rest of the warnings was mostly a matter of stepping through them one at a time and actually adjusting the code. Shadowed declarations, unused values, redundant declarations and missing printf attributes were probably the bulk of them though.

Changes to external modules

Instead of just hacking the X server code, I’ve created patches for other modules where necessary to fix the problems in the “right” place.

  • proto/fontsproto. Declares FontPathElement names as ‘const char *’
  • mesa/drm. Adds ‘printf’ attribute to the debug_print function
  • util/macros. Removes -Wcast-qual from the default warning set.

Getting the bits

In case it wasn’t clear, the X server build now generates zero warnings on my machine.

I’m hoping that this will also be true for other people. Patches are available at:

xserver - git:// warning-fixes
fontsproto - git:// fontsproto-next
mesa/drm - git:// warning-fixes
util/macros - already upstream on master

Keeping our house clean

Of course, these patches are all waiting until 1.15 ships so that we don’t accidentally break something important. However, once they’re merged, I’ll be bouncing any patches which generate warnings on my system, and if other people find warnings when they build, I’ll expect them to send patches as well.

Now to go collect the tea cups in my office and get them washed along with the breakfast dishes so I won’t be embarrassed if some free software developers show up for lunch today.

Posted Fri Dec 13 13:58:38 2013 Tags: fdo

Tracking Cursor Position

I spent yesterday morning in the Accessibility BOF here at Guadec and was reminded that one persistent problem with tools like screen magnifiers and screen readers is that they need to know the current cursor position all the time, independent of which window the cursor is in and independent of grabs.

The current method that these applications are using to track the cursor is to poll the X server using XQueryPointer. This is obviously terrible for at least a couple of reasons:

  • Keeps the system active at regular intervals, preventing power savings.

  • Increased latency in mouse tracking—the interval between polling calls limits the time resolution of the position information.

These two problems also conflict with one another. Reducing input latency comes at the cost of further reducing the opportunities for power saving, and vice versa.

XInput2 to the rescue (?)

XInput2 has the ability to deliver raw device events right to applications, bypassing the whole event selection mechanism within the X server. This was designed to let games and other applications see relative mouse motion events and drawing applications see the whole tablet surface.

These raw events are really raw though; they do not include the cursor position, and so cannot be directly used for tracking.

However, we do know that the cursor only moves in response to input device events, so we can easily use the arrival of a raw event to trigger a query for the mouse position.

A better plan?

Perhaps what we should do is to actually create a new event type to report the cursor position and the containing window so that applications can simply track that. Yeah, it’s a bit of a special case, but it’s a common requirement for accessibility tools.

        detail:                    CARD32
        sourceid:                  DEVICEID
        flags:                     DEVICEEVENTFLAGS
    root:                      WINDOW
    window:                    WINDOW
    root-x, root-y:            INT16
    window-x, window-y:        INT16

A CursorEvent is sent whenever a sprite moves on the screen. ‘sourceid’ is the master pointer which is moving. ‘root’ is the root window containing the cursor, ‘window’ is the window that the pointer is in. ‘root-x and root-y’ indicate the position within the root window, ‘window-x’ and ‘window-y’ indicate the position within ‘window’.

Demo Application

Here’s a short application, hacked from Peter Hutterer’s ‘part1.c’

/* cc -o track_cursor track_cursor.c `pkg-config --cflags --libs xi x11` */

#include <stdio.h>
#include <string.h>
#include <X11/Xlib.h>
#include <X11/extensions/XInput2.h>

/* Return 1 if XI2 is available, 0 otherwise */
static int has_xi2(Display *dpy)
{
    int major, minor;
    int rc;

    /* We support XI 2.2 */
    major = 2;
    minor = 2;

    rc = XIQueryVersion(dpy, &major, &minor);
    if (rc == BadRequest) {
        printf("No XI2 support. Server supports version %d.%d only.\n", major, minor);
        return 0;
    } else if (rc != Success) {
        fprintf(stderr, "Internal Error! This is a bug in Xlib.\n");
        return 0;
    }

    printf("XI2 supported. Server provides version %d.%d.\n", major, minor);

    return 1;
}

static void select_events(Display *dpy, Window win)
{
    XIEventMask evmasks[1];
    unsigned char mask1[(XI_LASTEVENT + 7)/8];

    memset(mask1, 0, sizeof(mask1));

    /* select for raw motion events from all master devices */
    XISetMask(mask1, XI_RawMotion);

    evmasks[0].deviceid = XIAllMasterDevices;
    evmasks[0].mask_len = sizeof(mask1);
    evmasks[0].mask = mask1;

    XISelectEvents(dpy, win, evmasks, 1);
}

int main (int argc, char **argv)
{
    Display *dpy;
    int xi_opcode, event, error;
    XEvent ev;

    dpy = XOpenDisplay(NULL);

    if (!dpy) {
        fprintf(stderr, "Failed to open display.\n");
        return -1;
    }

    if (!XQueryExtension(dpy, "XInputExtension", &xi_opcode, &event, &error)) {
        printf("X Input extension not available.\n");
        return -1;
    }

    if (!has_xi2(dpy))
        return -1;

    /* select for XI2 events */
    select_events(dpy, DefaultRootWindow(dpy));

    while (1) {
        XGenericEventCookie *cookie = &ev.xcookie;
        XIRawEvent          *re;
        Window              root_ret, child_ret;
        int                 root_x, root_y;
        int                 win_x, win_y;
        unsigned int        mask;

        XNextEvent(dpy, &ev);

        if (cookie->type != GenericEvent ||
            cookie->extension != xi_opcode ||
            !XGetEventData(dpy, cookie))
            continue;

        switch (cookie->evtype) {
        case XI_RawMotion:
            re = (XIRawEvent *) cookie->data;
            XQueryPointer(dpy, DefaultRootWindow(dpy),
                          &root_ret, &child_ret, &root_x, &root_y,
                          &win_x, &win_y, &mask);
            printf ("raw %g,%g root %d,%d\n",
                    re->raw_values[0], re->raw_values[1],
                    root_x, root_y);
            break;
        }
        XFreeEventData(dpy, cookie);
    }

    return 0;
}

Hacks in xeyes

Of course, one “common” mouse tracking application is xeyes, so I’ve hacked up that code (on top of my present changes) here:

git clone git://
Posted Wed Aug 7 03:46:14 2013 Tags: fdo

Present Extension Redirection

Multi-buffered applications have always behaved poorly in the presence of the Composite extension:

  • Updating involves multiple copies of the window contents, first from back buffer to composite redirect buffer, thence from the composite buffer to the screen back buffer, and then from the screen back buffer to the scanout buffer.

    The last copy is amenable to page flipping, but that would require the EGL buffer age extension so that the compositor could reduce the cost of presenting following frames by being able to track the contents of the back buffers.

  • The application is not informed about when the actual screen presentation occurs.

Owen Taylor suggested that Present should offer a way to ‘redirect’ operations to the compositing manager as a way to solve these problems. This posting is my attempt to make that idea a bit more concrete given the current Present design.

Design Goals

Here’s a list of features I think we should try to provide:

  1. Provide accurate information to applications about when presentation to the screen actually occurs. In particular, GLX applications using GLX_OML_sync_control should receive the correct information in terms of UST and MSC for each Swap Buffers request.

  2. Ensure that applications still receive correct information as to the contents of their buffers, in particular we want to be able to implement EGL_EXT_buffer_age in a useful manner.

  3. Avoid needing to “un-redirect” full-screen windows to get page flipping behavior.

  4. Eliminate all extra copies. A windowed application may still perform one copy from back buffer to scanout buffer, but there shouldn’t be any reason to copy contents to the composite redirection buffer or the compositing manager back buffer.

Simple Present Redirection

With those goals in mind, here’s what I see as the sequence of events for a simple windowed application doing a new full-window update without any translucency or window transformation in effect:

  1. Application creates back buffer, draws new frame to it.

  2. Application executes PresentRegion. In this example, the ‘valid’ and ‘update’ parameters are ‘None’, indicating that the full window should be redrawn.

  3. The server captures the PresentRegion request and constructs a PresentRedirectNotify event containing sufficient information for the compositor to place that image correctly on the screen:

    • target window for the presentation
    • source pixmap containing the new image
    • idle fence to notify when the source pixmap is no longer in use.
    • serial number from the request.
    • target MSC for the presentation. This should probably just be the computed absolute MSC value, and not the original target/divisor/numerator values provided by the application.
  4. The compositing manager receives this event and constructs a new PresentRegion request using the provided source pixmap, but this time targeting the root window, and constructing a ‘valid’ region which clips the pixmap to the shape of the window on the screen.

    This request would use the original application’s idle fence value so that when complete, the application would get notified.

    This request would need to also include the original target window and serial number so that a suitable PresentCompleteNotify event can be constructed and delivered when the final presentation is complete.

  5. The server executes this new PresentRegion operation. When complete, it delivers PresentCompleteNotify events to both the compositing manager and the application.

  6. Once the source pixmap is no longer in use (either the copy has completed, or the screen has flipped away from this pixmap), the server triggers the idle fence.
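The six steps above can be sketched as a small simulation. This is Python with invented function names standing in for protocol requests and events; it only demonstrates who hears what, and in which order, for a single redirected presentation.

```python
# events recorded in order, to show who hears what and when
log = []

def present_pixmap(window, pixmap, serial, redirected):
    """Model of PresentRegion handling in the server (names illustrative)."""
    if redirected:
        # Step 3: wrap the request into a PresentRedirectNotify event.
        log.append(("PresentRedirectNotify->compositor", window, pixmap, serial))
        compositor_handle_redirect(window, pixmap, serial)
    else:
        execute_present(window, pixmap, serial)

def compositor_handle_redirect(window, pixmap, serial):
    # Step 4: re-present the same pixmap, now targeting the root window,
    # carrying the original window and serial for completion reporting.
    execute_present("root", pixmap, serial, origin=window)

def execute_present(target, pixmap, serial, origin=None):
    # Steps 5-6: presentation completes; both parties are notified and the
    # idle fence is triggered once the pixmap is no longer in use.
    log.append(("PresentCompleteNotify->compositor", target, serial))
    if origin is not None:
        log.append(("PresentCompleteNotify->application", origin, serial))
    log.append(("idle-fence-triggered", pixmap))

present_pixmap("appwin", "pix1", 7, redirected=True)
```

The application never learns that redirection happened; its completion event carries the original window and serial number.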

Multiple Application Redirection

If multiple applications perform PresentRegion operations within the same frame, then the compositing manager will receive multiple PresentRedirectNotify events, and can simply construct multiple new PresentRegion requests. If these are all queued to the same global MSC, they will execute at the same frame boundary. No inter-operation dependency exists here.

Complex Presentations

Ok, so the simple case looks like it’s pretty darn simple to implement, and satisfies the design goals nicely. Let’s look at a couple of more complicated cases in common usage; the first is with translucency, the second with scaling application images down to thumbnails, and the third with partial application window updates.

Redirection with Translucency

If the compositing manager discovers that a portion of the updated region overlays or is overlaid by something translucent (either another window, or drop shadows, or something else), then a composite image for that area must be constructed before the whole can be presented. Starting when the compositing manager receives the event, we have:

  1. The compositing manager receives this event. Using the new pixmap, along with pixmaps for the other involved windows and graphical elements, the compositing manager constructs an updated view for the affected portion of the screen in a back buffer pixmap. Once complete, a PresentRegion operation that uses this back buffer pixmap is sent to the X server.

    Again, the original target window and serial number are also delivered to the server so that a suitable PresentCompleteNotify event can be delivered to the application.

  2. The server executes this new PresentRegion operation; PresentCompleteNotify events are delivered, and idle fences triggered as appropriate.

Redirection with Transformation

Transformation of the window contents means that we cannot always update a portion of the back buffer directly from the provided application pixmap as that will not contain the window border. Contents generated from a region that includes both application pixels and window border pixels must be sourced from a single pixmap containing both sets of pixels.

One option that I’ve discussed in the past to solve this would be to have the original application allocate the pixmap large enough to hold both the application contents and the surrounding window border. Have it draw the application contents at the correct offset within this pixmap, and then have the window manager contents drawn around that; either automatically by the X server, or even manually by the compositing manager.

That would be mighty convenient for the compositing manager, but would require significant additional infrastructure throughout the X server and—even harder—the drawing system (OpenGL or some other system). There’s another reason to want this though, and that’s for sub-frame buffer scanout page swapping.

The second option would be for the compositing manager to combine these images itself; there’s a nice pixmap already containing the window manager image—the composite redirect buffer. Taking the provided source pixmap and copying it directly to the target window will construct our composite image, just as if we had no Present redirection in place.

This will cost an additional copy though, which we’ve promised to avoid. Of course, as it’s just for thumb-nailing or other visual effects, perhaps the compositing manager could perform this operation at a reduced frame rate, so that overall system performance didn’t suffer.

Retaining Access to the Application Buffer

Above, I discussed having the idle fence from the redirected PresentRegion operation be sent along with the replacement PresentRegion operation. This ignores the fact that the compositing manager may well need the contents of that application frame again in the future, when displaying changes for other applications that involve the same region of the screen.

With the goal of making sure the idle fences are triggered as soon as possible so that applications can re-use or free the buffers quickly, let’s think about when the triggering can occur.

  1. Full-screen flipped applications. In this case, the application’s idle fence can be triggered once the application provides a new frame and the X server has flipped to that new frame, or some other scanout buffer.

  2. Windowed, copied applications. In this case, the application’s idle fence can be triggered once the application provides a new frame to the compositing manager, and the X server doesn’t have any presentations queued.

In both cases, we require that both the X server and the compositing manager be ‘finished’ with the buffer before the application’s idle fence should be triggered.
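One way to express that “both parties finished” condition (an illustrative sketch, not proposed protocol) is a simple reference count on the buffer: the idle fence triggers only once the server and the compositing manager have each released it.

```python
class PresentedBuffer:
    """Idle-fence model: trigger only when every user has released the buffer."""
    def __init__(self, users=("xserver", "compositor")):
        self.busy = set(users)
        self.idle_fired = False

    def release(self, user):
        self.busy.discard(user)
        if not self.busy and not self.idle_fired:
            self.idle_fired = True      # the application's idle fence fires here
        return self.idle_fired

buf = PresentedBuffer()
buf.release("xserver")        # still held by the compositing manager
buf.release("compositor")     # last user gone: idle fence fires
```

The release order doesn’t matter, which is the property both cases above need.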

One easy way to get this behavior is for the compositing manager to create a new idle fence for its operations. When that is triggered, it would receive an X event and then trigger the application’s idle fence as appropriate. This would add considerable latency to the application’s idle fence—a round trip through the compositing manager.

The alternative would be to construct some additional protocol to make the application’s idle fence ‘dependent’ on the Present operation and some additional state provided by the compositing manager.

Some experimentation here is warranted, but my experience with latency in this area is that it causes applications to end up allocating another back buffer as the idle notification arrives just after a buffer allocation request comes down to the rendering library. Definitely sub-optimal.

An Aside on Media Stream Counters

The GLX_OML_sync_control extension defines the Media Stream Counter (MSC) as a counter unique to the graphics subsystem which is incremented each time a vertical retrace occurs. That would be trivial if we had only one vertical retrace sequence in the world. However, we actually have N+1 such counters, one for each of the N active monitors in the system and a separate fake counter to be used when none of the other counters is available.

In the current Present implementation, windows transition among the various Media Stream Counter domains as they move between the various monitors, and those monitors get turned on and off. As they move between these counter domains, Present tracks a global offset from their original domain. This offset ensures that the MSC value remains monotonically increasing as seen by each window. What it does not ensure is that all windows have comparable MSC sequence values; two windows on the same monitor may well have different MSC values for the same physical retrace event.

And, even moving a window from one MSC domain to another and back won’t make it return to the original MSC sequence values due to differences in refresh rates between the monitors.
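The per-window offset tracking can be sketched as follows (my illustration, not the Present implementation): when a window moves to a new CRTC, the offset is rebased so the window-visible MSC never goes backwards, at the cost of windows on the same monitor disagreeing about absolute values.

```python
class WindowMsc:
    """Keep a window's MSC monotonic as it moves between CRTC MSC domains."""
    def __init__(self, crtc_msc):
        self.offset = 0                  # window MSC = CRTC MSC + offset
        self.last_crtc_msc = crtc_msc

    def window_msc(self, crtc_msc):
        self.last_crtc_msc = crtc_msc
        return crtc_msc + self.offset

    def move_to_crtc(self, new_crtc_msc):
        # Rebase so the window sees the same MSC value it saw last,
        # regardless of the new CRTC's counter.
        current = self.last_crtc_msc + self.offset
        self.offset = current - new_crtc_msc
        self.last_crtc_msc = new_crtc_msc

w = WindowMsc(crtc_msc=1000)
assert w.window_msc(1005) == 1005      # offset starts at zero
w.move_to_crtc(new_crtc_msc=50)        # new monitor's counter is much lower
assert w.window_msc(51) == 1006        # window MSC keeps counting upward
```

After a round trip between monitors with different refresh rates, the accumulated offset means the window never returns to its original MSC sequence, exactly as described above.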

Internally, Present asserts that each CRTC in the system identifies a unique MSC domain, and it has a driver API which identifies which CRTC a particular window should be associated with. Once a particular CRTC has been identified for a window, client-relative MSC values and CRTC-relative MSC values can be exchanged using an offset between that CRTC MSC domain and the window MSC domain.

The Intel driver assigns CRTCs to windows by picking the CRTC showing the greatest number of pixels for a particular window. When two CRTCs show the same number of pixels, the Intel driver picks the first in the list.

Vblank Synchronization and Multiple Monitors

Ok, so each window lives in a particular MSC domain, clocked by the MSC of the CRTC the driver has associated it with. In an un-composited world, this makes picking when to update the screen pretty simple; Present updates the screen when vblank happens in the CRTC associated with the window.

In the composite redirected case, it’s a bit harder — all of the PresentRegion operations are going to target the root window, and yet we want updates for each window to be synchronized with the monitor containing that window. Of course, the root window belongs to a single MSC domain (likely the largest monitor, using the selection algorithm described above from the Intel driver). So, any PresentRegion requests will be timed relative to that single monitor.

I think what is required here is for the PresentRegion request to take an optional CRTC argument, which would then be used as the MSC domain instead of the window MSC domain. All of the timing arguments would be interpreted relative to that CRTC MSC domain.

The PresentRedirectNotify event would then contain the relevant CRTC and the MSC value would be relative to that CRTC.

A clever Compositing manager could then decompose a global PresentRegion operation into per-CRTC PresentRegion operations and ensure that multiple monitors were all synchronized correctly.

We could take this even further and have PresentRegion pass a smaller CRTC-sized pixmap down to the kernel, effectively providing per-CRTC pixmaps with no visible explicit protocol…

Other Composite Users

Ok, so the above discussion is clearly focused on getting the correct contents onto the screen with minimal copies along the way. However, what I’ve ignored is how to deal with other applications using Composite at the same time. They’re all going to expect that the composite redirect buffers will contain correct window contents at all times, and yet we’ve just spent a bunch of time making that not be the case, avoiding copies into those buffers by copying directly to the compositing manager’s back or front buffers.

Obviously the X server is aware of when this happens; the compositing manager will have selected for manual redirection on all top-level windows, while our other application will have only been able to select for automatic redirection.

So, we’ve got two pretty clear choices here:

  1. Have the X server change how Present redirection works when some other application selects for Automatic redirection on a window. It would copy the source pixmap into the window buffer and then send (a modified?) PresentRedirectNotify event to the compositing manager.

  2. Include a flag in the PresentRedirectNotify event that the composite redirect buffer needs to eventually get the contents of the source pixmap, and then expect the compositing manager to figure out what to do.

Development Plans

As usual, I’m going to pick the path of least resistance for all of the above options and see how things look; where the easy thing works, we can keep using it. Where the easy thing fails, I’ll try something else. The changes required for this are pretty minimal.

The PresentRegion request needs to gain a list of window/serial pairs that are also to be notified when the operation completes:

    PRESENTNOTIFY:
        window: WINDOW
        serial: CARD32

    PresentRegion:
        window: WINDOW
        pixmap: PIXMAP
        serial: CARD32
        valid-area: REGION or None
        update-area: REGION or None
        x-off, y-off: INT16
        idle-fence: FENCE
        target-crtc: CRTC or None
        target-msc: CARD64
        divisor: CARD64
        remainder: CARD64
        notifies: LISTofPRESENTNOTIFY
        Errors: Window, Pixmap, Match

The ‘target-crtc’ parameter explicitly identifies a CRTC MSC domain. If None, then this request implicitly uses the window MSC domain.

‘notifies’ provides a list of windows that will also receive PresentCompleteNotify events with the associated serial number when this PresentRegion operation completes.

The PresentRedirectNotify event gains matching ‘target-crtc’ and ‘update-window’ fields:

    type: CARD8         XGE event type (35)
    extension: CARD8        Present extension request number
    length: CARD16          2
    evtype: CARD16          Present_RedirectNotify
    event-window: WINDOW
    window: WINDOW
    pixmap: PIXMAP
    serial: CARD32
    valid-area: REGION
    valid-rect: RECTANGLE
    update-area: REGION
    update-rect: RECTANGLE
    x-off, y-off: INT16
    target-crtc: CRTC
    target-msc: CARD64
    idle-fence: FENCE
    update-window: BOOL

The ‘target-crtc’ identifies which CRTC MSC domain the ‘target-msc’ value relates to.

‘divisor’ and ‘remainder’ have been removed from the event, as the target-msc value has already been adjusted using the values supplied by the application.
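For reference, the divisor/remainder adjustment follows the GLX_OML_sync_control convention: with a non-zero divisor, the effective target is the first MSC at or after the requested target whose value modulo ‘divisor’ equals ‘remainder’. A minimal sketch of that computation (the function name is invented for the example):

```c
#include <stdint.h>

/* Compute the adjusted target MSC, OML_sync_control-style: the first
 * MSC >= target_msc with msc % divisor == remainder. A zero divisor
 * means "use target_msc as-is". */
static uint64_t
adjust_target_msc(uint64_t target_msc, uint64_t divisor, uint64_t remainder)
{
    if (divisor == 0)
        return target_msc;

    uint64_t rem = target_msc % divisor;

    if (rem == remainder)
        return target_msc;          /* already on the right phase */
    if (rem < remainder)
        return target_msc + (remainder - rem);
    return target_msc + divisor - rem + remainder;
}
```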

If ‘update-window’ is True, then the recipient of this event is expected to keep the window’s own contents reasonably up to date by manually copying the contents of ‘pixmap’ to the window.

Beyond these two protocol changes, the compositing manager is expected to receive Sync events when the idle-fence is triggered and then manually perform a Sync operation to trigger the client’s idle-fence when appropriate.

I’m planning to work on these changes, and then go re-work xcompmgr (or perhaps unagi, which certainly looks less messy) to incorporate support for Present redirection. The goal is to have something to demonstrate at Guadec, which doesn’t seem impossible, aside from leaving on vacation in four days…

Posted Tue Jul 23 17:40:21 2013 Tags: fdo

Implementing Vblank Synchronization in the Present Extension

This is mostly a status update on how the Present extension is doing; the big news this week is that I’ve finished implementing vblank synchronized blts and flips, and things seem to be working quite well.

Vblank Synchronized Blts

The goal here is to have the hardware execute the blt operation in a way that avoids any tearing artifacts. In current drivers, there are essentially two different ways to make this happen:

  1. Insert a command into the ring which blocks execution until a suitable time immediately preceding the blt operation.

  2. Queue the blt operation at vblank time so that it executes before the scanout starts.

Option 1. provides the fewest artifacts; if the hardware can blt faster than scanout, there shouldn’t ever be anything untoward visible on the screen. However, it also blocks future command execution within the same context. For example, if two vblank synchronized blts are queued at the same time, it’s possible for the second blt to be delayed by yet another frame, causing both applications to run at half of the frame rate.

Option 2. avoids blocking the hardware, allowing for ongoing operations to proceed on the hardware without waiting for the synchronized blt to complete. However, it can cause artifacts if the delay from the vblank event to the eventual execution of the blt command is too long.

Queuing the blt right when it needs to execute means that we also have the opportunity to skip some blts; if the application presents two buffers within the same frame time, the blt of the first buffer can be skipped, saving memory bandwidth and time.

Present uses Option 2, which may occasionally cause a tearing artifact, but avoids slowing down applications while allowing the X server to discard overlapping blt operations when possible.
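The blt-skipping behavior can be modeled concretely. The sketch below is illustrative only (the types and names are invented, not the server’s real data structures): each window carries at most one pending blt, and presenting a second pixmap for the same target frame replaces, and thereby skips, the first:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of per-window blt queuing with frame skipping. */
typedef struct {
    bool     pending;        /* a blt is queued for a future vblank */
    int      pixmap;         /* stand-in for the presented pixmap */
    uint64_t target_msc;     /* frame the blt should land in */
    int      blts_executed;
    int      blts_skipped;
} present_window_t;

/* An application presents a pixmap. If one is already queued for the
 * same frame, it is replaced -- its blt never executes, saving memory
 * bandwidth and time. */
static void
present_pixmap(present_window_t *w, int pixmap, uint64_t target_msc)
{
    if (w->pending && w->target_msc == target_msc)
        w->blts_skipped++;
    w->pending = true;
    w->pixmap = pixmap;
    w->target_msc = target_msc;
}

/* Called when the vblank notification arrives: queue the single
 * surviving blt so it executes before scanout starts. */
static void
vblank_handler(present_window_t *w, uint64_t msc)
{
    if (w->pending && w->target_msc <= msc) {
        w->blts_executed++;
        w->pending = false;
    }
}
```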

Queuing the Blt at Vblank

There are several options for getting the blt queued and executed when the vblank occurs:

  1. Queue the blt from the interrupt handler

  2. Queue the blt from a kernel thread running in response to the interrupt

  3. Send an event up to user space and have the X server construct the blt command.

These are listed in order of increasing maximum latency, but also decreasing complexity.

Option 1. is made more complicated as much of the work necessary to get a command queued to the hardware cannot be done from interrupt context. One can imagine having the desired command already present in the ring buffer and have the interrupt handler simply move the ring tail pointer value. Future operations to be queued before the vblank operation could then re-write the ring as necessary. A queued operation could also be adjusted by the X server as necessary to keep it correct across changes to the window system state.

Option 2. is similar, but the kernel implementation should be quite a bit simpler as the queuing operation is done in process context and can use the existing driver infrastructure. For the X server, this is the same as Option 1, requiring that it construct a queued blt operation and deliver that to the kernel, and then revoke and re-queue if the X server state changed before the operation was completed.

Option 3 is the simplest of all, requiring no changes within the kernel and few within the X server. The X server waits to receive a vblank notification event for the appropriate frame and then simply invokes existing mechanisms to construct and queue the blt operation to the kernel.

Unsurprisingly, Present currently uses Option 3. If that proves to generate too many display artifacts, we can come back and change the code to try something more complicated.

Flipping the Frame Buffer

Taking advantage of the hardware’s ability to quickly shift scanout from one chunk of memory to another is critical to providing efficient buffer presentation within the X server. It is slightly more complicated to implement than simply copying data to the current scanout buffer for a few reasons:

  1. The presented pixmap is owned by the application, and so it shouldn’t be used except when the presented window covers the whole screen. When the window gets reconfigured, we end up copying the window’s pixmap to the regular screen pixmap.

  2. The kernel flipping API is asynchronous, and doesn’t provide any abort mechanism. This isn’t usually much of an issue; we simply delay reporting the actual time of flip until the kernel sends the notification event to the X server. However, if the window is reconfigured or destroyed while the flip is still pending, cleaning that up must wait until the flip has finished.

  3. The application’s buffer remains ‘busy’ until it is no longer being used for scanout; that means applications will have to be aware of this and ensure that they don’t deadlock waiting for the current scanout buffer to become idle before switching to a new scanout buffer.
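Complication 2 above amounts to a small state machine. Here’s a toy model (illustrative only; not the server’s real structures) of deferring the un-flip until the kernel’s completion event arrives:

```c
#include <stdbool.h>

/* State for one window whose pixmap may be flipped to the scanout. */
typedef struct {
    bool flip_pending;      /* kernel flip queued, no completion yet */
    bool scanning_out;      /* window pixmap is the scanout buffer */
    bool cleanup_deferred;  /* un-flip requested while flip pending */
} flip_state_t;

static void
flip_queued(flip_state_t *s)
{
    s->flip_pending = true;
}

/* Window reconfigured or destroyed: stop scanning out of its pixmap.
 * The kernel flip cannot be aborted, so if one is outstanding, the
 * cleanup must wait for the completion event. */
static void
window_unflip(flip_state_t *s)
{
    if (s->flip_pending) {
        s->cleanup_deferred = true;
        return;
    }
    s->scanning_out = false;
}

/* Kernel flip-complete event: the flip landed, then perform any
 * cleanup that was deferred while it was in flight (switching back
 * to the regular screen pixmap). */
static void
flip_complete(flip_state_t *s)
{
    s->flip_pending = false;
    s->scanning_out = true;
    if (s->cleanup_deferred) {
        s->cleanup_deferred = false;
        s->scanning_out = false;
    }
}
```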

Present is different from DRI2 in using application-allocated buffers for this operation. For DRI2, when flipping to a window buffer, that buffer becomes the screen pixmap — the driver flips the new buffer object into the screen pixmap and releases the previous buffer object for other use. For Present, as the buffer is owned by the application, I figured it would be better to switch back to the ‘real’ screen buffer when necessary. This also means that applications aren’t left holding a handle to the frame buffer, which seems like it might be a nice feature.

The hardest part of this work was dealing with client and server shutdown, with objects getting deleted in random orders while other data structures retained references.

(The kernel DRM drivers use the term ‘page flipping’ to mean an atomic flip between one frame buffer and another, generally implemented by simply switching the address used for the scanout buffer. I’d like to avoid using the word ‘page’ in this context as we’re not flipping memory pages individually, but rather a huge group of memory that forms an entire frame buffer. We could use ‘plane flipping’ (as intel docs do), ‘frame buffer flipping’ (but that’s a mouthful), ‘display flipping’ or almost anything but ‘page flipping’).

Overall DRI3000 Status

At this point, the DRI3 extension is complete and the Present extension is largely complete, except for redirection for compositors. The few piglit tests for GLX_OML_sync_control all pass now, which is at least better than DRI2 does.

I think I’ve effectively replicated the essential features of DRI2 while offering room to implement a couple of new GL extensions:

  • GLX_EXT_swap_control_tear. This will provide applications with the ability to avoid dropping frames when pushing the hardware just over the frame rate limit.

  • EGL_EXT_buffer_age. (I assume we’ll probably want a GLX version as well?) This will allow compositors to more efficiently perform partial updates in a flipping environment, and is enabled by having all of the buffer management within the GL library.

The code for this stuff has all been pushed to a number of repositories:

  • git:// master. DRI3 protocol specification and X server headers.
  • git:// master. Present protocol specification and X server headers.
  • git:// dri3. XCB protocol defines for both DRI3 and Present.
  • git:// dri3. XCB library changes for file descriptor passing.
  • git:// dri3. X server with file descriptor passing, DRI3 and Present support.
  • git:// dri3. Mesa with DRI3/Present support for GLX.
  • git:// dri3. DRM library with defines for async flipping.
  • git:// dri3. Intel driver with DRI3, Present and async flipping support.
  • git:// dri3. Kernel with async flipping.
Posted Mon Jul 22 17:09:29 2013 Tags: fdo