Core Rendering with Glamor

I've hacked up the intel driver to bypass all of the UXA paths when Glamor is enabled so I'm currently running an X server that uses only Glamor for all rendering. There are still too many fall backs, and performance for some operations is not what I'd like, but it's entirely usable. It supports DRI3, so I even have GL applications running.

Core Rendering Status

I've continued to focus on getting the core X protocol rendering operations complete and correct; those remain a critical part of many X applications and are a poor match for GL. At this point, I've got accelerated versions of the basic spans functions, filled rectangles, text and copies.

GL and Scrolling

OpenGL has been on a many-year vendetta against one of the most common 2D accelerated operations -- copying data within the same object, even when that operation overlaps itself. This used to be the most performance-critical operation in X; it was used for scrolling your terminal windows and when moving windows around on the screen.

Reviewing the OpenGL 3.x spec, Eric and I both read the glCopyPixels specification as clearly requiring correct semantics for overlapping copy operations -- it says that the operation must be equivalent to reading the pixels and then writing the pixels. My CopyArea acceleration thus uses this path for the self-copy case. However, the ARB decided that having a well defined blt operation was too nice to the users, so the current 4.4 specification adds explicit language to assert that this is not well defined anymore (even in the face of the existing language which is pretty darn unambiguous).

I suspect we'll end up creating an extension that offers what we need here; applications are unlikely to stop scrolling stuff around, and GPUs (at least Intel) will continue to do what we want. This is the kind of thing that makes GL maddening for 2D graphics -- the GPU does what we want, and the GL doesn't let us get at it.

For implementations not capable of supporting the required semantic, someone will presumably need to write code that creates a temporary copy of the data.

PBOs for fall backs

For operations which Glamor can't manage, we need to fall back to using a software solution. Direct-to-hardware acceleration architectures do this by simply mapping the underlying GPU object to the CPU. GL doesn't provide this access, and it's probably a good thing as such access needs to be carefully synchronized with GPU access, and attempting to access tiled GPU objects with the CPU require either piles of CPU code to 'de-tile' accesses (ala wfb), or special hardware detilers (like the Intel GTT).

However, GL does provide a fairly nice abstraction called pixel buffer objects (PBOs) which work to speed access to GPU data from the CPU.

The fallback code allocates a PBO for each relevant X drawable, asks GL to copy pixels in, and then calls fb, with the drawable now referencing the temporary buffer. On the way back out, any potentially modified pixels are copied back through GL and the PBOs are freed.

This turns out to be dramatically faster than malloc'ing temporary buffers as it allows the GL to allocate memory that it likes, and for it to manage the data upload and buffer destruction asynchronously.

Because X pixmaps can contain many X windows (the root pixmap being the most obvious example), they are often significantly larger than the actual rendering target area. As an optimization, the code only copies data from the relevant area of the pixmap, saving considerable time as a result. There's even an interface which further restricts that to a subset of the target drawable which the Composite function uses.

Using Scissoring for Clipping

The GL scissor operation provides a single clipping rectangle. X provides a list of rectangles to clip to. There are two obvious ways to perform clipping here -- either perform all clipping in software, or hand each X clipping rectangle in turn to GL and re-execute the entire rendering operation for each rectangle.

You'd think that the former plan would be the obvious choice; clearly re-executing the entire rendering operation potentially many times is going to take a lot of time in the GPU.

However, the reality is that most X drawing occurs under a single clipping rectangle. Accelerating this common case by using the hardware clipper provides enough benefit that we definitely want to use it when it works. We could duplicate all of the rendering paths and perform CPU-based clipping when the number of rectangles was above some threshold, but the additional code complexity isn't obviously worth the effort, given how infrequently it will be used. So I haven't bothered. Most operations look like this:

Allocate VBO space for data

Fill VBO with X primitives

loop over clip rects {
    glScissor()
    glDrawArrays()
}

This obviously out-sources as much of the problem as possible to the GL library, reducing the CPU time spent in glamor to a minimum.

A Peek at Some Code

With all of these changes in place, drawing something like a list of rectangles becomes a fairly simple piece of code:

First, make sure the program we want to use is available and can be used with our GC configuration:

prog = glamor_use_program_fill(pixmap, gc,
                               &glamor_priv->poly_fill_rect_program,
                               &glamor_facet_polyfillrect);

if (!prog)
    goto bail_ctx;

Next, allocate the VBO space and copy all of the X data into it. Note that the data transfer is simply 'memcpy' here -- that's because we break the X objects apart in the vertex shader using instancing, avoiding the CPU overhead of computing four corner coordinates.

/* Set up the vertex buffers for the points */

v = glamor_get_vbo_space(drawable->pScreen, nrect * (4 * sizeof (GLshort)), &vbo_offset);

glEnableVertexAttribArray(GLAMOR_VERTEX_POS);
glVertexAttribDivisor(GLAMOR_VERTEX_POS, 1);
glVertexAttribPointer(GLAMOR_VERTEX_POS, 4, GL_SHORT, GL_FALSE,
                      4 * sizeof (GLshort), vbo_offset);

memcpy(v, prect, nrect * sizeof (xRectangle));

glamor_put_vbo_space(screen);

Finally, loop over the pixmap tile fragments, and then over the clip list, selecting the drawing target and painting the rectangles:

glEnable(GL_SCISSOR_TEST);

glamor_pixmap_loop(pixmap_priv, box_x, box_y) {
    int nbox = RegionNumRects(gc->pCompositeClip);
    BoxPtr box = RegionRects(gc->pCompositeClip);

    glamor_set_destination_drawable(drawable, box_x, box_y, TRUE, FALSE, prog->matrix_uniform, &off_x, &off_y);

    while (nbox--) {
        glScissor(box->x1 + off_x,
                  box->y1 + off_y,
                  box->x2 - box->x1,
                  box->y2 - box->y1);
        box++;
        glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, nrect);
    }
}

GL texture size limits

X pixmaps use 16 bit dimensions for width and height, allowing them to be up to 65536 x 65536 pixels. Because the X coordinate space is signed, only a quarter of this space is actually useful, which makes the useful size of X pixmaps only 32767 x 32767. This is still larger than most GL implementations offer as a maximum texture size though, and while it would be nice to just say 'we don't allow pixmaps larger than GL textures', the reality is that many applications expect to be able to allocate such pixmaps today, largely to hold the ever increasing size of digital photographs.

Glamor has always supported large X pixmaps; it does this by splitting them up into tiles, each of which is no larger than the largest texture supported by the driver. What I've added to Glamor is some simple macros that walk over the array of tiles, making it easy for the rendering code to support large pixmaps without needing any special case code.

Glamor also had some simple testing support -- you can compile the code to ignore the system-provided maximum texture size and supply your own value. This code had gone stale, and couldn't work as there were parts of the code for which tiling support just doesn't make sense, like the glyph cache, or the X scanout buffer. I fixed things so that you could leave those special cases as solitary large tiles while breaking up all other pixmaps into tiles no larger than 32 pixels square.

I hope to remove the single-tile case and leave the code supporting only the multiple-tile case; we have to have the latter code, and so having the single-tile code around simply increases our code size for not obvious benefit.

Getting accelerated copies between tiled pixmaps added a new coordinate system to the mix and took a few hours of fussing until it was working.

Rebasing Many (many) Times

I'm sure most of us remember the days before git; changes were often monolithic, and the notion of changing how the changes were made for the sake of clarity never occurred to anyone. It used to be that the final code was the only interesting artifact; how you got there didn't matter to anyone. Things are different today; I probably spend a third of my development time communicating how the code should change with other developers by changing the sequence of patches that are to be applied.

In the case of Glamor, I've now got a set of 28 patches. The first few are fixes outside of the glamor tree that make the rest of the server work better. Then there are a few general glamor infrastructure additions. After that, each core operation is replaced, one a at a time. Finally, a bit of stale code is removed. By sequencing things in a logical fashion, I hope to make review of the code easier, which should mean that people will spend less time trying to figure out what I did and be able to spend more time figuring out if what I did is reasonable and correct.

Supporting Older Versions of GL

All of the new code uses vertex instancing to move coordinate computation from the CPU to the GPU. I'm also pulling textures apart using integer operations. Right now, we should correctly fall back to software for older hardware, but it would probably be nicer to just fall back to simpler GL instead. Unless everyone decides to go buy hardware with new enough GL driver support, someone is going to need to write simplified code paths for glamor.

If you've got such hardware, and are interested in making it work well, please take this as an opportunity to help yourself and others.

Near-term Glamor Goals

I'm pretty sure we'll have the code in good enough shape to merge before the window closes for X server 1.16. Eric is in charge of the glamor tree, so it's up to him when stuff is pulled in. He and Markus Wick have also been generating code and reviewing stuff, but we could always use additional testing and review to make the code as good as possible before the merge window closes.

Markus has expressed an interest in working on Glamor as a part of the X.org summer of code this year; there's clearly plenty of work to do here, Eric and I haven't touched the render acceleration stuff at all, and that code could definitely use some updating to use more modern GL features.

If that works as well as the core rendering code changes, then we can look forward to a Glamor which offers GPU-limited performance for classic X applications, without requiring GPU-specific drivers for every generation of every chip.