Current Glamor Performance

I finally managed to get a Gigabyte Brix set up running Debian so that I could do some more reasonable performance characterization of Glamor in its current state. I wanted to use this particular machine because it has enough cooling to keep from thermally throttling the CPU/GPU package.

This is running my glamor-server branch of the X server, which completes the core operation rework and also includes some core X server performance improvements for filled and outlined arcs.

Changes in X11perf

First off, I did some analysis of what x11perf was doing and found that it wasn't quite measuring what we thought. I'm not interested in competing on absolute x11perf numbers; I'm only interested in doing relative measurements of useful operations, so fixing the tool to measure what I want seems reasonable to me.

When x11perf was first written, it drew 100x100 rectangles tight against one another without any gap, and it filled the window with them, drawing a 6x6 grid of 100x100 rectangles in a 600x600 window. To exercise the rectangle code and check edge conditions more thoroughly, we added a one pixel gap between the rectangles. However, we didn't reduce the number of rectangles drawn, so we ended up drawing 11 of the 36 rectangles on top of the first set of 25. Simple region computations would allow the X server to draw only 25 most of the time, skipping the redundant rectangles.
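The arithmetic behind the 25 and 11 can be checked with a minimal Python sketch. The function names are hypothetical and this is not x11perf's code; it assumes each rectangle is followed by the gap and that only whole rectangles count.

```python
def rects_per_row(window: int, size: int, gap: int) -> int:
    """How many size-wide rectangles fit across `window` pixels when
    consecutive rectangles are separated by `gap` empty pixels."""
    pitch = size + gap
    # The last rectangle needs no trailing gap, hence the "+ gap".
    return (window + gap) // pitch


def grid_counts(window: int = 600, size: int = 100, gap: int = 1,
                requested: int = 36):
    """Return (per_row, distinct, redundant) for a square grid."""
    per_row = rects_per_row(window, size, gap)
    distinct = per_row * per_row
    # Anything requested beyond the distinct positions lands on top
    # of a rectangle that was already drawn.
    redundant = max(0, requested - distinct)
    return per_row, distinct, redundant


if __name__ == "__main__":
    print(grid_counts(gap=0))  # original layout, no gap
    print(grid_counts(gap=1))  # layout after the one pixel gap was added
```

With no gap, all 36 rectangles land in distinct positions; with the one pixel gap only 5 fit per row, leaving 25 distinct positions and 11 redundant draws.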

The vertical and horizontal line tests were added a while after the first set of tests, and were done without regard to how an X server might optimize for them. x11perf draws these lines packed tightly together, creating a single square of pixels for the result.

EXA, UXA and SNA all take vertical and horizontal lines and convert them to rectangles, then take the rectangles and clip them against the window clip list by computing a region from them and intersecting that with the GC composite clip. It's a completely reasonable plan. However, when you take what x11perf was drawing and run it through this code, you end up with a single solid rectangle, which is surprisingly fast compared with drawing individual lines.
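To see why tightly packed lines collapse, here is a simplified Python sketch. It is a stand-in for a real region implementation, not the EXA, UXA or SNA code: it only merges horizontally touching rectangles that share the same vertical extent, and all names are hypothetical.

```python
def vline_to_rect(x, y1, y2):
    """Turn a vertical zero-width line into an (x, y, w, h) rectangle,
    assuming both endpoints are drawn."""
    top = min(y1, y2)
    return (x, top, 1, abs(y2 - y1) + 1)


def coalesce(rects):
    """Merge rectangles that touch or overlap horizontally and have the
    same y and height. A real region code handles far more cases."""
    merged = []
    for x, y, w, h in sorted(rects):
        if merged:
            px, py, pw, ph = merged[-1]
            if py == y and ph == h and px + pw >= x:
                # Extend the previous rectangle to cover this one.
                merged[-1] = (px, py, max(px + pw, x + w) - px, ph)
                continue
        merged.append((x, y, w, h))
    return merged


if __name__ == "__main__":
    packed = [vline_to_rect(x, 0, 99) for x in range(100)]
    spaced = [vline_to_rect(x, 0, 99) for x in range(0, 200, 2)]
    print(len(coalesce(packed)))  # lines drawn tight against each other
    print(len(coalesce(spaced)))  # lines drawn a pixel apart
```

In this model the packed lines reduce to one 100x100 rectangle, while lines spaced a pixel apart stay as 100 separate one pixel wide rectangles.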

I "fixed" the overlapping rectangle case by reducing the number of boxes drawn from 36 to 25, and I fixed the vertical and horizontal line case by spacing the lines a pixel apart.

I've pushed out these changes to my x11perf repository on freedesktop.org.

What's Fast

Things that match GL's capabilities are fast; things that don't are slow. No surprises there. What's interesting is precisely what matches GL.

Patterns For Free

Because GL makes it easy to program fill patterns into the GPU, there are essentially no performance differences between solid and patterned operations.

GL Lines

Glamor uses GL lines, which can be programmed to match X semantics, to quite good effect. The only trick required was to deal with cap styles. GL never draws the final pixel in a line, while X does unless the cap style is CapNotLast. The solution was to draw an extra line segment containing a single pixel at the end of every joined set of lines for this case.
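The cap style trick can be modelled with a short Python sketch. This is an illustration under stated assumptions rather than Glamor's code: the extra segment is assumed to run one pixel to the right of the last vertex, the toy rasterizer only handles axis-aligned segments, and the function names are hypothetical. The CapNotLast value of 0 matches the X11 protocol headers.

```python
CAP_NOT_LAST = 0  # X11 CapNotLast
CAP_BUTT = 1      # X11 CapButt


def line_strip_vertices(points, cap_style):
    """Vertices to hand to a GL-style line strip. Unless the cap style
    is CapNotLast, append one more vertex so that a one pixel segment
    covers the final point of the joined lines."""
    verts = list(points)
    if verts and cap_style != CAP_NOT_LAST:
        x, y = verts[-1]
        verts.append((x + 1, y))  # assumed direction of the extra segment
    return verts


def gl_strip_pixels(verts):
    """Toy model of GL line rules: each segment draws every pixel from
    its start up to, but not including, its end. Axis-aligned only."""
    pixels = []
    for (x0, y0), (x1, y1) in zip(verts, verts[1:]):
        assert x0 == x1 or y0 == y1, "sketch handles axis-aligned lines only"
        dx = (x1 > x0) - (x1 < x0)
        dy = (y1 > y0) - (y1 < y0)
        x, y = x0, y0
        while (x, y) != (x1, y1):
            pixels.append((x, y))
            x, y = x + dx, y + dy
    return pixels


if __name__ == "__main__":
    path = [(0, 0), (3, 0), (3, 2)]
    print(gl_strip_pixels(line_strip_vertices(path, CAP_BUTT)))
    print(gl_strip_pixels(line_strip_vertices(path, CAP_NOT_LAST)))
```

In this model the final point (3, 2) is drawn only when the extra vertex is appended, which is the behaviour X expects for every cap style other than CapNotLast.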

The other implicit requirement is that all zero width lines look the same. Right now, I've solved that for fill styles and raster ops as they're all drawn with the same GL operations. However, for plane masks, we're currently falling back to software, which may draw something different. Fixing that isn't impossible, it's just tedious.

Text

Pushing all of the work of drawing core text into glamor wasn't terribly difficult, and the results are pretty spectacular.

What's Slow

We've still got room for improvement in Glamor, but there aren't any obvious show-stoppers to getting great performance for reasonable X applications anymore.

Wide Lines and Arcs

One of the speed-ups I've made in my glamor branch is to merge the drawing of multiple filled and zero-width arcs into a single giant GL request. That turned out to both improve performance and save a bit of code. Right now, drawing wide lines and wide arcs doesn't do this, and so we suffer from submitting many smaller requests to GL. It's hard to get excited about speeding any of this up as all of the wide primitives are essentially unused these days.

Filled Polygons

Because X only lets applications draw a single polygon in each request, Glamor can't really gain any efficiency from batching work unless we start looking ahead in the X protocol stream to see if the next request is another polygon. Alternatively, we could leave the span operation pending to see if more spans were coming before the X server went idle. Neither of these is all that exciting though; X polygons just aren't that useful.

Render Operations

These are still not structured to best fit modern GL; some work here would help a bunch. We've got a GSoC student ready to go at this though, so I expect we'll have much better numbers in a few months.

Window Operations

You wouldn't think that moving and resizing windows would be so limited by drawing performance, but x11perf tests these with tiny little windows, and each operation draws or copies only a couple of little rectangles, which makes GL quite expensive. Working on speeding up GL for small numbers of operations would help a bunch here.

Unexpected Results

Solid rectangles are actually running slower than patterned rectangles, and I really have no idea why. The CPU is close to idle during the 500x500 solid rectangle test (as you'd expect, given the workload), the vertex and fragment shaders look correct out of the compiler, and yet solid rectangles run at only 0.80 of the performance of the patterned rectangles.

GL semantics for copying data essentially preclude overlapping blts of any form. There's the NV_texture_barrier extension which at least lets us do blts within the same object, but even that doesn't define how overlapping blts work. So, we have to create a temporary copy for this operation to make it work right. I was worried that this would slow things down, but the Iris Pro 3D engine is enough faster than the 2D engine that even with the extra copy, large scrolls and copies within the same object are actually faster.
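The reason for the temporary copy can be shown with a one-dimensional Python sketch. It is only a model, not Glamor's code: GL leaves the result of an overlapping copy undefined, and the front-to-back loop below is just one order the hardware might happen to use. The function names are hypothetical.

```python
def copy_in_place(buf, src, dst, n):
    """Copy n elements front to back with no temporary. When the ranges
    overlap and dst > src, later reads see already overwritten data."""
    for i in range(n):
        buf[dst + i] = buf[src + i]


def copy_via_temp(buf, src, dst, n):
    """Snapshot the source first, then write it to the destination, so
    overlap between the two ranges no longer matters."""
    tmp = buf[src:src + n]  # the extra copy
    buf[dst:dst + n] = tmp


if __name__ == "__main__":
    a = list(range(6))
    copy_in_place(a, 0, 1, 5)
    print(a)  # source was clobbered while copying

    b = list(range(6))
    copy_via_temp(b, 0, 1, 5)
    print(b)  # contents shifted right by one, as a scroll should
```

The in-place version smears the first element across the whole buffer, while the version with a temporary produces the expected shifted contents at the cost of one extra copy.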

Results

Here's a giant image showing the ratio of Glamor to both UXA and SNA running on the same machine, with all of the same software; the only change between runs was to switch the configured acceleration architecture.