Wrapping libudev using LD_PRELOAD

Peter Hutterer and I were chasing down an X server bug which was exposed when running the libinput test suite against the X server with a separate thread for input. This was crashing deep inside libudev, which led us to suspect that libudev was getting run from multiple threads at the same time.

I figured I'd be able to tell by wrapping all of the libudev calls from the server and checking to make sure we weren't ever calling it from both threads at the same time. My first attempt was a simple set of cpp macros, but that failed when I discovered that libwacom was calling libgudev, which was calling libudev.

Instead of recompiling the world with my magic macros, I created a new library which exposes all of the (public) symbols in libudev. Each of these functions does a bit of checking and then simply calls down to the 'real' function.

Finding the real symbols

Here's the snippet which finds the real symbols:

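static void *udev_symbol(const char *symbol)
{
    static void *libudev;
    static pthread_mutex_t  find_lock = PTHREAD_MUTEX_INITIALIZER;

    void *sym;
    /* Serialize the one-time dlopen so two threads can't race on it */
    pthread_mutex_lock(&find_lock);
    if (!libudev) {
        libudev = dlopen("libudev.so.1.6.4", RTLD_LOCAL | RTLD_NOW);
    }
    sym = dlsym(libudev, symbol);
    pthread_mutex_unlock(&find_lock);
    return sym;
}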

Yeah, the libudev version is hard-coded into the source; I didn't want to accidentally load the wrong one. This could probably be improved...
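
One possible improvement, sketched here rather than taken from udevwrap: a library loaded through LD_PRELOAD can ask the dynamic linker for the next definition of a symbol with dlsym(RTLD_NEXT, ...), so no file name has to be hard-coded. This assumes glibc (RTLD_NEXT needs _GNU_SOURCE) and that the real libudev is already loaded as an ordinary dependency of the program. The helper name next_symbol is made up for illustration.

```c
#define _GNU_SOURCE     /* RTLD_NEXT is a GNU extension in <dlfcn.h> */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical alternative to udev_symbol(): return the next definition
 * of 'symbol' in library search order, skipping the object that contains
 * this code. Returns NULL when no later object defines the symbol.
 */
static void *next_symbol(const char *symbol)
{
    assert(symbol != NULL);
    return dlsym(RTLD_NEXT, symbol);
}
```

The trade-off is that this finds whatever libudev the program would have used anyway, which is usually what you want, but it won't see a library that was loaded with dlopen and RTLD_LOCAL.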

Checking for re-entrancy

As mentioned above, we suspected that the bug was caused when libudev got called from two threads at the same time. So, our checks are pretty simple; we just count the number of calls into any udev function (to handle udev calling itself). If there are other calls in process, we make sure the thread ID for those is the same as the current thread.

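static void udev_enter(const char *func) {
    pthread_mutex_lock(&check_lock);
    /* Nested calls are fine, but only from the thread already inside */
    assert (udev_running == 0 || udev_thread == pthread_self());
    udev_thread = pthread_self();
    udev_func[udev_running] = func;
    udev_running++;
    pthread_mutex_unlock(&check_lock);
}

static void udev_exit(void) {
    pthread_mutex_lock(&check_lock);
    udev_running--;
    if (udev_running == 0)
        udev_thread = 0;
    udev_func[udev_running] = 0;
    pthread_mutex_unlock(&check_lock);
}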
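
To see the check in isolation, here is a self-contained sketch of the same idea. It is not the udevwrap code: the names checked_enter, checked_exit and other_thread_can_enter are hypothetical, it reports a conflict by returning false instead of asserting, and it compares thread IDs with pthread_equal, which is the portable way to do that.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t check_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t owner;     /* meaningful only while depth > 0 */
static int depth;           /* nested calls currently in progress */

/* Record entry; returns false if another thread is already inside. */
static bool checked_enter(void)
{
    bool ok;

    pthread_mutex_lock(&check_lock);
    ok = depth == 0 || pthread_equal(owner, pthread_self());
    if (ok) {
        owner = pthread_self();
        depth++;
    }
    pthread_mutex_unlock(&check_lock);
    return ok;
}

/* Record exit from a call that checked_enter() accepted. */
static void checked_exit(void)
{
    pthread_mutex_lock(&check_lock);
    depth--;
    pthread_mutex_unlock(&check_lock);
}

/* Thread body for the demonstration: try to enter and store the result. */
static void *try_enter(void *result)
{
    bool *allowed = result;

    *allowed = checked_enter();
    if (*allowed)
        checked_exit();
    return NULL;
}

/* Returns whether a freshly started second thread was allowed to enter. */
static bool other_thread_can_enter(void)
{
    pthread_t thread;
    bool allowed = false;

    if (pthread_create(&thread, NULL, try_enter, &allowed) != 0)
        return false;
    pthread_join(thread, NULL);
    return allowed;
}
```

While the main thread is between checked_enter() and checked_exit(), nested calls from the same thread succeed and other_thread_can_enter() reports false, which is exactly the situation the assert in udev_enter is meant to catch.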

Wrapping functions

Now, the ugly part -- libudev exposes 93 different functions, with a wide variety of parameters and return types. I constructed a hacky macro, calls for which could be constructed pretty easily from the prototypes found in libudev.h, and which would construct our stub function:

#define make_func(type, name, formals, actuals)         \
    type name formals {                     \
    type ret;                       \
    static void *f;                     \
    if (!f)                         \
        f = udev_symbol(__func__);              \
    udev_enter(__func__);                   \
    ret = ((typeof (&name)) f) actuals;         \
    udev_exit();                        \
    return ret;                     \
    }

There are 93 invocations of this macro (or a variant for void functions) which look much like:

make_func(struct udev *,
      udev_ref,
      (struct udev *udev),
      (udev))
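
The variant for void functions isn't shown in the post. A plausible shape for it, as a sketch only, simply drops the return value. To keep the example compilable on its own, demo_symbol, demo_enter and demo_exit are hypothetical stand-ins for udev_symbol, udev_enter and udev_exit, and real_set_priority plays the part of the function in the real library.

```c
#include <stddef.h>
#include <string.h>

typedef void (*generic_fn)(void);

/* Hypothetical stand-ins so the sketch compiles without libudev. */
static int enter_count, exit_count, last_priority;

static void real_set_priority(int priority) { last_priority = priority; }

static generic_fn demo_symbol(const char *symbol)
{
    if (strcmp(symbol, "demo_set_priority") == 0)
        return (generic_fn) real_set_priority;
    return NULL;
}

static void demo_enter(const char *func) { (void) func; enter_count++; }
static void demo_exit(void) { exit_count++; }

/* Same structure as make_func, minus the 'ret' variable and the return. */
#define make_void_func(name, formals, actuals)          \
    void name formals {                                 \
        static generic_fn f;                            \
        if (!f)                                         \
            f = demo_symbol(__func__);                  \
        demo_enter(__func__);                           \
        ((__typeof__ (&name)) f) actuals;               \
        demo_exit();                                    \
    }

make_void_func(demo_set_priority,
      (int priority),
      (priority))
```

Calling demo_set_priority(3) looks up the real function once, runs the enter check, forwards the call and runs the exit bookkeeping.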

Using udevwrap

To use udevwrap, simply stick the filename of the .so in LD_PRELOAD and run your program normally:

# LD_PRELOAD=/usr/local/lib/libudevwrap.so Xorg 

Source code

I stuck udevwrap in my git repository:


You can clone it using

$ git clone git://keithp.com/git/udevwrap
Posted Mon 15 Aug 2016 11:32:30 PM PDT Tags: tags/fdo

X.org Election Time — Vote Now

It's more important than usual to actually get your vote in: we're asking the membership to vote on changes to the X.org bylaws that are necessary for X.org to become an SPI affiliate project, instead of continuing on as a separate organization. While I'm in favor of this transition as I think it will provide much needed legal and financial help, the real reason we need everyone to vote is that we need ⅔ of the membership to cast ballots for the vote to be valid. Last time, we didn't reach that threshold, so even though we had a majority voting in favor of the change, it didn't take effect. If you aren't in favor of this change, I'd still encourage you to vote as I'd like to get a valid result, no matter the outcome.

Of course, we're also electing four members to the board. I'm happy to note that there are five candidates running for the four available seats, which shows that there are enough people willing to help serve the X.org community in this fashion.

Posted Tue 12 Apr 2016 11:51:37 PM PDT Tags: tags/fdo

Multi-Stream Transport 4k Monitors and X

I'm sure you've seen a 4k monitor on a friend's desk running Mac OS X or Windows and are all ready to go get one so that you can use it under Linux.

Once you've managed to acquire one, I'm afraid you'll discover that when you plug it in, you're limited to 30Hz refresh rates at the full size, unless you're running a kernel that is version 3.17 or later. And then...

Good Grief! What Is My Computer Doing!

Ok, so now you're running version 3.17 and when X starts up, it's like you're using a gigantic version of Google Cardboard. Two copies of a very tall, but very narrow screen greet you.

Welcome to MST island.

In order to drive these giant new panels at full speed, there isn't enough bandwidth in the display hardware to individually paint each pixel once during each frame. So, like all good hardware engineers, they invented a clever hack.

This clever hack paints the screen in parallel. I'm assuming that they've got two bits of display hardware, each one hooked up to half of the monitor. Each one paints only half of the pixels, which avoids a costly redesign of expensive silicon; at least that's my surmise.

In the olden days, if you did this, you'd end up running two monitor cables to your computer, and potentially even having two video cards. Today, thanks to the magic of DisplayPort Multi-Stream Transport, we don't need all of that; instead, MST allows us to pack multiple cables-worth of data into a single cable.

I doubt the inventors of MST intended it to be used to split a single LCD panel into multiple "monitors", but hardware engineers are clever folk and are more than capable of abusing standards like this when it serves to save a buck.

Turning Two Back Into One

We've got lots of APIs that expose monitor information in the system, and across which we might be able to wave our magic abstraction wand to fix this:

  1. The KMS API. This is the kernel interface which is used by all graphics stuff, including user-space applications and the frame buffer console. Solve the problem here and it works everywhere automatically.

  2. The libdrm API. This is just the KMS ioctls wrapped in a simple C library. Fixing things here wouldn't make fbcons work, but would at least get all of the window systems working.

  3. Every 2D X driver. (Yeah, we're trying to replace all of these with the one true X driver). Fixing the problem here would mean that all X desktops would work. However, that's a lot of code to hack, so we'll skip this.

  4. The X server RandR code. More plausible than fixing every driver, this also makes X desktops work.

  5. The RandR library. If not in the X server itself, how about over in user space in the RandR protocol library? Well, the problem here is that we've now got two of them (Xlib and xcb), and the xcb one is auto-generated from the protocol descriptions. Not plausible.

  6. The Xinerama code in the X server. Xinerama is how we did multi-monitor stuff before RandR existed. These days, RandR provides Xinerama emulation, but we've been telling people to switch to RandR directly.

  7. Some new API. Awesome. Ok, so if we haven't fixed this in any existing API we control (kernel/libdrm/X.org), then we effectively dump the problem into the laps of the desktop and application developers. Given how long it's taken them to adopt current RandR stuff, providing yet another complication in their lives won't make them very happy.

All Our APIs Suck

Dave Airlie merged MST support into the kernel for version 3.17 in the simplest possible fashion -- pushing the problem out to user space. I was initially vaguely tempted to go poke at it and try to fix things there, but he eventually convinced me that it just wasn't feasible.

It turns out that all of our fancy new modesetting APIs describe the hardware in more detail than any application actually cares about. In particular, we expose a huge array of hardware objects:

  • Subconnectors
  • Connectors
  • Outputs
  • Video modes
  • Crtcs
  • Encoders

Each of these objects exposes intimate details about the underlying hardware -- which of them can work together, and which cannot; what kinds of limits are there on data rates and formats; and pixel-level timing details about blanking periods and refresh rates.

To make things work, some piece of code needs to actually hook things up, and explain to the user why the configuration they want just isn't possible.

The sticking point we reached was that when an MST monitor gets plugged in, it needs two CRTCs to drive it. If one of those is already in use by some other output, there's just no way you can steal it for MST mode.

Another problem -- we expose EDID data and actual video mode timings. Our MST monitor has two EDID blocks, one for each half. They happen to describe how they're related, and how you should configure them, but if we want to hide that from the application, we'll have to pull those EDID blocks apart and construct a new one. The same goes for video modes; we'll have to construct ones for MST mode.

Every single one of our APIs exposes enough of this information to be dangerous.

Every one, except Xinerama. All it talks about is a list of rectangles, each of which represents a logical view into the desktop. Did I mention we've been encouraging people to stop using this? And that some of them listened to us? Foolishly?

Dave's Tiling Property

Dave hacked up the X server to parse the EDID strings and communicate the layout information to clients through an output property. Then he hacked up the gnome code to parse that property and build a RandR configuration that would work.

Then, he changed the RandR Xinerama code to also parse the TILE properties and to fix up the data seen by applications from that.

This works well enough to get a desktop running correctly, assuming that desktop uses Xinerama to fetch this data. Alas, gtk has been "fixed" to use RandR if you have RandR version 1.3 or later. No biscuit for us today.

Adding RandR Monitors

RandR doesn't have enough data types yet, so I decided that what we wanted to do was create another one; maybe that would solve this problem.

Ok, so what clients mostly want to know is which bits of the screen are going to be stuck together and should be treated as a single unit. With current RandR, that's some of the information included in a CRTC. You pull the pixel size out of the associated mode, physical size out of the associated outputs and the position from the CRTC itself.

Most of that information is available through Xinerama too; it's just missing physical sizes and any kind of labeling to help the user understand which monitor you're talking about.

The other problem with Xinerama is that it cannot be configured by clients; the existing RandR implementation constructs the Xinerama data directly from the RandR CRTC settings. Dave's Tiling property changes edit that data to reflect the union of associated monitors as a single Xinerama rectangle.

Allowing the Xinerama data to be configured by clients would fix our 4k MST monitor problem as well as solving the longstanding video wall, WiDi and VNC troubles. All of those want to create logical monitor areas within the screen under client control.

What I've done is create a new RandR datatype, the "Monitor", which defines a rectangular region of the screen. Each monitor has the following data:

  • Name. This provides some way to identify the Monitor to the user. I'm using X atoms for this as it made a bunch of things easier.

  • Primary boolean. This indicates whether the monitor is to be considered the "primary" monitor, suitable for placing toolbars and menus.

  • Pixel geometry (x, y, width, height). These locate the region within the screen and define the pixel size.

  • Physical geometry (width-in-millimeters, height-in-millimeters). These let the user know how big the pixels will appear in this region.

  • List of outputs. (I think this is the clever bit)

There are three requests to define, delete and list monitors. And that's it.
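
As a rough picture of that data, here is a hypothetical C structure holding the fields listed above, with integer widths chosen to match the protocol types in the randrproto text further down. The structure and the monitor_dpi_x helper are illustrative only; they are not the X server's or libXrandr's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory form of a Monitor; field names are illustrative. */
struct monitor {
    uint32_t        name;       /* X atom naming the monitor */
    bool            primary;    /* preferred spot for toolbars and menus */
    int16_t         x, y;       /* position within the screen, in pixels */
    uint16_t        width;      /* size in pixels */
    uint16_t        height;
    uint32_t        width_mm;   /* physical size in millimeters */
    uint32_t        height_mm;
    int             noutput;    /* number of entries in outputs[] */
    const uint32_t *outputs;    /* RandR output IDs */
};

/* Horizontal pixel density in dots per inch, or 0 if the size is unknown. */
static double monitor_dpi_x(const struct monitor *m)
{
    if (m->width_mm == 0)
        return 0.0;
    return m->width * 25.4 / m->width_mm;
}
```

For a 3840 pixel wide panel that is 600mm across, this works out to 3840 × 25.4 / 600 = 162.56 dots per inch, which is the kind of number a toolkit needs to pick a sensible font size.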

Now, we want the list of monitors to completely describe the environment, and yet we don't want existing tools to break completely. So, we need some way to automatically construct monitors from the existing RandR state while still letting the user override portions of it as needed to explain virtual or tiled outputs.

So, what I did was to let the client specify a list of outputs for each monitor. All of the CRTCs which aren't associated with an output in any client-defined monitor are then added to the list of monitors reported back to clients. That means that clients need only define monitors for things they understand, and they can leave the other bits alone and the server will do something sensible.
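
The rule for which CRTCs get an automatically generated monitor can be sketched as a small predicate. This is a simplified model with made-up names (output_list, crtc_needs_auto_monitor) and plain integer output IDs, not the server implementation.

```c
#include <stdbool.h>
#include <stddef.h>

struct output_list {
    const int *ids;     /* output IDs */
    size_t     count;
};

/* True if 'id' appears in 'list'. */
static bool list_has(const struct output_list *list, int id)
{
    for (size_t i = 0; i < list->count; i++)
        if (list->ids[i] == id)
            return true;
    return false;
}

/*
 * True if the CRTC driving 'crtc_outputs' should get a server-generated
 * monitor: none of its outputs belong to any client-defined monitor.
 */
static bool crtc_needs_auto_monitor(const struct output_list *crtc_outputs,
                                    const struct output_list *client_monitors,
                                    size_t nmonitors)
{
    for (size_t m = 0; m < nmonitors; m++)
        for (size_t o = 0; o < crtc_outputs->count; o++)
            if (list_has(&client_monitors[m], crtc_outputs->ids[o]))
                return false;
    return true;
}
```

With one client-defined monitor covering both halves of an MST panel, the two CRTCs driving those halves are skipped, while a CRTC driving an unrelated laptop panel still gets its own automatic monitor.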

The second tricky bit is that if you specify an empty rectangle at 0,0 for the pixel geometry, then the server will automatically compute the geometry using the list of outputs provided. That means that if any of those outputs get disabled or reconfigured, the Monitor associated with them will appear to change as well.
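
Computing that automatic geometry is a bounding-box calculation over the active outputs. A minimal sketch, assuming each output's area is available as a pixel rectangle and that a disabled output is represented by a zero-sized one (the rect type and bounding_box name are hypothetical):

```c
#include <stddef.h>

struct rect {
    int x, y;           /* top-left corner in screen pixels */
    int width, height;  /* zero width or height means "not active" */
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Bounding box of 'count' rectangles; inactive (empty) ones are skipped. */
static struct rect bounding_box(const struct rect *r, size_t count)
{
    struct rect box = { 0, 0, 0, 0 };
    int have_box = 0;

    for (size_t i = 0; i < count; i++) {
        if (r[i].width <= 0 || r[i].height <= 0)
            continue;               /* disabled output: nothing to cover */
        if (!have_box) {
            box = r[i];
            have_box = 1;
            continue;
        }
        /* Compute the far edges before moving the origin. */
        int right  = max_int(box.x + box.width,  r[i].x + r[i].width);
        int bottom = max_int(box.y + box.height, r[i].y + r[i].height);
        box.x = min_int(box.x, r[i].x);
        box.y = min_int(box.y, r[i].y);
        box.width  = right - box.x;
        box.height = bottom - box.y;
    }
    return box;
}
```

Two 1920x2160 tiles at (0,0) and (1920,0) combine into a single 3840x2160 rectangle at (0,0). If one tile is later disabled, the same call shrinks the monitor to the remaining half.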

Current Status

Gtk+ has been switched to use RandR for RandR versions 1.3 or later. Locally, I hacked libXrandr to override the RandR version through an environment variable, set that to 1.2 and Gtk+ happily reverts back to Xinerama and things work fine. I suspect the plan here will be to have it use the new Monitors when present as those provide the same info that it was pulling out of RandR's CRTCs.

KDE appears to still use Xinerama data for this, so it "just works".

Where's the code

As usual, all of the code for this is in a collection of git repositories in my home directory on fd.o:

git://people.freedesktop.org/~keithp/randrproto master
git://people.freedesktop.org/~keithp/libXrandr master
git://people.freedesktop.org/~keithp/xrandr master
git://people.freedesktop.org/~keithp/xserver randr-monitors

RandR protocol changes

Here are the new sections added to randrproto.txt


1.5. Introduction to version 1.5 of the extension

Version 1.5 adds monitors

 • A 'Monitor' is a rectangular subset of the screen which represents
   a coherent collection of pixels presented to the user.

 • Each Monitor is associated with a list of outputs (which may be
   empty).

 • When clients define monitors, the associated outputs are removed from
   existing Monitors. If removing the output causes the list for that
   monitor to become empty, that monitor will be deleted.

 • For active CRTCs that have no output associated with any
   client-defined Monitor, one server-defined monitor will
   automatically be defined of the first Output associated with them.

 • When defining a monitor, setting the geometry to all zeros will
   cause that monitor to dynamically track the bounding box of the
   active outputs associated with them

This new object separates the physical configuration of the hardware
from the logical subsets of the screen that applications should
consider as single viewable areas.

1.5.1. Relationship between Monitors and Xinerama

Xinerama's information now comes from the Monitors instead of directly
from the CRTCs. The Monitor marked as Primary will be listed first.


5.6. Protocol Types added in version 1.5 of the extension

    MONITORINFO { name: ATOM
                  primary: BOOL
                  automatic: BOOL
                  x: INT16
                  y: INT16
                  width: CARD16
                  height: CARD16
                  width-in-millimeters: CARD32
                  height-in-millimeters: CARD32
                  outputs: LISTofOUTPUT }


7.5. Extension Requests added in version 1.5 of the extension.

    RRGetMonitors
        window : WINDOW
      ▶
        timestamp: TIMESTAMP
        monitors: LISTofMONITORINFO
        Errors: Window

    Returns the list of Monitors for the screen containing
    'window'.

    'timestamp' indicates the server time when the list of
    monitors last changed.

    RRSetMonitor
        window : WINDOW
        info: MONITORINFO
        Errors: Window, Output, Atom, Value

    Create a new monitor. Any existing Monitor of the same name is deleted.

    'name' must be a valid atom or an Atom error results.

    'name' must not match the name of any Output on the screen, or
    a Value error results.

    If 'info.outputs' is non-empty, and if x, y, width, height are all
    zero, then the Monitor geometry will be dynamically defined to
    be the bounding box of the geometry of the active CRTCs
    associated with them.

    If 'name' matches an existing Monitor on the screen, the
    existing one will be deleted as if RRDeleteMonitor were called.

    For each output in 'info.outputs', each one is removed from all
    pre-existing Monitors. If removing the output causes the list of
    outputs for that Monitor to become empty, then that Monitor will
    be deleted as if RRDeleteMonitor were called.

    Only one monitor per screen may be primary. If 'info.primary'
    is true, then the primary value will be set to false on all
    other monitors on the screen.

    RRSetMonitor generates a ConfigureNotify event on the root
    window of the screen.

    RRDeleteMonitor
        window : WINDOW
        name: ATOM
        Errors: Window, Atom, Value

    Deletes the named Monitor.

    'name' must be a valid atom or an Atom error results.

    'name' must match the name of a Monitor on the screen, or a
    Value error results.

    RRDeleteMonitor generates a ConfigureNotify event on the root
    window of the screen.
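
The bookkeeping rules for RRSetMonitor are easier to follow as code. What follows is a toy model of those rules on a fixed-size table, not the server implementation: the types, the limits and the function names are all made up, atoms and outputs are plain integers, and geometry and event delivery are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MONITORS 8
#define MAX_OUTPUTS  4

struct monitor {
    unsigned name;                  /* stands in for the ATOM */
    bool     primary;
    int      noutput;
    unsigned outputs[MAX_OUTPUTS];
};

struct screen {
    int            nmonitor;
    struct monitor monitors[MAX_MONITORS];
};

/* Remove monitors[i], keeping the remaining entries in order. */
static void delete_at(struct screen *s, int i)
{
    memmove(&s->monitors[i], &s->monitors[i + 1],
            (size_t) (s->nmonitor - i - 1) * sizeof s->monitors[0]);
    s->nmonitor--;
}

/* Drop 'output' from 'm'; returns true if it was present. */
static bool drop_output(struct monitor *m, unsigned output)
{
    for (int o = 0; o < m->noutput; o++) {
        if (m->outputs[o] == output) {
            m->outputs[o] = m->outputs[--m->noutput];
            return true;
        }
    }
    return false;
}

/* Apply the rules above to 's'. Returns false if the table is full. */
static bool set_monitor(struct screen *s, const struct monitor *info)
{
    /* An existing monitor of the same name is deleted. */
    for (int i = 0; i < s->nmonitor; i++) {
        if (s->monitors[i].name == info->name) {
            delete_at(s, i);
            break;
        }
    }

    /* Claimed outputs leave their old monitors; emptied monitors go away. */
    for (int i = s->nmonitor - 1; i >= 0; i--) {
        struct monitor *m = &s->monitors[i];
        bool removed = false;

        for (int o = 0; o < info->noutput; o++)
            if (drop_output(m, info->outputs[o]))
                removed = true;
        if (removed && m->noutput == 0)
            delete_at(s, i);
    }

    if (s->nmonitor == MAX_MONITORS)
        return false;

    /* Only one monitor per screen may be primary. */
    if (info->primary)
        for (int i = 0; i < s->nmonitor; i++)
            s->monitors[i].primary = false;

    s->monitors[s->nmonitor++] = *info;
    return true;
}
```

Starting with one monitor per MST half and then setting a single monitor that lists both outputs leaves exactly one monitor in the table, which is the behaviour a desktop wants for a tiled 4k panel.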

Posted Wed 17 Dec 2014 01:36:43 AM PST Tags: tags/fdo

Present and Compositors

The current Present extension is pretty unfriendly to compositing managers, causing an extra frame of latency between the application's operation and the scanout buffer. Here's how I'm fixing that.

An extra frame of lag

When an application uses PresentPixmap, that operation is generally delayed until the next vblank interval. When using X without compositing, this ensures that the operation will get started in the vblank interval, and, if the rendering operation is quick enough, you'll get the frame presented without any tearing.

When using a compositing manager, the operation is still delayed until the vblank interval. That means that the CopyArea and subsequent Damage event generation don't occur until the display has already started the next frame. The compositing manager receives the damage event and constructs a new frame, but it also wants to avoid tearing, so that frame won't get displayed immediately. Instead, it'll get delayed until the next frame, introducing the lag.

Copy now, complete later

While away from the keyboard this morning, I had a sudden idea -- what if we performed the CopyArea and generated Damage right when the PresentPixmap request was executed, but delayed the PresentComplete event until vblank happened?

With the contents updated and damage delivered, the compositing manager can immediately start constructing a new scene for the upcoming frame. When that is complete, it can also use PresentPixmap (either directly or through OpenGL) to queue the screen update.

If it's fast enough, that will all happen before vblank and the application contents will actually appear at the desired time.

Now, at the appointed vblank time, the PresentComplete event will get delivered to the client, telling it that the operation has finished and that its contents are now on the screen. If the compositing manager was quick, this event won't even be a lie.

We'll be lying less often

Right now, the CopyArea, Damage and PresentComplete operations all happen after the vblank has passed. As the compositing manager delays the screen update until the next vblank, every single PresentComplete event will have the wrong UST/MSC values in it.

With the CopyArea happening immediately, we've a pretty good chance that the compositing manager will get the application contents up on the screen at the target time. When this happens, the PresentComplete event will have the correct values in it.
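
A toy timing model makes the difference concrete. It assumes vblanks fall at exact multiples of the frame period, that the compositor needs a fixed time to build its frame, and that a frame finished between two vblanks appears at the later one. All names and numbers here are illustrative and none of this is Present extension code.

```c
#include <stdint.h>

/*
 * Index (MSC) of the first vblank strictly after time t_us, with vblanks
 * at 0, period_us, 2 * period_us, ...
 */
static uint64_t vblank_after(uint64_t t_us, uint64_t period_us)
{
    return t_us / period_us + 1;
}

/*
 * Current behaviour: CopyArea and Damage wait for the target vblank, and
 * only then does the compositor start building its frame.
 */
static uint64_t shown_with_delayed_copy(uint64_t request_us,
                                        uint64_t compose_us,
                                        uint64_t period_us)
{
    uint64_t target = vblank_after(request_us, period_us);

    return vblank_after(target * period_us + compose_us, period_us);
}

/*
 * Proposed behaviour: the copy happens when the request executes, so the
 * compositor can start right away.
 */
static uint64_t shown_with_immediate_copy(uint64_t request_us,
                                          uint64_t compose_us,
                                          uint64_t period_us)
{
    return vblank_after(request_us + compose_us, period_us);
}
```

At 60Hz (a period of 16667µs), a request executed 2000µs into a frame with a compositor that needs 5000µs reaches the screen at vblank 1 with the immediate copy and at vblank 2 with the delayed copy. If the compositor needs 15000µs instead, both land on vblank 2, which is the case where PresentComplete still reports the wrong time.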

How can we do better?

The only way to do better is to have the PresentComplete event generated when the compositing manager displays the frame. I've talked about how that should work, but it's a bit twisty, and will require changes in the compositing manager to report the association between their PresentPixmap request and the applications' PresentPixmap requests.

Where's the code

I've got a set of three patches, two of which restructure the existing code without changing any behavior and a final patch which adds this improvement. Comments and review are encouraged, as always!

git://people.freedesktop.org/~keithp/xserver.git present-compositor
Posted Sat 13 Dec 2014 12:28:06 AM PST Tags: tags/fdo

Glamor Cleanup

Before I start really digging in to reworking the Render support in Glamor, I wanted to take a stab at cleaning up some cruft which has accumulated in Glamor over the years. Here's what I've done so far.

Get rid of the Intel fallback paths

I think it's my fault, and I'm sorry.

The original Intel Glamor code had Glamor implement accelerated operations using GL, and when those failed, the Intel driver would fall back to its existing code, either UXA acceleration or software. Note that it wasn't Glamor doing these fallbacks; instead, the Intel driver had a complete wrapper around every rendering API, calling special Glamor entry points which would return FALSE if GL couldn't accelerate the specified operation.

The thinking was that when GL couldn't do something, it would be far faster to take advantage of the existing UXA paths than to have Glamor fall back to pulling the bits out of GL, drawing to temporary images with software, and pushing the bits back to GL.

And, that may well be true, but what we've managed to prove is that there really aren't any interesting rendering paths which GL can't do directly. For core X, the only fallbacks we have today are for operations using a weird planemask, and some CopyPlane operations. For Render, essentially everything can be accelerated with the GPU.

At this point, the old Intel Glamor implementation is a lot of ugly code in Glamor without any use. I posted patches to the Intel driver several months ago which fix the Glamor bits there, but they haven't seen any review yet and so they haven't been merged, although I've been running them since 1.16 was released...

Getting rid of this support let me eliminate all of the _nf functions exported from Glamor, along with the GLAMOR_USE_SCREEN and GLAMOR_USE_PICTURE_SCREEN parameters, along with the GLAMOR_SEPARATE_TEXTURE pixmap type.

Force all pixmaps to have exact allocations

Glamor has a cache of recently used textures that it uses to avoid allocating and de-allocating GL textures rapidly. For pixmaps small enough to fit in a single texture, Glamor would use a cache texture that was larger than the pixmap.

I disabled this when I rewrote the Glamor rendering code for core X; that code used texture repeat modes for tiles and stipples; if the texture wasn't the same size as the pixmap, then texturing would fail.

On the Render side, Glamor would actually reallocate pixmaps used as repeating texture sources. I could have fixed up the core rendering code to use this, but I decided instead to just simplify things and eliminate the ability to use larger textures for pixmaps everywhere.

Remove redundant pixmap and screen private pointers

Every Glamor pixmap private structure had a pointer back to the pixmap it was allocated for, along with a pointer to the Glamor screen private structure for the related screen. There's no particularly good reason for this, other than making it possible to pass just the Glamor pixmap private around a lot of places. So, I removed those pointers and fixed up the functions to take the necessary extra or replaced parameters.

Similarly, every Glamor fbo had a pointer back to the Glamor screen private too; I removed that and now pass the Glamor screen private parameter as needed.

Reducing pixmap private complexity

Glamor had three separate kinds of pixmap private structures, one for 'normal' pixmaps (those allocated by themselves in a single FBO), one for 'large' pixmaps, where the pixmap was tiled across many FBOs, and a third for 'atlas' pixmaps, which presumably would be a single FBO holding multiple pixmaps.

The 'atlas' form was never actually implemented, so it was pretty easy to get rid of that.

For large vs normal pixmaps, the solution was to move the extra data needed by large pixmaps into the same structure as that used by normal pixmaps and simply initialize those elements correctly in all cases. Now, most code can ignore the difference and simply walk the array of FBOs as necessary.
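
To illustrate why one structure can serve both cases, here is a sketch in which every pixmap is described by a grid of FBOs and a 'normal' pixmap is just a 1x1 grid. The names and the assumption that tiles are cut at a fixed maximum texture size are mine; Glamor's real structures differ.

```c
/* Hypothetical layout of the FBOs backing one pixmap. */
struct fbo_grid {
    int tile_size;  /* largest texture dimension, in pixels */
    int columns;    /* FBOs across */
    int rows;       /* FBOs down */
};

/* Number of tiles needed to cover 'pixels' along one axis (rounds up). */
static int tiles_needed(int pixels, int tile_size)
{
    return (pixels + tile_size - 1) / tile_size;
}

/* Grid for a width x height pixmap; small pixmaps get a 1x1 grid. */
static struct fbo_grid fbo_grid_for(int width, int height, int tile_size)
{
    struct fbo_grid grid = {
        tile_size,
        tiles_needed(width, tile_size),
        tiles_needed(height, tile_size),
    };
    return grid;
}

/* Index into the row-major FBO array of the tile holding pixel (x, y). */
static int fbo_index(const struct fbo_grid *grid, int x, int y)
{
    return (y / grid->tile_size) * grid->columns + x / grid->tile_size;
}
```

A 10000x3000 pixmap with 4096 pixel tiles gets a 3x1 grid, a 100x100 pixmap gets a 1x1 grid, and the same loop over columns × rows handles both.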

The other thing I did was to shrink the number of possible pixmap types from eight down to three. Glamor now exposes just these possible pixmap types:

  • GLAMOR_MEMORY. This is a software-only pixmap, stored in regular memory and only drawn with software. This is used for 1bpp pixmaps, shared memory pixmaps and glyph pixmaps. Most of the time, these pixmaps won't even get a Glamor pixmap private structure allocated, but if you use one of these with the existing Render acceleration code, that will end up wanting a private pointer. I'm hoping to fix the code so we can just use a NULL private to indicate this kind of pixmap.

  • GLAMOR_TEXTURE. This is a full Glamor pixmap, capable of being used via either GL or software fallbacks.

  • GLAMOR_DRM_ONLY. This is a pixmap based on an FBO which was passed from the driver, and for which Glamor couldn't get the underlying DRM object. I think this is an error, but I don't quite understand what's going on here yet...

Future Work

  • Deal with X vs GL color formats
  • Finish my new CompositeGlyphs code
  • Create pure shader-based gradients
  • Rewrite Composite to use the GPU for more computation
  • Take another stab at doing GPU-accelerated trapezoids
Posted Thu 30 Oct 2014 12:51:39 AM PDT Tags: tags/fdo

Chromium (the browser) and DRI3

I got a note on IRC a week ago that Chromium was crashing with DRI3.

The Google team working on Chromium eventually sent me a link to the bug report. That's secret Google stuff, so you won't be able to follow the link, even though it's a bug in a free software application when running on free software drivers.

There's a bug report in the freedesktop bugzilla which looks the same to me.

In both cases, the recommended “fix” was to switch from DRI3 back to DRI2. That's not exactly a great plan, given that DRI3 offers better security between GPU-using applications, which seems like a pretty nice thing to have when you're running random GL applications from the web.

Chromium Sandboxing

I'm not entirely sure how it works, but Chromium creates a process separate from the main browser engine to talk to the GPU. That process has very limited access to the operating system via some fancy library adventures. Presumably, the hope is that security bugs in the GL driver would be harder to leverage into a remote system exploit.

Debugging in this environment is a bit tricky as you can't simply run chromium under gdb and expect to be able to set breakpoints in the GL driver. Instead, you have to run chromium with a magic flag which causes the GPU process to pause before loading the driver so you can connect to it with gdb and debug from there. You'll also want a flag that lets you see crashes within the GPU process, and the usual flag that causes chromium to ignore the GPU blacklist, which seems to always include the Intel driver for one reason or another:

$ chromium --gpu-startup-dialog --disable-gpu-watchdog --ignore-gpu-blacklist

Once Chromium starts up, it will print out a message telling you to attach gdb to the GPU process and send that process a SIGUSR1 to continue it. Now you can happily debug and get a stack trace when the crash occurs.

Locating the Bug

The bug manifested with a segfault at the first access to a DRI3-allocated buffer within the application. We've seen this problem in the past; whenever buffer allocation fails for some reason, the driver ignores the problem and attempts to de-reference through the (NULL) buffer pointer, causing a segfault. In this case, Chromium called glClear, which tried (and failed) to allocate a back buffer causing the i965 driver to subsequently segfault.

We should probably go fix the i965 driver to not segfault when buffer allocation fails, but that wouldn't provide a lot of additional information. What I have done is add some error messages in the DRI3 buffer allocation path which at least tell you why the buffer allocation failed. That patch has been merged to Mesa master, and should also get merged to the Mesa stable branch for the next stable release.

Once I had added the error messages, it was pretty easy to see what happened:

$ chromium --ignore-gpu-blacklist
[10618:10643:0930/200525:ERROR:nss_util.cc(856)] After loading Root Certs, loaded==false: NSS error code: -8018
libGL: pci id for fd 12: 8086:0a16, driver i965
libGL: OpenDriver: trying /local-miki/src/mesa/mesa/lib/i965_dri.so
libGL: Can't open configuration file /home/keithp/.drirc: Operation not permitted.
libGL: Can't open configuration file /home/keithp/.drirc: Operation not permitted.
libGL error: DRI3 Fence object allocation failure Operation not permitted

The first two errors were just the sandbox preventing Mesa from using my GL configuration file. I'm not sure how that's a security problem, but it shouldn't harm the driver much.

The last error is where the problem lies. In Mesa, the DRI3 implementation uses a chunk of shared memory to hold a fence object that lets Mesa know when buffers are idle without using the X connection. That shared memory segment is allocated by creating a temporary file using the O_TMPFILE flag:

fd = open("/dev/shm", O_TMPFILE|O_RDWR|O_CLOEXEC|O_EXCL, 0666);

This call “cannot fail” as /dev/shm is used by glibc for shared memory objects, and must therefore be world writable on any glibc system. However, with the Chromium sandbox enabled, it returns EPERM.
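
For reference, here is a small sketch of that allocation with the failure reported instead of ignored. It is not the Mesa code: alloc_shm_fd is a hypothetical helper, the message text is made up, and O_TMPFILE is Linux-specific and needs _GNU_SOURCE with glibc.

```c
#define _GNU_SOURCE         /* O_TMPFILE is a Linux extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Create an unnamed file of 'size' bytes in 'dir' for use as shared
 * memory. Returns a file descriptor, or -1 with errno set after
 * reporting the reason on stderr.
 */
static int alloc_shm_fd(const char *dir, off_t size)
{
    int fd, saved;

    assert(dir != NULL && size > 0);

    fd = open(dir, O_TMPFILE | O_RDWR | O_CLOEXEC | O_EXCL, 0666);
    if (fd < 0) {
        saved = errno;
        fprintf(stderr, "shared memory allocation in %s failed: %s\n",
                dir, strerror(saved));
        errno = saved;      /* fprintf may have changed errno */
        return -1;
    }
    if (ftruncate(fd, size) < 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

A caller would pass "/dev/shm" and then mmap the returned descriptor. Inside a sandbox that filters open(2), this is the call that comes back with -1 and EPERM.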

Running Without a Sandbox

Now that the bug appears to be in the sandboxing code, we can re-test with the GPU sandbox disabled:

$ chromium --ignore-gpu-blacklist --disable-gpu-sandbox

And, indeed, without the sandbox getting in the way of allocating a shared memory segment, Chromium appears happy to use the Intel driver with DRI3.

Final Thoughts

I looked briefly at the Chromium sandbox code. It looks like it needs to know intimate details of the OpenGL implementation for every possible driver it runs on; it seems to contain a fixed list of all possible files and modes that the driver will pass to open(2). That seems incredibly fragile to me, especially when used in a general Linux desktop environment. Minor changes in how the GL driver operates can easily cause the browser to stop working.

Posted Tue 30 Sep 2014 11:51:25 PM PDT Tags: tags/fdo

A Forest of X Server Changes

We've got about another month left in the X server merge window for 1.17 and I've written a small set of fixes which haven't been reviewed yet for merging. I thought I'd advertise them a bit and see if I couldn't encourage a few of you to take a look and see if they're useful, correct and complete.

All of these are in my personal X server repository:


Cleaning up the X Registry

Branch: registry-fixes

I'll bet most of you don't even know about this code. It serves as a database mapping various X enumerations to strings to aid in diagnostics. For the security extensions, SECURITY and XSELinux, it holds names for all of the requests, events and errors in the core protocol and all registered extensions. For X-Resource, it has the names of the registered resource types.

The X registry gets the request, event and error data from a file, "protocol.txt", which is installed in /usr/lib/xorg/protocol.txt on my machine. It gets the resource names as a part of resource type allocation.

So, what's wrong with this? Three basic things:

  1. A simple bug -- protocol.txt is left open while the server runs. This consumes a file descriptor for no good reason.

  2. protocol.txt is read and parsed even if the security extensions aren't available. This wastes time and memory.

  3. The resource names are kept even if X-Resource isn't in use.

The fixes remove the configure options for including the registry code; these functions are only used by the above extensions, so we can tell whether to include the code based solely on whether the extensions are being built.

Getting rid of the TCP listener by default

Branch: listen-fixes

We've had the '-nolisten' option for a while now to disable inbound TCP connections. It's useful for security reasons, but we've never enabled this by default. This patch sequence provides configure options for each of the listen sockets (tcp, unix and local), leaves unix and local enabled by default and disables tcp by default.

A new option, '-listen', is added which allows the user to override the -nolisten defaults in case they actually want to use TCP connections to X.
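
The default handling described above might be sketched as follows. Only the '-listen' and '-nolisten' option names and the tcp, unix and local transports come from the text; the struct, the function names and the exact argument handling are illustrative assumptions, not the X server's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical table of listen transports; the defaults mirror the
 * behaviour described in the text: tcp off, unix and local on. */
struct listen_transport {
    const char *name;
    bool enabled;
};

#define NUM_TRANSPORTS 3

static void listen_defaults(struct listen_transport t[NUM_TRANSPORTS])
{
    t[0] = (struct listen_transport) { "tcp", false };
    t[1] = (struct listen_transport) { "unix", true };
    t[2] = (struct listen_transport) { "local", true };
}

/* Set one transport by name; returns false for an unknown name. */
static bool listen_set(struct listen_transport t[NUM_TRANSPORTS],
                       const char *name, bool enabled)
{
    for (int i = 0; i < NUM_TRANSPORTS; i++) {
        if (strcmp(t[i].name, name) == 0) {
            t[i].enabled = enabled;
            return true;
        }
    }
    return false;
}

/* Apply "-listen <name>" and "-nolisten <name>" pairs from argv.
 * Later options override earlier ones and the built-in defaults. */
static bool listen_parse(struct listen_transport t[NUM_TRANSPORTS],
                         int argc, const char *const *argv)
{
    assert(argc >= 0);
    for (int i = 0; i + 1 < argc; i++) {
        bool on;

        if (strcmp(argv[i], "-listen") == 0)
            on = true;
        else if (strcmp(argv[i], "-nolisten") == 0)
            on = false;
        else
            continue;
        if (!listen_set(t, argv[++i], on))
            return false;
    }
    return true;
}
```

Starting from the defaults, "-listen tcp" turns the TCP listener back on, and an unknown transport name is rejected rather than silently ignored.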

Glamor bug fixes

branch: glamor-fixes

This branch fixes two bugs:

  1. Scale a large pixmap down to a small pixmap. This happens when you display enormous images in a web page. Iceweasel sends the whole huge image to X and uses Render to scale it to the screen. If the image is larger than a single texture, the X server splits it up into tiles, but the code which tries to perform the merged scale is just broken. Five patches fix this.

  2. Shader-based trapezoids. This code uses area coverage to compute trapezoids. That violates the Render spec, which requires point sampling. Further, the performance of these trapezoids is lower than software (by a lot). This one patch removes the code.

Present bug fixes

branch: present-fixes

A selection of small bug fixes:

  1. Clear pending flips at CloseScreen. This removes a reference to any pending flip pixmap, allowing it to be freed. Otherwise, we'll leak memory across server reset.

  2. Add support for PresentOptionCopy. This has been in the protocol spec for a while, and was completely trivial to implement. However, it never got done. One tiny little patch.

  3. Expose the Present API to drivers via sdksyms.sh. Until now, the present extension APIs have only been available inside the X server. This exposes them to drivers. This took a few cleanup patches first.

Use Present for Glamor XV

branch: glamor-present-xv

Painting XV to the screen should be done at vblank time to avoid tearing. Present offers vblank synchronized operations. Hooking those two together required a few new present APIs to expose the vblank functionality outside of the present code, then a bit of glamor code to hook up that new API to the XV bits.

Switching Glamor to a GL core profile context

branch: glamor-core-profile

This patch set is still in progress, but demonstrates how close we are. We'll be requiring OpenGL 3.3 for this so that we get texture swizzling, which is required for our single channel objects.

The changes present on the branch are:

  1. Switch single channel surfaces from GL_ALPHA to GL_RED.

  2. Use vertex array objects.

  3. Switch ephyr over to using a core 3.3 profile.
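
To see why the GL_ALPHA to GL_RED switch depends on texture swizzling, here is a small CPU model of what the sampler does. A single-channel GL_RED texture samples as (r, 0, 0, 1), while the old GL_ALPHA format sampled as (0, 0, 0, a); a swizzle that routes red into alpha makes the two look alike to the shader. The enum, struct and function names are made up for this sketch, and the particular mask is an assumption about what one would want, not a copy of what glamor sets.

```c
#include <assert.h>

/* CPU model of GL texture swizzling (illustrative only; these names
 * are hypothetical, not GL's). */
enum swz { SWZ_RED, SWZ_GREEN, SWZ_BLUE, SWZ_ALPHA, SWZ_ZERO, SWZ_ONE };

struct texel { float r, g, b, a; };

static float swz_pick(struct texel t, enum swz s)
{
    switch (s) {
    case SWZ_RED:   return t.r;
    case SWZ_GREEN: return t.g;
    case SWZ_BLUE:  return t.b;
    case SWZ_ALPHA: return t.a;
    case SWZ_ZERO:  return 0.0f;
    case SWZ_ONE:   return 1.0f;
    }
    assert(!"unknown swizzle");
    return 0.0f;
}

/* A single-channel GL_RED texture samples as (r, 0, 0, 1). */
static struct texel sample_red(float value)
{
    struct texel t = { value, 0.0f, 0.0f, 1.0f };
    return t;
}

/* Apply a per-channel swizzle to a sampled texel. */
static struct texel swizzle(struct texel t, const enum swz mask[4])
{
    struct texel out = {
        swz_pick(t, mask[0]), swz_pick(t, mask[1]),
        swz_pick(t, mask[2]), swz_pick(t, mask[3]),
    };
    return out;
}

/* Mask which makes a GL_RED texture look like the old GL_ALPHA
 * format, which sampled as (0, 0, 0, a). */
static const enum swz alpha_from_red[4] = {
    SWZ_ZERO, SWZ_ZERO, SWZ_ZERO, SWZ_RED
};
```

In real GL the per-channel selection is set with glTexParameteri, using parameters such as GL_TEXTURE_SWIZZLE_A and values such as GL_RED or GL_ZERO.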

Still left to do is

  1. Switch Render code to VBOs

The core code uses VBOs everywhere, but the Render code doesn't. This means that all Render drawing fails, which makes the resulting server not very useful.

My main objective for getting this done is to reduce memory usage by about 16MB, which is the space allocated for software rendering in Mesa in case someone does something which the hardware doesn't handle, and that can only happen with some legacy OpenGL APIs.

Please help out!

All of these friendly little patches are looking for a bit of review so that they can get merged before the 1.17 window closes.

Posted Mon 15 Sep 2014 10:14:16 AM PDT Tags: tags/fdo

Reworking Intel Glamor

The original Intel driver Glamor support was based on the notion that it would be better to have the Intel driver capture any fall backs and try to make them faster than Glamor could do internally. Now that Glamor has reasonably complete acceleration, and its fall backs aren't terrible, this isn't as useful as it once was, and because this uses Glamor in a weird way, we're making the Glamor code harder to maintain.

Fixing the Intel driver to not use Glamor in this way took a bit of effort; the UXA support is all tied into the overall operation of the driver.

Separating out UXA functions

The first task was to just identify which functions were UXA-specific by adding "_uxa" to their names. A couple dozen sed runs and now a bunch of the driver is looking better.

Next, a pile of UXA-specific functions were actually inside the non-UXA parts of the code. Those got moved out, and a new 'intel_uxa.h' file was created to hold all of the definitions.

Finally, a few non UXA-specific functions were actually in the uxa files; those got moved over to the generic code.

Removing the Glamor paths in UXA

Each one of the UXA functions had a little piece of code at the top like:

if (uxa_screen->info->flags & UXA_USE_GLAMOR) {
    int ok = 0;

    if (uxa_prepare_access(pDrawable, UXA_GLAMOR_ACCESS_RW)) {
        ok = glamor_fill_spans_nf(pDrawable,
                      pGC, n, ppt, pwidth, fSorted);
        uxa_finish_access(pDrawable, UXA_GLAMOR_ACCESS_RW);
    }

    if (!ok)
        goto fallback;
}


Pulling those out shrank the UXA code by quite a bit.

Selecting Acceleration (or not)

The intel driver only supported UXA before; Glamor was really just a slightly different mode for UXA. I switched the driver from using a bit in the UXA flags to having an 'accel' variable which could be one of three options: ACCEL_GLAMOR, ACCEL_UXA or ACCEL_NONE.


I added ACCEL_NONE to give us a dumb frame buffer mode. That actually supports DRI3 so that we can bring up Mesa and run it under X before we have any acceleration code ready; avoiding a dependency loop when doing new hardware. All that it requires is a kernel that offers mode setting and buffer allocation.

Initializing Glamor

With UXA no longer supporting Glamor, it was time to plug the Glamor support into the top of the driver. That meant changing a bunch of the entry points to select appropriate Glamor or UXA functionality, instead of just calling into UXA. So, now we've got lots of places that look like:

        switch (intel->accel) {
        case ACCEL_GLAMOR:
                if (!intel_glamor_create_screen_resources(screen))
                        return FALSE;
                break;
        case ACCEL_UXA:
                if (!intel_uxa_create_screen_resources(screen))
                        return FALSE;
                break;
        case ACCEL_NONE:
                if (!intel_none_create_screen_resources(screen))
                        return FALSE;
                break;
        }

Using a switch means that we can easily elide code that isn't wanted in a particular build. Of course 'accel' is an enum, so places which are missing one of the required paths will cause a compiler warning.
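
Here is a minimal, self-contained sketch of that pattern. The enum values match the ones used in the driver; the stand-in functions and the wrapper are hypothetical. The important detail is that the switch has no 'default:' label, so GCC's -Wall (which includes -Wswitch) reports any enumerator that a switch forgets to handle.

```c
#include <assert.h>
#include <stdbool.h>

enum intel_accel { ACCEL_NONE, ACCEL_GLAMOR, ACCEL_UXA };

/* Stand-ins for the real per-backend entry points. */
static bool none_create_screen_resources(void)   { return true; }
static bool glamor_create_screen_resources(void) { return true; }
static bool uxa_create_screen_resources(void)    { return true; }

static bool create_screen_resources(enum intel_accel accel)
{
    /* No 'default:' label on purpose: with -Wswitch enabled, leaving
     * out one of the enumerators produces a compile-time warning. */
    switch (accel) {
    case ACCEL_GLAMOR:
        return glamor_create_screen_resources();
    case ACCEL_UXA:
        return uxa_create_screen_resources();
    case ACCEL_NONE:
        return none_create_screen_resources();
    }
    /* Only reached for a value outside the enum. */
    assert(!"unknown accel value");
    return false;
}
```

Wrapping an individual case in a build-time #if is then enough to drop a backend from a particular build without touching the other paths.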

It's not all perfectly clean yet; there are piles of UXA-only paths still.

Making It Build Without UXA

The final trick was to make the driver build without UXA turned on; that took several iterations before I had the symbols sorted out appropriately.

I built the driver with various acceleration options and then tried to count the lines of source code. What I did was just list the source files named in the driver binary itself. This skips all of the header files and the render program source code, and ignores the fact that there are a bunch of #ifdef's in the uxa directory selecting between uxa, glamor and none.

    Accel                    Lines          Size(B)
    -----------             ------          -------
    none                      7143            73039
    glamor                    7397            76540
    uxa                      25979           283777
    sna                     118832          1303904

    none legacy              14449           152480
    glamor legacy            14703           156125
    uxa legacy               33285           350685
    sna legacy              126138          1395231

The 'legacy' addition supports i810-class hardware, which is needed for a complete driver.

Along The Way, Enable Tiling for the Front Buffer

While hacking the code, I discovered that the initial frame buffer allocated for the screen was created without tiling (!) because a few parameters that depend on the GTT size were not initialized until after that frame buffer was allocated. I haven't analyzed what effect this has on performance.

Page Flipping and Resize

Page flipping (or just flipping) means switching the entire display from one frame buffer to another. It's generally the fastest way of updating the screen as you don't have to copy any bits.

The trick with flipping is that a client hands you a random pixmap and you need to stuff that into the KMS API. With UXA, that's pretty easy as all pixmaps are managed through the UXA API which knows which underlying kernel BO is tied with each pixmap. Using Glamor, only the underlying GL driver knows the mapping. Fortunately (?), we have the EGL Image extension, which lets us take a random GL texture and turn it into a file descriptor for a DMA-BUF kernel object. So, we have this cute little dance:

fd = glamor_fd_from_pixmap(screen,

bo = drm_intel_bo_gem_create_from_prime(intel->bufmgr, fd, size);
    intel_glamor_get_pixmap(pixmap)->bo = bo;

That last bit remembers the bo in some local memory so we don't have to do this more than once for each pixmap. glamor_fd_from_pixmap ends up calling eglCreateImageKHR followed by gbm_bo_import and then a kernel ioctl to convert a prime handle into an fd. It's all quite round-about, but it does seem to work just fine.
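
The "remember the bo" part is plain memoization. Here is a sketch with made-up types; 'struct bo', 'struct pixmap_priv' and 'import_bo_for_pixmap' are placeholders for the driver's real structures and for the export-to-fd and import-from-prime sequence, which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types. */
struct bo { int handle; };
struct pixmap_priv { struct bo *bo; };

static int import_count;          /* counts the expensive imports */
static struct bo imported_bo;

/* Placeholder for the export-to-fd / import-from-prime dance. */
static struct bo *import_bo_for_pixmap(void)
{
    import_count++;
    imported_bo.handle = 1;
    return &imported_bo;
}

/* Return the bo for a pixmap, importing it only on first use. */
static struct bo *pixmap_get_bo(struct pixmap_priv *priv)
{
    assert(priv != NULL);
    if (priv->bo == NULL)
        priv->bo = import_bo_for_pixmap();
    return priv->bo;
}
```

Every flip after the first for a given pixmap then costs a pointer load instead of a trip through EGL, GBM and the kernel.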

After I'd gotten Glamor mostly working, I tried a few OpenGL applications and discovered flipping wasn't working. That turned out to have an unexpected consequence -- all full-screen applications would run flat-out, and not be limited to frame rate. Present 'recovers' from a failed flip queue operation by immediately performing a CopyArea, without waiting for vblank. This needs to get fixed in Present by having it re-queue the CopyArea for the right time. What I did in the intel driver was to add a bunch more checks for tiling mode, pixmap stride and other things to catch pixmaps that were going to fail before the operation was queued, forcing them to fall back to CopyArea at the right time.
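
The shape of such an up-front check might look like the sketch below. The text only says that tiling mode, pixmap stride "and other things" are tested, so the struct, its fields and the rule that everything must match the current front buffer are assumptions made for illustration, not the driver's actual conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a buffer as far as flipping cares. */
struct fb_desc {
    uint32_t width, height;
    uint32_t stride;   /* bytes per row */
    int tiling;        /* opaque tiling mode identifier */
};

/* True when 'pix' could replace 'front' as the scanout buffer.
 * Answering this before queuing lets the caller pick the vblank
 * synchronized CopyArea path instead of a flip that would fail. */
static bool can_flip(const struct fb_desc *front, const struct fb_desc *pix)
{
    assert(front && pix);
    return pix->width == front->width &&
           pix->height == front->height &&
           pix->stride == front->stride &&
           pix->tiling == front->tiling;
}
```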

The second adventure was with XRandR. Glamor has an API to fix up the screen pixmap for a new frame buffer, but that pulls the size of the frame buffer out of the pixmap instead of out of the screen. XRandR leaves the pixmap size set to the old screen size during this call; fixing that just meant getting the pixmap size set correctly before calling into glamor. I think glamor should get fixed to use the screen size rather than the pixmap size.

Painting Root before Mode set

The X server has generally done initialization in one order:

  1. Create root pixmap
  2. Set video modes
  3. Paint root window

Recently, we've added a '-background none' option to the X server which causes it to set the root window background to none and have the driver fill in that pixmap with whatever contents were on the screen before the X server started.

In a pre-Glamor world, that was done by hacking the video driver to copy the frame buffer console contents to the root pixmap as it was created. The trouble here is that the root pixmap is created long before the upper layers of the X server are ready for drawing, so you can't use the core rendering paths. Instead, UXA had kludges to call directly into the acceleration functions.

What we really want though is to change the order of operations:

  1. Create root pixmap
  2. Paint root window
  3. Set video mode

That way, the normal root window painting operation will take care of getting the image ready before that pixmap is ever used for scanout. I can use regular core X rendering to get the original frame buffer contents into the root window, and even if we're not using -background none and are instead painting the root with some other pattern (like the root weave), I get that presented without an intervening black flash.

That turned out to be really easy -- just delay the call to I830EnterVT (which sets the modes) until the server is actually running. That required one additional kludge -- I needed to tell the DIX level RandR functions about the new modes; the mode setting operation used during server init doesn't call up into RandR as RandR lists the current configuration after the screen has been initialized, which is when the modes used to be set.

Calling xf86RandR12CreateScreenResources does the trick nicely. Getting the root window bits from fbcon, setting video modes and updating the RandR/Xinerama DIX info is now all done from the BlockHandler the first time it is called.
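
The "first time it is called" part is a one-shot guard. A minimal model of it follows; the struct and the step names are invented for the example, and the real work (copying from fbcon, I830EnterVT, xf86RandR12CreateScreenResources) is replaced by recording the order in which the steps run.

```c
#include <assert.h>
#include <stdbool.h>

enum { STEP_COPY_FBCON = 1, STEP_SET_MODES, STEP_UPDATE_RANDR };

/* Hypothetical per-screen state. */
struct screen_state {
    bool started;      /* deferred start-up already done? */
    int steps[3];      /* order in which the steps ran */
    int nsteps;
};

static void record(struct screen_state *s, int step)
{
    assert(s->nsteps < 3);
    s->steps[s->nsteps++] = step;
}

/* Called each time the server is about to block waiting for input;
 * does real work only on the first call. */
static void block_handler(struct screen_state *s)
{
    if (s->started)
        return;
    s->started = true;
    record(s, STEP_COPY_FBCON);    /* root window bits from fbcon */
    record(s, STEP_SET_MODES);     /* then set the video modes */
    record(s, STEP_UPDATE_RANDR);  /* then tell RandR about them */
}
```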


I ran the current glamor version of the intel driver with the master branch of the X server and there were not any huge differences since my last Glamor performance evaluation aside from GetImage. The reason is that UXA/Glamor never called Glamor's image functions, and the UXA GetImage is pretty slow. Using Mesa's image download turns out to have a huge performance benefit:

1. UXA/Glamor from April
2. Glamor from today

       1                 2                 Operation
------------   -------------------------   -------------------------
     50700.0        56300.0 (     1.110)   ShmGetImage 10x10 square 
     12600.0        26200.0 (     2.079)   ShmGetImage 100x100 square 
      1840.0         4250.0 (     2.310)   ShmGetImage 500x500 square 
      3290.0          202.0 (     0.061)   ShmGetImage XY 10x10 square 
        36.5          170.0 (     4.658)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     49800.0        50200.0 (     1.008)   GetImage 10x10 square 
      5690.0        19300.0 (     3.392)   GetImage 100x100 square 
       609.0         1360.0 (     2.233)   GetImage 500x500 square 
      3100.0          206.0 (     0.066)   GetImage XY 10x10 square 
        36.4          183.0 (     5.027)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square 

Running UXA from today the situation is even more dire; I suspect that enabling tiling has made CPU reads through the GTT even worse than before?

1: UXA today
2: Glamor today

       1                 2                 Operation
------------   -------------------------   -------------------------
     43200.0        56300.0 (     1.303)   ShmGetImage 10x10 square 
      2600.0        26200.0 (    10.077)   ShmGetImage 100x100 square 
       130.0         4250.0 (    32.692)   ShmGetImage 500x500 square 
      3260.0          202.0 (     0.062)   ShmGetImage XY 10x10 square 
        36.7          170.0 (     4.632)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     41700.0        50200.0 (     1.204)   GetImage 10x10 square 
      2520.0        19300.0 (     7.659)   GetImage 100x100 square 
       125.0         1360.0 (    10.880)   GetImage 500x500 square 
      3150.0          206.0 (     0.065)   GetImage XY 10x10 square 
        36.1          183.0 (     5.069)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square 

Of course, this is all just x11perf, which doesn't represent real applications at all well. However, there are applications which end up doing more GetImage than would seem reasonable, and it's nice to have this kind of speed up.


I'm running this on my crash box to get some performance numbers and continue testing it. I'll switch my desktop over when I feel a bit more comfortable with how it's working. But, I think it's feature complete at this point.

Where's the Code

As usual, the code is in my personal repository. It's on the 'glamor' branch.

git://people.freedesktop.org/~keithp/xf86-video-intel  glamor

Posted Mon 21 Jul 2014 12:39:04 AM PDT Tags: tags/fdo

Current Glamor Performance

I finally managed to get a Gigabyte Brix set up running Debian so that I could do some more reasonable performance characterization of Glamor in its current state. I wanted to use this particular machine because it has enough cooling to keep from thermally throttling the CPU/GPU package.

This is running my glamor-server branch of the X server, which completes the core operation rework and then has some core X server performance improvements as well for filled and outlined arcs.

Changes in X11perf

First off, I did some analysis of what x11perf was doing and found that it wasn't quite measuring what we thought. I'm not interested in competing on x11perf numbers absolutely, I'm only interested in doing relative measurements of useful operations, so fixing the tool to measure what I want seems reasonable to me.

When x11perf was first written, it drew 100x100 rectangles tight against one another without any gap. And, it filled the window with them, drawing a 6x6 grid of 100x100 rectangles in a 600x600 window. To exercise the rectangle code and check edge conditions better, we added a one pixel gap between the rectangles. However, we didn't reduce the number of rectangles drawn, so we ended up drawing 11 of the 36 rectangles on top of the first set of 25. Simple region computations would allow the X server to draw only 25 most of the time, skipping the redundant rectangles.
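
The 25 and 11 fall out of simple arithmetic, sketched here. The function names are made up, and the model assumes the rectangles that no longer fit wrap around onto the ones already drawn, which is my reading of the description rather than a copy of x11perf's layout code.

```c
#include <assert.h>

/* Rectangles of 'size' pixels, separated by 'gap' pixels, that fit
 * along one side of a window 'window' pixels wide. The last
 * rectangle does not need a trailing gap. */
static int rects_per_side(int window, int size, int gap)
{
    assert(size > 0 && gap >= 0);
    return (window + gap) / (size + gap);
}

/* How many of 'requested' rectangles land on top of earlier ones,
 * assuming the extras wrap around to the start of the window. */
static int overlapping_rects(int window, int size, int gap, int requested)
{
    int per_side = rects_per_side(window, size, gap);
    int distinct = per_side * per_side;

    return requested > distinct ? requested - distinct : 0;
}
```

With no gap, six 100 pixel rectangles fit across 600 pixels. With a one pixel gap only five fit, since six would need 605 pixels, so 5 x 5 = 25 are distinct and the remaining 36 - 25 = 11 overlap.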

The vertical and horizontal line tests were added a while after the first set of tests, and were done without regard to how an X server might optimize for them. x11perf draws these lines packed tightly together, creating a single square of pixels for the result.

EXA, UXA and SNA all take vertical and horizontal lines and convert them to rectangles, then take the rectangles and clip them against the window clip list by computing a region from them and intersecting that with the GC composite clip. It's a completely reasonable plan; however, when you take what x11perf was drawing and run it through this code, you end up with a single solid rectangle. Which is surprisingly fast, compared with drawing individual lines.

I "fixed" the overlapping rectangle case by reducing the number of boxes drawn from 36 to 25, and I fixed the vertical and horizontal line case by spacing the lines a pixel apart.
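
A toy version of the region merge shows both effects. This is not the X server's region code, which handles arbitrary overlap; it only merges horizontally adjacent boxes of equal height, which is enough to model the packed vertical line test. The names are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct box { int x1, y1, x2, y2; };   /* x2, y2 are exclusive */

/* A zero-width vertical line from (x, y1) to (x, y2) inclusive
 * covers a box one pixel wide. */
static struct box vline_to_box(int x, int y1, int y2)
{
    assert(y1 <= y2);
    struct box b = { x, y1, x + 1, y2 + 1 };
    return b;
}

/* Merge horizontally adjacent boxes with the same vertical extent.
 * 'b' must be sorted by x1; the merged boxes are written back into
 * 'b' and their count is returned. */
static size_t coalesce_boxes(struct box *b, size_t n)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            b[out - 1].y1 == b[i].y1 && b[out - 1].y2 == b[i].y2 &&
            b[out - 1].x2 == b[i].x1) {
            b[out - 1].x2 = b[i].x2;   /* extend the previous box */
        } else {
            b[out++] = b[i];
        }
    }
    return out;
}
```

One hundred lines packed side by side collapse into a single 100x100 box, while the same lines spaced a pixel apart stay as separate boxes, so the server has to draw each one.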

I've pushed out these changes to my x11perf repository on freedesktop.org.

What's Fast

Things that match GL's capabilities are fast, things which don't are slow. No surprises there. What's interesting is precisely what matches GL.

Patterns For Free

Because GL makes it easy to program fill patterns into the GPU, there are essentially no performance differences between solid and patterned operations.

GL Lines

Glamor uses GL lines, which can be programmed to match X semantics, to quite good effect. The only trick required was to deal with cap styles. GL never draws the final pixel in a line, while X does unless the cap style is CapNotLast. The solution was to draw an extra line segment containing a single pixel at the end of every joined set of lines for this case.
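
The extra-segment trick can be modelled without any GL at all. In this sketch a segment is half-open, as in GL, so a one pixel segment starting at the final point lights exactly that pixel. The types, the enum names and the choice to run the extra segment one pixel to the right are assumptions for illustration, and special cases such as closed polylines are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct point { int x, y; };
struct segment { struct point a, b; };

enum cap_style { CAP_NOT_LAST, CAP_BUTT };  /* hypothetical names */

/* Convert a joined polyline into GL-style segments, each of which
 * omits its final pixel. For cap styles other than CapNotLast, a
 * one pixel segment is appended to light the last point.
 * 'out' needs room for npt segments; returns the number written. */
static size_t polyline_to_segments(const struct point *pt, size_t npt,
                                   enum cap_style cap, struct segment *out)
{
    size_t n = 0;

    assert(npt >= 2);
    for (size_t i = 0; i + 1 < npt; i++)
        out[n++] = (struct segment) { pt[i], pt[i + 1] };

    if (cap != CAP_NOT_LAST) {
        struct point last = pt[npt - 1];

        /* Runs from the last point to its right-hand neighbour, so
         * the half-open rule draws exactly the last pixel. */
        out[n++] = (struct segment) { last, { last.x + 1, last.y } };
    }
    return n;
}
```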

The other implicit requirement is that all zero width lines look the same. Right now, I've solved that for fill styles and raster ops as they're all drawn with the same GL operations. However, for plane masks, we're currently falling back to software, which may draw something different. Fixing that isn't impossible, it's just tedious.


Pushing all of the work of drawing core text into glamor wasn't terribly difficult, and the results are pretty spectacular.

What's Slow

We've still got room for improvement in Glamor, but there aren't any obvious show-stoppers to getting great performance for reasonable X applications anymore.

Wide Lines and Arcs

One of the speed-ups I've made in my glamor branch is to merge all of the drawing of multiple filled and zero-width arcs into a single giant GL request. That turned out to both improve performance and save a bit of code. Right now, drawing wide lines and wide arcs doesn't do this, and so we suffer from submitting many smaller requests to GL. It's hard to get excited about speeding any of this up as all of the wide primitives are essentially unused these days.

Filled Polygons

Because X only lets applications draw a single polygon in each request, Glamor can't really gain any efficiency from batching work unless we start looking ahead in the X protocol stream to see if the next request is another polygon. Alternatively, we could leave the span operation pending to see if more spans were coming before the X server went idle. Neither of these is all that exciting though; X polygons just aren't that useful.

Render Operations

These are still not structured to best fit modern GL; some work here would help a bunch. We've got a GSoC student ready to go at this though, so I expect we'll have much better numbers in a few months.

Window Operations

You wouldn't think that moving and resizing windows would be so limited by drawing performance, but x11perf tests these with tiny little windows, and each operation draws or copies only a couple of little rectangles, which makes GL quite expensive. Working on speeding up GL for small numbers of operations would help a bunch here.

Unexpected Results

Solid rectangles are actually running slower than patterned rectangles, and I really have no idea why. The CPU is close to idle during the 500x500 solid rectangle test (as you'd expect, given the workload), the vertex and fragment shaders look correct out of the compiler, and yet solid rectangles run at only 0.80 of the performance of the patterned rectangles.

GL semantics for copying data essentially preclude overlapping blts of any form. There's the NV_texture_barrier extension which at least lets us do blts within the same object, but even that doesn't define how overlapping blts work. So, we have to create a temporary copy for this operation to make it work right. I was worried that this would slow things down, but the Iris Pro 3D engine is enough faster than the 2D engine that even with the extra copy, large scrolls and copies within the same object are actually faster.


Here's a giant image showing the ratio of Glamor to both UXA and SNA running on the same machine, with all of the same software; the only change between runs was to switch the configured acceleration architecture.

Posted Sun 27 Apr 2014 01:46:50 AM PDT Tags: tags/fdo

Core Rendering with Glamor

I've hacked up the intel driver to bypass all of the UXA paths when Glamor is enabled so I'm currently running an X server that uses only Glamor for all rendering. There are still too many fall backs, and performance for some operations is not what I'd like, but it's entirely usable. It supports DRI3, so I even have GL applications running.

Core Rendering Status

I've continued to focus on getting the core X protocol rendering operations complete and correct; those remain a critical part of many X applications and are a poor match for GL. At this point, I've got accelerated versions of the basic spans functions, filled rectangles, text and copies.

GL and Scrolling

OpenGL has been on a many-year vendetta against one of the most common 2D accelerated operations -- copying data within the same object, even when that operation overlaps itself. This used to be the most performance-critical operation in X; it was used for scrolling your terminal windows and when moving windows around on the screen.

Reviewing the OpenGL 3.x spec, Eric and I both read the glCopyPixels specification as clearly requiring correct semantics for overlapping copy operations -- it says that the operation must be equivalent to reading the pixels and then writing the pixels. My CopyArea acceleration thus uses this path for the self-copy case. However, the ARB decided that having a well defined blt operation was too nice to the users, so the current 4.4 specification adds explicit language to assert that this is not well defined anymore (even in the face of the existing language which is pretty darn unambiguous).

I suspect we'll end up creating an extension that offers what we need here; applications are unlikely to stop scrolling stuff around, and GPUs (at least Intel) will continue to do what we want. This is the kind of thing that makes GL maddening for 2D graphics -- the GPU does what we want, and the GL doesn't let us get at it.

For implementations not capable of supporting the required semantic, someone will presumably need to write code that creates a temporary copy of the data.
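
In plain memory, that temporary-copy fallback looks like the sketch below: read every source pixel first, then write them all, which is exactly the "equivalent to reading the pixels and then writing the pixels" semantic. This is a CPU illustration with a made-up function name, not glamor code; a GL implementation would use a temporary texture rather than malloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy a w x h block of pixels within one buffer, going through a
 * temporary so overlapping source and destination are safe.
 * 'stride' is the row length in pixels. Returns 0 on success, -1 on
 * allocation failure. Both rectangles must lie inside the buffer. */
static int copy_area_via_temp(uint32_t *pixels, size_t stride,
                              size_t src_x, size_t src_y,
                              size_t dst_x, size_t dst_y,
                              size_t w, size_t h)
{
    uint32_t *tmp;

    assert(src_x + w <= stride && dst_x + w <= stride);
    if (w == 0 || h == 0)
        return 0;
    tmp = malloc(w * h * sizeof *tmp);
    if (!tmp)
        return -1;

    /* Read every source pixel first... */
    for (size_t y = 0; y < h; y++)
        memcpy(tmp + y * w, pixels + (src_y + y) * stride + src_x,
               w * sizeof *tmp);
    /* ...then write them all. */
    for (size_t y = 0; y < h; y++)
        memcpy(pixels + (dst_y + y) * stride + dst_x, tmp + y * w,
               w * sizeof *tmp);

    free(tmp);
    return 0;
}
```

The price is one extra copy of the affected area, in exchange for not having to reason about copy direction at all.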

PBOs for fall backs

For operations which Glamor can't manage, we need to fall back to using a software solution. Direct-to-hardware acceleration architectures do this by simply mapping the underlying GPU object to the CPU. GL doesn't provide this access, and it's probably a good thing as such access needs to be carefully synchronized with GPU access, and attempting to access tiled GPU objects with the CPU requires either piles of CPU code to 'de-tile' accesses (ala wfb), or special hardware detilers (like the Intel GTT).

However, GL does provide a fairly nice abstraction called pixel buffer objects (PBOs) which work to speed access to GPU data from the CPU.

The fallback code allocates a PBO for each relevant X drawable, asks GL to copy pixels in, and then calls fb, with the drawable now referencing the temporary buffer. On the way back out, any potentially modified pixels are copied back through GL and the PBOs are freed.

This turns out to be dramatically faster than malloc'ing temporary buffers as it allows the GL to allocate memory that it likes, and for it to manage the data upload and buffer destruction asynchronously.

Because X pixmaps can contain many X windows (the root pixmap being the most obvious example), they are often significantly larger than the actual rendering target area. As an optimization, the code only copies data from the relevant area of the pixmap, saving considerable time as a result. There's even an interface which further restricts that to a subset of the target drawable which the Composite function uses.
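
The sub-area optimization is easy to show with ordinary memory standing in for the PBO. The names and the 'struct rect' type are invented for this sketch; the point is only that the temporary buffer is sized by the target area, not by the whole pixmap, on both the way in and the way out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct rect { size_t x, y, w, h; };

/* Copy only 'area' out of a larger pixmap into a tightly packed
 * temporary buffer. 'stride' is the pixmap row length in pixels. */
static void download_area(const uint32_t *pixmap, size_t stride,
                          struct rect area, uint32_t *tmp)
{
    assert(area.x + area.w <= stride);
    for (size_t y = 0; y < area.h; y++)
        memcpy(tmp + y * area.w,
               pixmap + (area.y + y) * stride + area.x,
               area.w * sizeof *tmp);
}

/* Copy the (possibly modified) temporary back to the same area. */
static void upload_area(uint32_t *pixmap, size_t stride,
                        struct rect area, const uint32_t *tmp)
{
    assert(area.x + area.w <= stride);
    for (size_t y = 0; y < area.h; y++)
        memcpy(pixmap + (area.y + y) * stride + area.x,
               tmp + y * area.w,
               area.w * sizeof *tmp);
}
```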

Using Scissoring for Clipping

The GL scissor operation provides a single clipping rectangle. X provides a list of rectangles to clip to. There are two obvious ways to perform clipping here -- either perform all clipping in software, or hand each X clipping rectangle in turn to GL and re-execute the entire rendering operation for each rectangle.

You'd think that the former plan would be the obvious choice; clearly re-executing the entire rendering operation potentially many times is going to take a lot of time in the GPU.

However, the reality is that most X drawing occurs under a single clipping rectangle. Accelerating this common case by using the hardware clipper provides enough benefit that we definitely want to use it when it works. We could duplicate all of the rendering paths and perform CPU-based clipping when the number of rectangles was above some threshold, but the additional code complexity isn't obviously worth the effort, given how infrequently it will be used. So I haven't bothered. Most operations look like this:

Allocate VBO space for data

Fill VBO with X primitives

loop over clip rects {
    set the GL scissor to the clip rect
    draw the VBO contents
}

This obviously out-sources as much of the problem as possible to the GL library, reducing the CPU time spent in glamor to a minimum.

A Peek at Some Code

With all of these changes in place, drawing something like a list of rectangles becomes a fairly simple piece of code:

First, make sure the program we want to use is available and can be used with our GC configuration:

prog = glamor_use_program_fill(pixmap, gc,

if (!prog)
    goto bail_ctx;

Next, allocate the VBO space and copy all of the X data into it. Note that the data transfer is simply 'memcpy' here -- that's because we break the X objects apart in the vertex shader using instancing, avoiding the CPU overhead of computing four corner coordinates.

/* Set up the vertex buffers for the points */

v = glamor_get_vbo_space(drawable->pScreen, nrect * (4 * sizeof (GLshort)), &vbo_offset);

glVertexAttribDivisor(GLAMOR_VERTEX_POS, 1);
                      4 * sizeof (GLshort), vbo_offset);

memcpy(v, prect, nrect * sizeof (xRectangle));
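
To make the instancing idea concrete, here is a CPU model of what a per-instance vertex shader could compute. Each rectangle arrives as one instance holding x, y, width and height, and the vertex index 0..3 picks a corner in triangle-strip order. The struct and function names are made up, and the bit trick is one plausible way to do it, not a quote from glamor's shader.

```c
#include <assert.h>

/* Same layout idea as an X rectangle: four 16 bit values. */
struct xrect { short x, y; unsigned short width, height; };
struct vec2 { int x, y; };

/* Bit 0 of vertex_id selects left/right, bit 1 selects top/bottom,
 * giving the order: top-left, top-right, bottom-left, bottom-right. */
static struct vec2 rect_corner(struct xrect r, int vertex_id)
{
    assert(vertex_id >= 0 && vertex_id < 4);
    struct vec2 v = {
        r.x + ((vertex_id & 1) ? r.width : 0),
        r.y + ((vertex_id & 2) ? r.height : 0),
    };
    return v;
}
```

Because the GPU does this per vertex, the CPU only has to copy the four shorts per rectangle into the VBO, which is why a bare memcpy is enough.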


Finally, loop over the pixmap tile fragments, and then over the clip list, selecting the drawing target and painting the rectangles:


glamor_pixmap_loop(pixmap_priv, box_x, box_y) {
    int nbox = RegionNumRects(gc->pCompositeClip);
    BoxPtr box = RegionRects(gc->pCompositeClip);

    glamor_set_destination_drawable(drawable, box_x, box_y, TRUE, FALSE, prog->matrix_uniform, &off_x, &off_y);

    /* Draw all of the rectangles once per clip box */
    while (nbox--) {
        glScissor(box->x1 + off_x,
                  box->y1 + off_y,
                  box->x2 - box->x1,
                  box->y2 - box->y1);
        box++;
        glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, nrect);
    }
}

GL texture size limits

X pixmaps use 16 bit dimensions for width and height, allowing them to be up to 65535 x 65535 pixels. Because the X coordinate space is signed, only a quarter of this space is actually useful, which makes the useful size of X pixmaps only 32767 x 32767. This is still larger than most GL implementations offer as a maximum texture size though, and while it would be nice to just say 'we don't allow pixmaps larger than GL textures', the reality is that many applications expect to be able to allocate such pixmaps today, largely to hold the ever increasing size of digital photographs.

Glamor has always supported large X pixmaps; it does this by splitting them up into tiles, each of which is no larger than the largest texture supported by the driver. What I've added to Glamor is some simple macros that walk over the array of tiles, making it easy for the rendering code to support large pixmaps without needing any special case code.
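
The flavour of such a tile walk can be sketched as follows. The macro and helper names here are invented; they are not glamor's macros, and the maximum texture size is just a parameter.

```c
#include <assert.h>

struct tile { int x, y, width, height; };

/* Number of tiles needed to cover 'size' pixels with tiles of at
 * most 'max' pixels. */
static int tiles_across(int size, int max)
{
    assert(size > 0 && max > 0);
    return (size + max - 1) / max;
}

/* Geometry of tile (tx, ty) in a pixmap_w x pixmap_h pixmap; tiles
 * on the right and bottom edges are trimmed to the pixmap size. */
static struct tile tile_at(int pixmap_w, int pixmap_h, int max,
                           int tx, int ty)
{
    struct tile t;

    t.x = tx * max;
    t.y = ty * max;
    t.width = pixmap_w - t.x < max ? pixmap_w - t.x : max;
    t.height = pixmap_h - t.y < max ? pixmap_h - t.y : max;
    return t;
}

/* Walk every tile of a w x h pixmap, row by row. */
#define for_each_tile(tx, ty, w, h, max)                      \
    for ((ty) = 0; (ty) < tiles_across((h), (max)); (ty)++)   \
        for ((tx) = 0; (tx) < tiles_across((w), (max)); (tx)++)
```

Rendering code written against a loop like this handles a small pixmap as the degenerate case of a single tile, so no separate path is needed.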

Glamor also had some simple testing support -- you can compile the code to ignore the system-provided maximum texture size and supply your own value. This code had gone stale, and couldn't work as there were parts of the code for which tiling support just doesn't make sense, like the glyph cache, or the X scanout buffer. I fixed things so that you could leave those special cases as solitary large tiles while breaking up all other pixmaps into tiles no larger than 32 pixels square.

I hope to remove the single-tile case and leave the code supporting only the multiple-tile case; we have to have the latter code, and so having the single-tile code around simply increases our code size for no obvious benefit.

Getting accelerated copies between tiled pixmaps added a new coordinate system to the mix and took a few hours of fussing until it was working.

Rebasing Many (many) Times

I'm sure most of us remember the days before git; changes were often monolithic, and the notion of changing how the changes were made for the sake of clarity never occurred to anyone. It used to be that the final code was the only interesting artifact; how you got there didn't matter to anyone. Things are different today; I probably spend a third of my development time communicating how the code should change with other developers by changing the sequence of patches that are to be applied.

In the case of Glamor, I've now got a set of 28 patches. The first few are fixes outside of the glamor tree that make the rest of the server work better. Then there are a few general glamor infrastructure additions. After that, each core operation is replaced, one at a time. Finally, a bit of stale code is removed. By sequencing things in a logical fashion, I hope to make review of the code easier, which should mean that people will spend less time trying to figure out what I did and be able to spend more time figuring out if what I did is reasonable and correct.

Supporting Older Versions of GL

All of the new code uses vertex instancing to move coordinate computation from the CPU to the GPU. I'm also pulling textures apart using integer operations. Right now, we should correctly fall back to software for older hardware, but it would probably be nicer to just fall back to simpler GL instead. Unless everyone decides to go buy hardware with new enough GL driver support, someone is going to need to write simplified code paths for glamor.

If you've got such hardware, and are interested in making it work well, please take this as an opportunity to help yourself and others.

Near-term Glamor Goals

I'm pretty sure we'll have the code in good enough shape to merge before the window closes for X server 1.16. Eric is in charge of the glamor tree, so it's up to him when stuff is pulled in. He and Markus Wick have also been generating code and reviewing stuff, but we could always use additional testing and review to make the code as good as possible before the merge window closes.

Markus has expressed an interest in working on Glamor as a part of the X.org summer of code this year; there's clearly plenty of work to do here. Eric and I haven't touched the render acceleration stuff at all, and that code could definitely use some updating to use more modern GL features.

If that works as well as the core rendering code changes, then we can look forward to a Glamor which offers GPU-limited performance for classic X applications, without requiring GPU-specific drivers for every generation of every chip.

Posted Sat 22 Mar 2014 01:20:34 AM PDT Tags: tags/fdo