Implementing Vblank Synchronization in the Present Extension

This is mostly a status update on how the Present extension is doing; the big news this week is that I've finished implementing vblank synchronized blts and flips, and things seem to be working quite well.

Vblank Synchronized Blts

The goal here is to have the hardware executing the blt operation in such a way as to avoid any tearing artifacts. In current drivers, there are essentially two different ways to make this happen:

  1. Insert a wait command into the ring, immediately preceding the blt operation, which blocks execution until a suitable point in the scanout cycle.

  2. Queue the blt operation at vblank time so that it executes before the scanout starts.

Option 1. provides the fewest artifacts; if the hardware can blt faster than scanout, there shouldn't ever be anything untoward visible on the screen. However, it also blocks future command execution within the same context. For example, if two vblank synchronized blts are queued at the same time, it's possible for the second blt to be delayed by yet another frame, causing both applications to run at half of the frame rate.

Option 2. avoids blocking the hardware, allowing for ongoing operations to proceed on the hardware without waiting for the synchronized blt to complete. However, it can cause artifacts if the delay from the vblank event to the eventual execution of the blt command is too long.

Queuing the blt right when it needs to execute means that we also have the opportunity to skip some blts; if the application presents two buffers within the same frame time, the blt of the first buffer can be skipped, saving memory bandwidth and time.

Present uses Option 2, which may occasionally cause a tearing artifact, but avoids slowing down applications while allowing the X server to discard overlapping blt operations when possible.

Queuing the Blt at Vblank

There are several options for getting the blt queued and executed when the vblank occurs:

  1. Queue the blt from the interrupt handler

  2. Queue the blt from a kernel thread running in response to the interrupt

  3. Send an event up to user space and have the X server construct the blt command.

These are listed in increasing maximum latency, but also in decreasing complexity.

Option 1. is made more complicated as much of the work necessary to get a command queued to the hardware cannot be done from interrupt context. One can imagine having the desired command already present in the ring buffer and have the interrupt handler simply move the ring tail pointer value. Future operations to be queued before the vblank operation could then re-write the ring as necessary. A queued operation could also be adjusted by the X server as necessary to keep it correct across changes to the window system state.

Option 2. is similar, but the kernel implementation should be quite a bit simpler as the queuing operation is done in process context and can use the existing driver infrastructure. For the X server, this is the same as Option 1, requiring that it construct a queued blt operation and deliver that to the kernel, and then revoke and re-queue if the X server state changed before the operation was completed.

Option 3 is the simplest of all, requiring no changes within the kernel and few within the X server. The X server waits to receive a vblank notification event for the appropriate frame and then simply invokes existing mechanisms to construct and queue the blt operation to the kernel.

Unsurprisingly, Present currently uses Option 3. If that proves to generate too many display artifacts, we can come back and change the code to try something more complicated.

Flipping the Frame Buffer

Taking advantage of the hardware's ability to quickly shift scanout from one chunk of memory to another is critical to providing efficient buffer presentation within the X server. It is slightly more complicated to implement than simply copying data to the current scanout buffer for a few reasons:

  1. The presented pixmap is owned by the application, and so it shouldn't be used except when the presented window covers the whole screen. When the window gets reconfigured, we end up copying the window's pixmap to the regular screen pixmap.

  2. The kernel flipping API is asynchronous, and doesn't provide any abort mechanism. This isn't usually much of an issue; we simply delay reporting the actual time of flip until the kernel sends the notification event to the X server. However, if the window is reconfigured or destroyed while the flip is still pending, cleaning that up must wait until the flip has finished.

  3. The application's buffer remains 'busy' until it is no longer being used for scanout; that means that applications will have to be aware of this and ensure that they don't deadlock waiting for the current scanout buffer to become idle before switching to a new scanout buffer.

Present is different from DRI2 in using application-allocated buffers for this operation. For DRI2, when flipping to a window buffer, that buffer becomes the screen pixmap -- the driver flips the new buffer object into the screen pixmap and releases the previous buffer object for other use. For Present, as the buffer is owned by the application, I figured it would be better to switch back to the 'real' screen buffer when necessary. This also means that applications aren't left holding a handle to the frame buffer, which seems like it might be a nice feature.

The hardest part of this work was dealing with client and server shutdown, where objects get deleted in arbitrary orders while other data structures still hold references to them.

(The kernel DRM drivers use the term 'page flipping' to mean an atomic flip between one frame buffer and another, generally implemented by simply switching the address used for the scanout buffer. I'd like to avoid using the word 'page' in this context as we're not flipping memory pages individually, but rather a huge group of memory that forms an entire frame buffer. We could use 'plane flipping' (as intel docs do), 'frame buffer flipping' (but that's a mouthful), 'display flipping' or almost anything but 'page flipping').

Overall DRI3000 Status

At this point, the DRI3 extension is complete and the Present extension is largely complete, except for redirection for compositors. The few piglit tests for GLX_OML_sync_control all pass now, which is more than DRI2 manages.

I think I've effectively replicated the essential features of DRI2 while offering room to implement a couple of new GL extensions:

  • GLX_EXT_swap_control_tear. This will provide applications with the ability to avoid dropping frames when pushing the hardware just over the frame rate limit.

  • EGL_EXT_buffer_age. (I assume we'll probably want a GLX version as well?) This will allow compositors to more efficiently perform partial updates in a flipping environment, and is enabled by having all of the buffer management within the GL library.

The code for this stuff has all been pushed to a number of repositories:

  • git:// master. DRI3 protocol specification and X server headers.
  • git:// master. Present protocol specification and X server headers.
  • git:// dri3. XCB protocol defines for both DRI3 and Present.
  • git:// dri3. XCB library changes for file descriptor passing.
  • git:// dri3. X server with file descriptor passing, DRI3 and Present support.
  • git:// dri3. Mesa with DRI3/Present support for GLX.
  • git:// dri3. DRM library with defines for async flipping.
  • git:// dri3. Intel driver with DRI3, Present and async flipping support.
  • git:// dri3. Kernel with async flipping.