Shared Memory Fences
In our last adventure, dri3k first steps, one of the ‘future work’ items was to deal with synchronization between the direct rendering application and the X server. DRI2 “handles” this by performing a round trip each time the application starts using a buffer that was being used by the X server.
As DRI3 manages buffer allocation within the application, there’s really no reason to talk to the server, so this implicit serialization point just isn’t available to us. As I mentioned last time, James Jones and Aaron Plattner added an explicit GPU serialization system to the Sync extension. These SyncFences serialize rendering between two X clients, but within the server there are hooks provided for the driver to use hardware-specific serialization primitives.
The existing Linux DRM interfaces queue rendering to the GPU in the order requests are made to the kernel, so we don’t need the ability to serialize within the GPU, we just need to serialize requests to the kernel. Simple CPU-based serialization gating access to the GPU will suffice here, at least for the current set of drivers. GPU access which is not mediated by the kernel will presumably require serialization that involves the GPU itself. We’ll leave that for a future adventure though; the goal today is to build something that works with the current Linux DRM interfaces.
SyncFence Semantics
The semantics required by SyncFences are that multiple clients can block on a fence which a single client then triggers. All of the blocked clients start executing requests immediately after the trigger fires.
There are four basic operations on SyncFences:
- Trigger. Mark the fence as ready and wake up all waiting clients.
- Await. Block until the fence is ready.
- Query. Retrieve the current state of the fence.
- Reset. Unset the fence; future Await requests will block.
SyncFences are the same as Events as provided by Python and other systems. Of course all of the names have been changed to keep things interesting. I’ll call them Fences here, to be consistent with the current X usage.
Using Pthread Primitives
One fact about pthreads that I recently learned is that the synchronization primitives (mutexes, barriers and semaphores) are actually supposed to work across process boundaries, if those objects are in shared memory mapped by each process. That seemed like a great simplification for this project: allocate a page of shared memory, map it into the X server and the direct rendering application, and use the existing pthreads APIs.
Alas, the pthread objects are architecture specific. I’m pretty sure that when that spec was written, no-one ever thought of running multiple architectures within the same memory space. I went and looked at the code to check, and found that each of these objects has a different size and structure on x86 and x86_64 architectures. That makes it pretty hard to use this API within X as we often have both 32- and 64- bit applications talking to the same (presumably 64-bit) X server.
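One quick way to see the mismatch is to print the sizes of a few pthread objects and compare a 32-bit build with a 64-bit build (for example `gcc -m32` against `gcc -m64`). The sketch below is only illustrative: the helper name `print_pthread_sizes` is invented here, and the exact numbers depend on the C library in use.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

/* Hypothetical helper: print the size of each kind of pthread object
 * mentioned above and return how many lines were printed. */
int print_pthread_sizes(void)
{
    int lines = 0;

    lines += printf("pthread_mutex_t:   %zu bytes\n",
                    sizeof(pthread_mutex_t)) > 0;
    lines += printf("pthread_barrier_t: %zu bytes\n",
                    sizeof(pthread_barrier_t)) > 0;
    lines += printf("sem_t:             %zu bytes\n",
                    sizeof(sem_t)) > 0;
    return lines;
}
```

Calling `print_pthread_sizes()` from `main` in each build and comparing the two outputs shows whether the objects could share a layout.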
As a last resort, I read through a bunch of articles on using futexes directly within applications and decided that it was probably possible to implement what I needed in an architecture-independent fashion.
Futexes
Linux Futexes live in this strange limbo of being a not-quite-public kernel interface. Glibc uses them internally to implement locking primitives, but it doesn’t export any direct interface to the system call. Certainly they’re easy to use incorrectly, but it’s unusual in the Linux space to have our fundamental tools locked away ‘for our own safety’.
Fortunately, we can still get at futexes by creating our own syscall wrappers.
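#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>
static inline long sys_futex(void *addr1, int op, int val1,
                             struct timespec *timeout, void *addr2, int val3)
{
    return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}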
For this little exercise, I created two simple wrappers, one to block on a futex:
static inline int futex_wait(int32_t *addr, int32_t value) {
    return sys_futex(addr, FUTEX_WAIT, value, NULL, NULL, 0);
}
and one to wake up all futex waiters:
static inline int futex_wake(int32_t *addr) {
    return sys_futex(addr, FUTEX_WAKE, MAXINT, NULL, NULL, 0);
}
Atomic Memory Operations
I need atomic memory operations to keep separate cores from seeing different values of the fence. GCC defines a few such primitives, and I picked __sync_bool_compare_and_swap and __sync_val_compare_and_swap. I also need fetch and store operations that the compiler won’t shuffle around:
#define barrier() __asm__ __volatile__("": : :"memory")

static inline void atomic_store(int32_t *f, int32_t v)
{
    barrier();
    *f = v;
    barrier();
}

static inline int32_t atomic_fetch(int32_t *a)
{
    int32_t v;
    barrier();
    v = *a;
    barrier();
    return v;
}
If your machine doesn’t make these two operations atomic, then you would redefine these as needed.
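As one possible redefinition, here is a sketch built on the GCC `__atomic` builtins (available since GCC 4.7). The barrier() macro above only stops the compiler from reordering; these builtins also emit whatever hardware barriers the target needs. The names `atomic_store_seq` and `atomic_fetch_seq` are made up for this example, and sequentially consistent ordering is an assumption chosen for simplicity rather than a measured requirement.

```c
#include <stdint.h>

/* Store with sequentially consistent ordering; the compiler emits any
 * hardware fence the target architecture requires. */
static inline void atomic_store_seq(int32_t *f, int32_t v)
{
    __atomic_store_n(f, v, __ATOMIC_SEQ_CST);
}

/* Load with sequentially consistent ordering. */
static inline int32_t atomic_fetch_seq(int32_t *a)
{
    return __atomic_load_n(a, __ATOMIC_SEQ_CST);
}
```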
Futex-based Fences
The wake-all semantics of Fences greatly simplify reasoning about the operation, as there’s no need to ensure that only a single thread runs past Await; the only requirement is that no threads pass the Await operation until the fence is triggered.
A Fence is defined by a single 32-bit integer which can take one of three values:
- 0 - The fence is not triggered, and there are no waiters.
- 1 - The fence is triggered (there can be no waiters at this point).
- -1 - The fence is not triggered, and there are waiters (one or more).
With those, I built the fence operations as follows. Here’s Await:
int fence_await(int32_t *f)
{
    while (__sync_val_compare_and_swap(f, 0, -1) != 1) {
        if (futex_wait(f, -1)) {
            if (errno != EWOULDBLOCK)
                return -1;
        }
    }
    return 0;
}
The basic requirement that the thread not run until the fence is triggered is met by fetching the current value of the fence and comparing it with 1. Until it is signaled, that comparison will return false.
The compare-and-swap operation makes sure the fence is -1 before the thread calls futex_wait: either it was already -1 in the case where there were other waiters, or it was 0 before and is now -1 in the case where there were no waiters before. This needs to be an atomic operation so that the fence value will be seen as -1 by the trigger operation if there are any threads in the syscall.
The futex_wait call will return once the value is no longer -1. It also ensures that the thread won’t block if the trigger occurs between the swap and the syscall, because the kernel checks that the value is still -1 before putting the thread to sleep.
Here’s the Trigger function:
int fence_trigger(int32_t *f)
{
    if (__sync_val_compare_and_swap(f, 0, 1) == -1) {
        atomic_store(f, 1);
        if (futex_wake(f) < 0)
            return -1;
    }
    return 0;
}
The atomic compare-and-swap operation will make sure that no Await thread swaps the 0 for a -1 while the trigger is changing the value from 0 to 1; either the Await switches from 0 to -1 or the Trigger switches from 0 to 1.
If the value before the compare-and-swap was -1, then there may be threads waiting on the Fence. An atomic store, constructed with two memory barriers and a regular store operation, marks the Fence triggered and is followed by the futex_wake call to unblock all Awaiting threads.
The Query function is just an atomic fetch:
int fence_query(int32_t *f)
{
    return atomic_fetch(f) == 1;
}
Reset requires a compare-and-swap so that it doesn’t disturb things if the fence has already been reset and there are threads waiting on it:
void fence_reset(int32_t *f)
{
    __sync_bool_compare_and_swap(f, 1, 0);
}
A Request for Review
Ok, so we’ve all tried to create synchronization primitives only to find that our ‘obvious’ implementations were full of holes. I’d love to hear from you if you’ve identified any problems in the above code, or if you can figure out how to use the existing glibc primitives for this operation.
DRI3K — First Steps
Here’s an update on DRI3000. I’ll start by describing what I’ve managed to get working and then summarize discussions that happened on the xorg-devel mailing list.
Private Back Buffers
One of the big goals for DRI3000 is to finish the job of moving buffer management out of the X server and into applications. The only things still allocated by DRI2 in the X server are back buffers; everything else has moved to the client side. Yes, I know, this breaks the GLX requirement for sharing buffers between applications, but we just don’t care anymore.
As a quick hack, I figured out how to do this with DRI2 today — allocate our back buffers separately by creating X pixmaps for them, and then using the existing DRI2GetBuffersWithFormat request to get a GEM handle for them.
Of course, now that all I’ve got is a pixmap, I can’t use the existing DRI2 swap buffer support, so for now I’m just using CopyArea to get stuff on the screen. But, that works fine, as long as you don’t care about synchronization.
Handling Window Resize
The biggest pain in DRI2 has been dealing with window resize. When the window resizes in the X server, a new back buffer is allocated and the old one discarded. An event is delivered to ‘invalidate’ the old back buffer, but anything done between the time the back buffer is discarded and when the application responds to the event is lost.
You can easily see this with any GL application today — resize the window and you’ll see occasional black frames.
By allocating the back buffer in the application, the application handles the resize within GL; at some point in the rendering process the resize is discovered, and GL creates a new buffer, copies the existing data over, and continues rendering. So, the rendered data are never lost, and every frame gets displayed on the screen (although, perhaps at the wrong size).
The puzzle here was how to tell that the window was resized. Ideally, we’d have the application tell us when it received the X configure notify event and was drawing the frame at the new size. We thought of a cute hack that might do this; track GL calls to change the viewport and make sure the back buffer could hold the viewport contents. In theory, the application would receive the X configure notify event, change the viewport and render at the new size.
Tracking the viewport settings for an entire frame and constructing their bounding box should describe the size of the window; at least it should describe the intended size of the window.
There’s at least one serious problem with this plan — applications may well call glClear before calling glViewport, and as glClear does not use the current viewport, instead clearing the “whole” window, we couldn’t use the viewport as an indication of the current window size.
However, what this exercise did lead us to realize was that we don’t care what size the window actually is, we only care what size the application thinks it is. More accurately, the GL library just needs to be aware of any window configuration changes before the application, so that it will construct a buffer that is not older than the application knowledge of the window size.
I came up with two possible mechanisms here; the first was to construct a shared memory block between application and X server where the X server would store window configuration changes and signal the application by incrementing a sequence number in the shared page; the GL library would simply look at the sequence number and reallocate buffers when it changed.
The problem with the shared memory plan was that it wouldn’t work across the network, and we have a future project in mind to replace GLX indirect rendering with local direct rendering and PutImage which still needs accurate window size tracking. More about that project in a future post though…
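For illustration only, the shared-page idea might have looked something like the sketch below. All of the names (`shared_config`, `client_view`, `server_configure`, `client_check`) and the field layout are assumptions made up for this example. A real version would also need something like a seqlock so the client never reads a half-updated size; this sketch ignores that.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the page shared by X server and application. */
struct shared_config {
    uint32_t serial;         /* bumped by the server after each change */
    uint16_t width, height;  /* most recent window size */
};

/* Hypothetical state kept by the GL library in the application. */
struct client_view {
    uint32_t seen_serial;
    uint16_t width, height;
};

/* Server side: store the new size, then publish it by bumping serial. */
void server_configure(struct shared_config *s, uint16_t w, uint16_t h)
{
    s->width = w;
    s->height = h;
    __atomic_add_fetch(&s->serial, 1, __ATOMIC_RELEASE);
}

/* Client side: returns true when the buffers need to be reallocated. */
bool client_check(const struct shared_config *s, struct client_view *c)
{
    uint32_t serial = __atomic_load_n(&s->serial, __ATOMIC_ACQUIRE);

    if (serial == c->seen_serial)
        return false;
    c->width = s->width;
    c->height = s->height;
    c->seen_serial = serial;
    return true;
}
```

The cheap path is a single load and compare, which is what makes polling it on every GL drawing call plausible.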
X Events to the Rescue
So, I decided to just have the X server send me events when the window size changed. I could simply use the existing X configure notify events, but that would require a huge infrastructure change in the application so that my GL library could get those events and have the application also see them. Not knowing what the application is up to, we’d have to track every ChangeWindowAttributes call and make sure the event_mask included the right bits. Ick.
Fortunately, there’s another reason to use a new event — we need more information than is provided in the ConfigureNotify event; as you know, the Swap extension wants to have applications draw their content within a larger buffer that can have the window decorations placed around it to avoid a copy from back buffer to window buffer. So, our new ConfigureNotify event would also contain that information.
Making sure that this new ConfigureNotify event is delivered before the core ConfigureNotify event ensures that the GL library will always know about window size changes before the application does.
Splitting the XCB Event Stream
Ok, so I’ve got these new events coming from the X server. I don’t want the application to have to receive them and hand them down to the GL library; that would mean changing every application on the planet, something which doesn’t seem very likely at all.
Xlib does this kind of thing by allowing applications to stick themselves into the middle of the event processing code with a callback to filter out the events they’re interested in before they hit the main event queue. That’s how DRI2 captures Invalidate events, and it “works”, but using callbacks from the middle of the X event processing code creates all kinds of locking nightmares.
As discussed above, I don’t care when GL sees the configure events, as long as it gets them before the application finds out about the window size change. So, we don’t need to synchronously handle these events, we just need to be able to know they’ve arrived and then handle them on the next call to a GL drawing function.
What I’ve created as a prototype is the ability to identify specific events and place them in a separate event queue, and when events are placed in that event queue, to bump a ‘sequence number’ so that the application can quickly identify that there’s something to process.
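The shape of that prototype can be sketched in plain C as a side queue plus a counter. This is not the XCB code; the types and functions below (`struct event`, `struct special_queue`, `special_queue_filter`, `special_queue_pop`) are hypothetical stand-ins chosen to show the idea of diverting matching events and bumping a stamp.

```c
#include <stddef.h>
#include <stdint.h>

#define SPECIAL_QUEUE_MAX 16

/* Hypothetical stand-in for an event read off the wire. */
struct event {
    uint8_t  type;
    uint32_t width, height;
};

/* Hypothetical side queue for one kind of event. */
struct special_queue {
    uint8_t      match_type;                 /* event type to divert */
    struct event events[SPECIAL_QUEUE_MAX];  /* ring buffer */
    size_t       head, count;
    uint32_t     stamp;                      /* bumped on every enqueue */
};

/* Returns 1 if the event was diverted to the side queue, 0 if it
 * belongs in the main queue, -1 if the side queue is full. */
int special_queue_filter(struct special_queue *q, const struct event *ev)
{
    if (ev->type != q->match_type)
        return 0;
    if (q->count == SPECIAL_QUEUE_MAX)
        return -1;
    q->events[(q->head + q->count) % SPECIAL_QUEUE_MAX] = *ev;
    q->count++;
    q->stamp++;
    return 1;
}

/* Pops the oldest diverted event; returns 0 when the queue is empty. */
int special_queue_pop(struct special_queue *q, struct event *out)
{
    if (q->count == 0)
        return 0;
    *out = q->events[q->head];
    q->head = (q->head + 1) % SPECIAL_QUEUE_MAX;
    q->count--;
    return 1;
}
```

A GL drawing function would remember the last stamp it saw and only drain the queue when the stamp has changed.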
Making the Event Mask Per-API Instead of Per-Client
The problem described above about using the core ConfigureNotify events made me think about how to manage multiple APIs all wanting to track window configuration. For core events, the selection of which events to receive is all based on the client; each client has a single event mask, and each client receives one copy of each event.
Monolithic applications work fine with this model; there’s one place in the application selecting for events and one place processing them. However, modern applications end up using different APIs for 3D, 2D and media. Getting those libraries to cooperate and use a common API for event management seems pretty intractable. Making the X server treat each API as a separate entity seemed a whole lot easier; if two APIs want events, just have them register separately and deliver two events flagged for the separate APIs.
So, the new DRI3 configure notify events are created with their own XID to identify the client-side owner of the event. Within the X server, this required a tiny change; we already needed to allocate an XID for each event selection so that it could be automatically cleaned up when the client exited, so the only change was to use the one provided by the client instead of allocating one in the server.
On the wire, the event includes this new XID so that the library can use it to sort out which event queue to stick the event in using the new XCB event stream splitting code.
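A minimal sketch of sorting events by that XID follows. The `router` and `selection` types are invented for illustration and say nothing about how libxcb actually stores its queues; the point is only that each library registers its own ID and gets its own copy of the event.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SELECTIONS 8

/* Hypothetical record of one event selection made by a library. */
struct selection {
    uint32_t eid;        /* XID chosen by the client for this selection */
    unsigned delivered;  /* events routed to this selection so far */
};

struct router {
    struct selection sel[MAX_SELECTIONS];
    size_t count;
};

/* Register a selection under a client-provided XID; -1 if full. */
int router_register(struct router *r, uint32_t eid)
{
    if (r->count == MAX_SELECTIONS)
        return -1;
    r->sel[r->count].eid = eid;
    r->sel[r->count].delivered = 0;
    r->count++;
    return 0;
}

/* Route one event by the XID carried on the wire; returns the owning
 * selection, or NULL if nobody registered that XID. */
struct selection *router_dispatch(struct router *r, uint32_t eid)
{
    for (size_t i = 0; i < r->count; i++) {
        if (r->sel[i].eid == eid) {
            r->sel[i].delivered++;
            return &r->sel[i];
        }
    }
    return NULL;
}
```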
Current Status
The above section describes the work that I’ve got running; with it, I can run GL applications and have them correctly track window size changes without losing a frame. It’s all available on the ‘dri3’ branches of my various repositories for xcb proto, libxcb, dri3proto and the X server.
Future Directions
The first obvious change needed is to move the configuration events from the DRI3 extension to the as-yet-unspecified new ‘Swap’ extension (which I may rename as ‘Present’, as in ‘please present this pixmap in this window’). That’s because they aren’t related to direct rendering, but rather to tracking window sizes for off-screen rendering, either direct, indirect or even with the CPU to memory.
DRI3 and Fences
Right now, I’m not synchronizing the direct rendering with the CopyArea call; that means the X server will end up with essentially random contents as the application may be mid-way through the next frame before it processes the CopyArea. A simple XSync call would suffice to fix that, but I want a more efficient way of doing this.
With the current Linux DRI kernel APIs, it is sufficient to serialize calls that post rendering requests to the kernel to ensure that the rendering requests are themselves serialized. So, all I need to do is have the application wait until the X server has sent the CopyArea request down to the kernel.
I could do that by having the X server send me an X event, but I think there’s a better way that will extend to systems that don’t offer the kernel serialization guarantee. James Jones and Aaron Plattner put together a proposal to add Fences to the X Sync extension. In the X world, those offer a method to serialize rendering between two X applications, but of course the real goal is to expose those fences to GL applications through the various GL sync extensions (including GL_ARB_sync and GL_NV_fence).
With the current Linux DRI implementation, I think it would be pretty easy to implement these fences using pthread semaphores in a block of memory shared between the server and application. That would be DRI-specific; other direct rendering interfaces would use alternate means to share the fences between X server and application.
Swap/Present — The Second Extension
By simply using CopyArea for my application presentation step, I think I’ve neatly split this problem into manageable pieces. Once I’ve got the DRI3 piece working, I’ll move on to fixing the presentation issue.
By making that depend solely on existing core Pixmap objects as the source of data to present, I can develop that without any reference to DRI. This will make the extension useful to existing X applications that currently have only CopyArea for this operation.
Presentation of application contents occurs in two phases; the first is to identify which objects are involved in the presentation. The second is to perform the presentation operation, either using CopyArea, or by swapping pages or the entire frame buffer. For offscreen objects, these can occur at the same time. For onscreen, the presentation will likely be synchronized with the scanout engine.
The second form will mean that the Fences that mark when the presentation has occurred will need to be signaled only once the operation completes.
A CopyArea operation means that the source pixmap is “ready” immediately after the Copy has completed. Doing the presentation by using the source pixmap as the new front buffer means that the source pixmap doesn’t become “ready” until after the next swap completes.
What I don’t know now is whether we’ll need to report up-front whether the presentation will involve a copy or a swap. At this point, I don’t think so — the application will need two back buffers in all cases to avoid blocking between the presentation request and the presentation execution. Yes, it could use a fence for this, but that still sticks a bubble in the 3D hardware where it’s blocked waiting for vblank instead of starting on the next frame immediately.
Plan of Attack
Right now, I’m working on finishing up the DRI3 piece:
- Replace the DRI2 buffer allocation kludge with actual local buffer allocation, mapping them into pixmaps using FD passing.
- Replace the DRI2 authentication scheme with having the X server open the DRI object, preparing it for rendering and passing it back to the application.
- Work on the XCB pieces to get the split event-queue stuff landed upstream.
- Implement the Fencing stuff to correctly serialize access to the pixmap.
The first three seem fairly straightforward. The fencing stuff will involve working with James and Aaron to integrate their XSync changes into the server.
After that, I’ll start working on the presentation piece. Foremost there is figuring out the right name for this new extension; I started with the name ‘Swap’ as that’s the GL call it implements. However, ‘Swap’ is quite misleading as to the actual functionality; a name more like ‘Present’ might provide a better indication of what it actually does. Of course, ‘Present’ is both a verb and a noun, with very different connotations. Suggestions on this most complicated part of the project are welcome!
Composite and Swap — Getting it Right
Where the author tries to make sure DRI3000 is going to do what we want now and in the future
DRI3000
The basic DRI3000 plan seems pretty straightforward:
- Have applications allocate buffers full of new window contents, attach pixmap IDs to those buffers and pass them to the X server to get them onto the screen.
- Provide a mechanism to let applications know when those pixmaps are idle so that they can reuse them instead of creating new ones for every frame.
- Finally, allow the actual presentation of the contents to be scheduled for a suitable time in the future, generally synchronized with the monitor. Let the client know when this has happened in case they want to synchronize themselves to vblank.
The DRI3 extension provides a way to associate pixmap IDs and buffers, and given the MIT-SHM prototype I’ve already implemented, I think we can safely mark this part as demonstrably implementable.
That leaves us with a smaller problem, that of taking pixmap contents and presenting them on the screen at a suitable time and telling applications about the progress of that activity.
In the absence of compositing, I’m pretty sure the initial Swap extension design would do this job just fine, and should resolve some of the known DRI2 limitations related to buffer management. And, I think that goal is sufficient motivation to go and implement that. However, I wanted to write up some further ideas to see if the DRI3000 plan can be made to do precisely what we want in a composited world.
The Composited Goal
To make sure we’re all on the same page, here’s what I expect from the Swap extension in a composited world:
- Application calls Swap with new window pixmap
- Compositor hears about the new pixmap and uses that to construct a new screen pixmap
- Compositor calls Swap with new screen pixmap
- Vertical retrace happens, executing the pending swap operation
- Compositor hears about the swap completion for the screen
- Application hears about the swap completion for its window
In particular, applications should not hear that their swap operations are complete until the contents appear on the screen. This allows for applications to throttle themselves to the screen rate, either doing double or triple buffering as they choose.
I didn’t add steps here indicating buffers going idle or being allocated, because I think that should all happen ‘behind the scenes’ from the application’s perspective. Many applications won’t care about the swap completion notification either, but some will and so that needs to be visible.
Redirected Swaps?
Owen Taylor suggested that one way of getting the compositor involved would be to have it somehow ‘redirect’ Swap operations, much like we do with window management operations today. I think that idea may be a good direction to try:
- Application calls Swap with new window pixmap
- Swap is redirected to compositor, passing along the new window pixmap
- Compositor constructs a new screen pixmap using the new window pixmap
- Compositor calls Swap on the screen and the window, passing the new screen pixmap and the new window pixmap. When the screen update occurs, the screen and the window both receive swap completion events.
This has the added benefit that the X server knows when the compositor is expecting window pixmaps to change like this — the compositor has to explicitly request Swap redirection.
Window Pixmap Names and GEM Buffer Handles
One issue that swapping window pixmaps around like this brings up is how to manage existing names for the window pixmap. Right now, applications expect that window pixmaps will only change when the window is resized. If the Swap extension is going to actually replace the window pixmap when running with a suitable compositor, then we need to figure out what the old names will reference.
Are there non-compositor applications using NameWindowPixmap that matter to us? How about non-compositor applications using TextureFromPixmap to get a GEM handle for a window pixmap? For now, I’m very tempted to just break stuff and see who complains, but knowing what we’re breaking might be nice beforehand.
Idling Pixmaps
When an application is done drawing to a window pixmap and has passed it off to the X server for presentation, we’d like for that pixmap to be automatically marked as discardable as soon as possible. This way, when memory is tight, the kernel can come steal those pages for something critical. Of course, applications may not want to let the server mark the pixmap as idle after being used, so a flag to the Swap call would be needed.
Ideally, the pixmap would become idle immediately after the pixmap contents have been extracted. In the absence of a compositor, that would probably be when the Swap operation completes. With a compositor running, we’d need explicit instruction from the compositor telling us that the window pixmap was now ‘idle’:
┌───
SwapIdle
drawable: Drawable
pixmap: Pixmap
▶
└───
Furthermore, the application needs to know that the pixmap is in fact idle. I think that we’ll need a synchronous X request that marks a buffer as ‘no longer idle’ and have that return whether the buffer was discarded while idle. It doesn’t seem sufficient to use events here as the application will need to completely reconstruct the pixmap contents in this case. This reply could also contain information about precisely what contents the pixmap does contain.
┌───
SwapReuse
drawable: Drawable
pixmap: Pixmap
▶
valid: BOOL
swap-hi: CARD32
swap-lo: CARD32
└───
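On the client side, handling such a reply might look like the sketch below. The struct is a hypothetical in-memory form of the fields listed above, not a wire encoding, and it assumes that swap-hi and swap-lo are the two halves of a 64-bit swap sequence number, which the request definition does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory form of the SwapReuse reply fields. */
struct swap_reuse_reply {
    bool     valid;    /* false: contents were discarded while idle */
    uint32_t swap_hi;  /* assumed: upper 32 bits of the swap sequence */
    uint32_t swap_lo;  /* assumed: lower 32 bits of the swap sequence */
};

/* Combine the two halves into one 64-bit sequence number. */
uint64_t swap_reuse_sequence(const struct swap_reuse_reply *r)
{
    return ((uint64_t) r->swap_hi << 32) | r->swap_lo;
}

/* The application has to reconstruct everything if the buffer was
 * discarded while it was marked idle. */
bool swap_reuse_needs_full_redraw(const struct swap_reuse_reply *r)
{
    return !r->valid;
}
```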
Pixmap Lifetimes and Triple Buffered Applications
If we redirect the Swap operation and send the original application window pixmap ID to the compositor, what happens when the application frees that pixmap before the compositor gets around to using the contents?
Surely the Compositor must handle such cases, and not just crash. However, I’m fine with requiring that the application not free the pixmap until told by the compositor.
x-on-resize: a simple display configuration daemon
I like things to be automated as much as possible, and having abandoned Gnome to their own fate and switched to xfce, I missed the automatic display reconfiguration stuff. I decided to write something as simple as possible that did just what I needed. I did this a few months ago, and when Carl Worth asked what I was using, I decided to pack it up and make it available.
Automatic configuration with a shell script
I’ve had a shell script around that I used to bind to a key press which I’d hit when I plugged or unplugged a monitor. So, all I really need to do is get this script run when something happens.
The missing tool here was something to wait for a change to happen and automatically invoke the script I’d already written.
Resize vs Configure
The first version of x-on-resize just listened for ConfigureNotify events on the root window. These get sent every time anything happens with the screen configuration, from hot-plug to notification when someone runs xrandr. That was as simple as possible; the application was a few lines of code to select for ConfigureNotify events, and invoke a program provided on the command line.
However, it was a bit too simple as it would also respond to manual invocations of xrandr and call the script then as well. So, as long as I was content to accept whatever the script did, things were fine. And, with a laptop that had a DisplayPort connector for my external desktop monitor, and a separate VGA connector for projectors at conferences, the script always did something useful.
Then I got this silly laptop that has only DisplayPort, and for which a dongle is required to get to VGA for projectors. I probably could write something fancy to figure out the difference between a desktop DisplayPort monitor and DisplayPort to VGA dongle, but I decided that solving the simpler problem of only invoking the script on actual hotplug events would be better.
So, I left the current invoke-on-resize behavior intact and added new code that watched the list of available outputs and invoked a new ‘config’ script when that set changed.
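The "did the set of outputs change" test is the only real logic involved. A sketch of one way to write it is shown here; this is not the x-on-resize source, the function names are invented, and fetching the output names from RandR is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if `name` appears somewhere in the list. */
static bool has_output(const char *const *list, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i], name) == 0)
            return true;
    return false;
}

/* Returns true when the two sets of connected output names differ.
 * Order is ignored; each list is assumed to contain no duplicates. */
bool outputs_changed(const char *const *old_list, size_t old_n,
                     const char *const *new_list, size_t new_n)
{
    if (old_n != new_n)
        return true;
    for (size_t i = 0; i < new_n; i++)
        if (!has_output(old_list, old_n, new_list[i]))
            return true;
    return false;
}
```

When this returns true the daemon would run the ‘config’ script; otherwise only the resize script runs.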
The final program, x-on-resize, is available via git at
git://people.freedesktop.org/~keithp/x-on-resize
I even wrote a manual page. Enjoy!
DRI3000 — Even Better Direct Rendering
This all started with the presentation that Eric Anholt and I did at the 2012 X developers conference, which I subsequently wrote about in my DRI-Next posting. That discussion sketched out the goals of changing the existing DRI2-based direct rendering infrastructure.
Last month, I gave a more detailed presentation at Linux.conf.au 2013 (the best free software conference in the world). That presentation was recorded, so you can watch it online. Or, you can read Nathan Willis’ summary at lwn.net. That presentation contained a lot more details about the specific techniques that will be used to implement the new system; in particular, it included some initial indications of what kind of performance benefits the overall system might be able to produce.
I sat down today and wrote down an initial protocol definition for two new extensions (because two extensions are always better than one). Together, these are designed to provide complete support for direct rendering APIs like OpenGL and offer a better alternative to DRI2.
The DRI3 extension
Dave Airlie and Eric Anholt refused to let me call either actual extension DRI3000, so the new direct rendering extension is called DRI3. It uses POSIX file descriptor passing to share kernel objects between the X server and the application. DRI3 is a very small extension with three requests:
- Open. Returns a file descriptor for a direct rendering device along with the name of the driver for a particular API (OpenGL, Video, etc).
- PixmapFromBuffer. Takes a kernel buffer object (Linux uses DMA-BUF) and creates a pixmap that references it. Any place a Pixmap can be used in the X protocol, you can now talk about a DMA-BUF object. This allows an application to do direct rendering, and then pass a reference to those results directly to the X server.
- BufferFromPixmap. This takes an existing pixmap and returns a file descriptor for the underlying kernel buffer object. This is needed for the GL Texture from Pixmap extension.
For OpenGL, the plan is to create all of the buffer objects on the client side, then pass the back buffer to the X server for display on the screen. By creating pixmaps, we avoid needing new object types in the X server and can use existing X apis that take pixmaps for these objects.
The Swap extension
Once you’ve got direct rendered content in a Pixmap, you’ll want to display it on the screen. You could simply use CopyArea from the pixmap to a window, but that isn’t synchronized to the vertical retrace signal. And, the semantics of the CopyArea operation preclude us from swapping the underlying buffers around, making it more expensive than strictly necessary.
The Swap extension fills those needs. Because the DRI3 extension provides an X pixmap reference to the direct rendered content, the Swap extension doesn’t need any new object types for its operation. Instead, it talks strictly about core X objects, using X pixmaps as the source of the new data and X drawables as the destination.
The core of the Swap extension is one request — SwapRegion. This request moves pixels from a pixmap to a drawable. It uses an X fixes Region object to specify the area of the destination being painted, and an offset within the source pixmap to align the two areas.
A bunch of data are included in the reply from the SwapRegion request. First, you get a 64-bit sequence number identifying the swap itself. Then, you get a suggested geometry for the next source pixmap. Using the suggested geometry may result in performance improvements from the techniques described in the LCA talk above.
The last bit of data included in the SwapRegion reply is a list of pixmaps which were used as source operands to earlier SwapRegion requests to the same drawable. Each pixmap is listed along with a 64-bit sequence number (the ‘idle swap count’) naming the earlier SwapRegion operation which produced the contents that the pixmap now holds. OK, so that sounds really confusing. Some examples are probably necessary.
If the SwapRegion operation was implemented by copying data out of the source pixmap into the destination drawable, then the idle swap count will be equal to the swap count from this SwapRegion operation.
If the SwapRegion operation was implemented by swapping the destination contents with the source contents, then the idle swap count will be equal to the previous swap count on the destination drawable.
I’m hoping you’ll be able to tell that in both cases, the idle swap count tries to name the swap sequence at which time the destination drawable contained the contents currently in the pixmap.
Note that even if the SwapRegion is implemented as a Copy operation, the provided source pixmap may not be included in the idle list as the copy may be delayed to meet the synchronization requirements specified by the client.
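The two cases above can be written down as a tiny function. This is only a sketch of the rule as described here, not protocol or server code; the enum, the function name and its parameters are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: how the server carried out a SwapRegion request. */
enum swap_method {
    SWAP_BY_COPY,       /* source pixels were copied into the drawable */
    SWAP_BY_EXCHANGE    /* source and destination buffers were exchanged */
};

/*
 * Return the idle swap count reported for the source pixmap: the swap
 * sequence at which the drawable held the contents the pixmap holds now.
 * previous_swap is the last swap count on the drawable before this one.
 */
uint64_t
idle_swap_count(enum swap_method method, uint64_t this_swap,
                uint64_t previous_swap)
{
    assert(previous_swap < this_swap);

    if (method == SWAP_BY_COPY)
        return this_swap;       /* pixmap still holds the frame just shown */
    return previous_swap;       /* pixmap now holds the frame it replaced */
}
```

With swap 7 following swap 6 on a drawable, a copy reports 7 for the source pixmap, while an exchange reports 6.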
Finally, if you want to throttle rendering based upon when frames appear on the screen, Swap offers an event that can be delivered to the drawable after the operation actually takes place.
Because the Swap extension needs to supply all of the OpenGL SwapBuffers semantics (including a multiplicity of OpenGL extensions related to that), I’ve stolen a handful of DRI2 requests to provide the necessary bits for that:
- SwapGetMSC
- SwapWaitMSC
- SwapWaitSBC
These work just like the DRI2 requests of the same names.
Current State of the Extensions
Both of these extensions have an initial protocol specification written down and stored in git.
FD passing for DRI.Next
Using the DMA-BUF interfaces to pass DRI objects between the client and server, as discussed in my previous blog posting on DRI-Next, requires that we successfully pass file descriptors over the X protocol socket.
Rumor has it that this has been tried and found to be difficult, and so I decided to do a bit of experimentation to see how this could be made to work within the existing X implementation.
(All of the examples shown here are licensed under the GPL, version 2 and are available from git://keithp.com/git/fdpassing)
Basics of FD passing
The kernel internals that support FD passing are actually quite simple: POSIX already requires that two processes be able to share the same underlying reference to a file because of the semantics of the fork(2) call. Adding some ability to share arbitrary file descriptors between two processes, then, is far more about how you ask the kernel than the actual file descriptor sharing operation.
In Linux, file descriptors can be passed through local network sockets. The sender constructs a mystic-looking sendmsg(2) call, placing the file descriptor in the control field of that operation. The kernel pulls the file descriptor out of the control field, allocates a file descriptor in the target process which references the same file object and then sticks the file descriptor in a queue for the receiving process to fetch.
The receiver then constructs a matching call to recvmsg that provides a place for the kernel to stick the new file descriptor.
A helper API for testing
I first wrote a stand-alone program that created a socketpair, forked and then passed an fd from the parent to the child. Once that was working, I decided that some short helper functions would make further testing a whole lot easier.
Here’s a function that writes some data and an optional file descriptor:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Write buflen bytes from buf to sock. If fd is not -1, pass it along
 * with the data as SCM_RIGHTS ancillary data.
 */
ssize_t
sock_fd_write(int sock, void *buf, ssize_t buflen, int fd)
{
    ssize_t         size;
    struct msghdr   msg;
    struct iovec    iov;
    union {
        struct cmsghdr  cmsghdr;    /* forces suitable alignment */
        char            control[CMSG_SPACE(sizeof (int))];
    } cmsgu;
    struct cmsghdr  *cmsg;

    iov.iov_base = buf;
    iov.iov_len = buflen;

    msg.msg_name = NULL;
    msg.msg_namelen = 0;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    if (fd != -1) {
        msg.msg_control = cmsgu.control;
        msg.msg_controllen = sizeof(cmsgu.control);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(sizeof (int));
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;

        printf ("passing fd %d\n", fd);
        *((int *) CMSG_DATA(cmsg)) = fd;
    } else {
        msg.msg_control = NULL;
        msg.msg_controllen = 0;
        printf ("not passing fd\n");
    }

    size = sendmsg(sock, &msg, 0);

    if (size < 0)
        perror ("sendmsg");
    return size;
}
And here’s the matching receiver function:
/*
 * Read up to bufsize bytes from sock into buf. If fd is non-NULL, use
 * recvmsg so that a passed file descriptor can be captured; *fd is set
 * to the received descriptor, or -1 if none arrived. If fd is NULL, a
 * plain read is used instead.
 */
ssize_t
sock_fd_read(int sock, void *buf, ssize_t bufsize, int *fd)
{
    ssize_t     size;

    if (fd) {
        struct msghdr   msg;
        struct iovec    iov;
        union {
            struct cmsghdr  cmsghdr;
            char            control[CMSG_SPACE(sizeof (int))];
        } cmsgu;
        struct cmsghdr  *cmsg;

        iov.iov_base = buf;
        iov.iov_len = bufsize;

        msg.msg_name = NULL;
        msg.msg_namelen = 0;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgu.control;
        msg.msg_controllen = sizeof(cmsgu.control);
        size = recvmsg (sock, &msg, 0);
        if (size < 0) {
            perror ("recvmsg");
            exit(1);
        }
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
            if (cmsg->cmsg_level != SOL_SOCKET) {
                fprintf (stderr, "invalid cmsg_level %d\n",
                         cmsg->cmsg_level);
                exit(1);
            }
            if (cmsg->cmsg_type != SCM_RIGHTS) {
                fprintf (stderr, "invalid cmsg_type %d\n",
                         cmsg->cmsg_type);
                exit(1);
            }

            *fd = *((int *) CMSG_DATA(cmsg));
            printf ("received fd %d\n", *fd);
        } else
            *fd = -1;
    } else {
        size = read (sock, buf, bufsize);
        if (size < 0) {
            perror("read");
            exit(1);
        }
    }
    return size;
}
With these two functions, I rewrote the simple example as follows:
void
child(int sock)
{
    int     fd;
    char    buf[16];
    ssize_t size;

    sleep(1);
    for (;;) {
        size = sock_fd_read(sock, buf, sizeof(buf), &fd);
        if (size <= 0)
            break;
        printf ("read %zd\n", size);
        if (fd != -1) {
            /* The passed fd is the parent's stdout */
            write(fd, "hello, world\n", 13);
            close(fd);
        }
    }
}

void
parent(int sock)
{
    ssize_t size;
    int     fd;

    fd = 1;     /* pass our stdout to the child */
    size = sock_fd_write(sock, "1", 1, fd);
    printf ("wrote %zd\n", size);
}

int
main(int argc, char **argv)
{
    int sv[2];
    int pid;

    if (socketpair(AF_LOCAL, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        exit(1);
    }
    switch ((pid = fork())) {
    case 0:
        close(sv[0]);
        child(sv[1]);
        break;
    case -1:
        perror("fork");
        exit(1);
    default:
        close(sv[1]);
        parent(sv[0]);
        break;
    }
    return 0;
}
Experimenting with multiple writes
I wanted to know what would happen if multiple writes were made, some with file descriptors and some without. So I changed the simple example parent function to look like:
void
parent(int sock)
{
    ssize_t size;
    int     fd;

    fd = 1;     /* our stdout */
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %zd without fd\n", size);
    size = sock_fd_write(sock, "1", 1, fd);
    printf ("wrote %zd with fd\n", size);
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %zd without fd\n", size);
}
When run, this demonstrates that the reader gets two bytes in the first read along with a file descriptor followed by one byte in a second read, without a file descriptor. This demonstrates that a file descriptor message forms a barrier within the socket; multiple messages will be merged together, but not past a message containing a file descriptor.
Reading without accepting a file descriptor
What happens when the reader isn’t expecting a file descriptor? Does it just get lost? Does the reader not get the message until it asks for the file descriptor? What about the boundary issue described above?
Here’s my test case:
void
child(int sock)
{
    int     fd;
    char    buf[16];
    ssize_t size;

    sleep(1);
    /* First read uses plain read(2), so it cannot accept an fd */
    size = sock_fd_read(sock, buf, sizeof(buf), NULL);
    if (size <= 0)
        return;
    printf ("read %zd\n", size);
    /* Second read uses recvmsg and can accept an fd */
    size = sock_fd_read(sock, buf, sizeof(buf), &fd);
    if (size <= 0)
        return;
    printf ("read %zd\n", size);
    if (fd != -1) {
        write(fd, "hello, world\n", 13);
        close(fd);
    }
}

void
parent(int sock)
{
    ssize_t size;

    /* Both writes carry an fd: stdout first, then stderr */
    size = sock_fd_write(sock, "1", 1, 1);
    printf ("wrote %zd with fd 1\n", size);
    size = sock_fd_write(sock, "1", 1, 2);
    printf ("wrote %zd with fd 2\n", size);
}
This shows that the first passed file descriptor is picked up by the first sock_fd_read call, but because that call uses a plain read, the file descriptor is closed. The second file descriptor passed is picked up by the second sock_fd_read call.
Zero-length writes
Can a file descriptor be passed without sending any data?
void
parent(int sock)
{
    ssize_t size;
    int     fd;

    fd = 1;     /* our stdout */
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %zd without fd\n", size);
    /* Zero bytes of data, but with an fd attached */
    size = sock_fd_write(sock, NULL, 0, fd);
    printf ("wrote %zd with fd\n", size);
    size = sock_fd_write(sock, "1", 1, -1);
    printf ("wrote %zd without fd\n", size);
}
And the answer is clearly “no” — the file descriptor is not passed when no data are included in the write.
A summary of results
read and recvmsg don’t merge data across a file descriptor message boundary.
failing to accept an fd in the receiver results in the fd being closed by the kernel.
a file descriptor must be accompanied by some data.
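Given the last result, one way to pass a bare file descriptor is to always attach a single placeholder byte. Here is a minimal sketch of that idea; the two function names are hypothetical, and the receiver assumes the sender used the matching helper.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer large enough for one fd, suitably aligned. */
union fd_cmsg {
    struct cmsghdr  align;
    char            control[CMSG_SPACE(sizeof (int))];
};

/* Pass fd over sock along with one placeholder byte. 0 on success, -1 on error. */
int
send_fd_with_dummy_byte(int sock, int fd)
{
    char            dummy = 0;
    struct iovec    iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg   cmsgu;
    struct msghdr   msg;
    struct cmsghdr  *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&msg, 0, sizeof (msg));
    memset(&cmsgu, 0, sizeof (cmsgu));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgu.control;
    msg.msg_controllen = sizeof (cmsgu.control);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_len = CMSG_LEN(sizeof (int));
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    memcpy(CMSG_DATA(cmsg), &fd, sizeof (fd));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Read the placeholder byte and return the fd that came with it, or -1. */
int
recv_fd_with_dummy_byte(int sock)
{
    char            dummy;
    struct iovec    iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg   cmsgu;
    struct msghdr   msg;
    struct cmsghdr  *cmsg;
    int             fd = -1;

    memset(&msg, 0, sizeof (msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgu.control;
    msg.msg_controllen = sizeof (cmsgu.control);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len == CMSG_LEN(sizeof (int)))
        memcpy(&fd, CMSG_DATA(cmsg), sizeof (fd));
    return fd;
}
```

The receiver has to know that the byte is padding and throw it away, which is easy in a test program but would need to be part of the protocol definition in a real extension.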
Make X pass file descriptors
I’d like to get X to pass a file descriptor without completely rewriting the internals of both the library and the X server. Ideally, without making any changes to the existing code paths for regular request processing at all.
On the sending side, this seems pretty straightforward: we just need to get the X connection file descriptor and call sendmsg directly, passing the desired file descriptor along. In XCB, this could be done by using the xcb_take_socket interface to temporarily hijack the protocol as Xlib does.
It’s the receiving side where things are messier. Because a bare read will discard any delivered file descriptor, we must make sure to use recvmsg whenever we want to actually capture the file descriptor.
Kludge X server fd receiving
Because a passed fd creates a barrier in the bytestream, when the X server reads requests from a client, the read will stop returning data after the message with the file descriptor is consumed.
Of course, this process consumes the passed file descriptor, and if that call isn’t made with recvmsg set up to receive it, the fd will be lost.
As a simple kludge, if we pass a meaningless fd with the X request and then the ‘real’ fd with a following XNoOperation request, the existing request reading code will get the request, discard the meaningless fd and then stop reading at that point due to the barrier. Once into the request processing code, recvmsg can be called to get the real file descriptor and the associated XNoOperation request.
I wrote a test for this that demonstrates how this works:
#include <stdint.h>
#include <fcntl.h>

static void
child(int sock)
{
    uint8_t     xreq[1024];
    uint8_t     xnop[4];
    uint8_t     req;
    int         i, reqlen;
    ssize_t     size;
    int         fd = -1;
    int         j;

    sleep (1);
    for (j = 0;; j++) {
        /* Plain read: any fd attached to this data is discarded */
        size = sock_fd_read(sock, xreq, sizeof (xreq), NULL);
        printf ("got %zd\n", size);
        if (size == 0)
            break;
        i = 0;
        while (i < size) {
            req = xreq[i];
            reqlen = xreq[i+1];
            i += reqlen;
            switch (req) {
            case 0:     /* regular request */
                break;
            case 1:     /* 'pass an fd' request; the real fd follows */
                if (i != size) {
                    fprintf (stderr, "Got fd req, but not at end of input %d < %zd\n",
                             i, size);
                }
                sock_fd_read(sock, xnop, sizeof (xnop), &fd);
                if (fd == -1) {
                    fprintf (stderr, "no fd received\n");
                } else {
                    FILE    *f = fdopen (fd, "w");
                    fprintf(f, "hello %d\n", j);
                    fflush(f);
                    fclose(f);  /* also closes fd */
                    fd = -1;
                }
                break;
            case 2:     /* the no-op carrying the real fd, read out of order */
                fprintf (stderr, "Unexpected FD passing req\n");
                break;
            }
        }
    }
}

int
tmp_file(int j) {
    char    name[64];

    snprintf (name, sizeof (name), "tmp-file-%d", j);
    return creat(name, 0666);
}

static void
parent(int sock)
{
    uint8_t xreq[32];
    uint8_t xnop[4];
    int     i, j;
    int     fd;

    for (j = 0; j < 4; j++) {
        /* Write a bunch of regular requests */
        for (i = 0; i < 8; i++) {
            xreq[0] = 0;
            xreq[1] = sizeof (xreq);
            sock_fd_write(sock, xreq, sizeof (xreq), -1);
        }
        /* Write our 'pass an fd' request with a 'useless' FD to block the receiver */
        xreq[0] = 1;
        xreq[1] = sizeof(xreq);
        sock_fd_write(sock, xreq, sizeof (xreq), 1);
        /* Pass an fd */
        xnop[0] = 2;
        xnop[1] = sizeof (xnop);
        fd = tmp_file(j);
        sock_fd_write(sock, xnop, sizeof (xnop), fd);
        close(fd);
    }
}
Fixing XCB to receive file descriptors
Multiple threads may be trying to get replies and events back from the X server at the same time, which means the kludge of having the real fd follow the message will likely lead to the wrong thread getting the file descriptor.
Instead, I suspect the best plan will be to fix XCB to internally capture passed file descriptors and save them with the associated reply. Because the file descriptor message will form a barrier in the read stream, xcb can associate any received file descriptor with the last reply in the read data. The X server would then send the reply with an explicit sendmsg call to pass both reply and file descriptor together.
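To make that association concrete, here is a rough sketch of how a reader could find the last reply in a chunk of data returned by one recvmsg call, so that a captured fd could be saved with it. This is not XCB code; last_reply_offset is a hypothetical name. It relies only on the core X11 wire format, where errors, replies and events are 32 bytes, replies have type 1 and carry a count of additional 4-byte units in bytes 4 through 7, and it assumes the buffer is in host byte order. Variable-length generic events are ignored for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X_PACKET_SIZE   32  /* fixed part of every error, reply and event */
#define X_REPLY_TYPE    1   /* first byte of a reply */

/*
 * Walk the packets in buf and return the offset of the last complete
 * reply, or -1 if there is none. A trailing partial packet is ignored.
 */
long
last_reply_offset(const uint8_t *buf, size_t len)
{
    size_t  off = 0;
    long    last = -1;

    assert(buf != NULL || len == 0);
    while (len - off >= X_PACKET_SIZE) {
        size_t  pkt = X_PACKET_SIZE;

        if (buf[off] == X_REPLY_TYPE) {
            uint32_t extra;     /* additional data, in 4-byte units */

            memcpy(&extra, buf + off + 4, sizeof (extra));
            pkt += (size_t) extra * 4;
        }
        if (pkt > len - off)
            break;              /* packet not fully read yet */
        if (buf[off] == X_REPLY_TYPE)
            last = (long) off;
        off += pkt;
    }
    return last;
}
```

Because the fd message forms a barrier, the reply that owns the fd would be the one at the returned offset, provided the server sent reply and fd in a single sendmsg call.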
Next steps
The next thing to do is code up a simple fd passing extension and try to get it working, passing descriptors back and forth to the X server. Once that works, design of the rest of the DRI-Next extension should be pretty straightforward.
Thoughts about DRI.Next
On the way to the X Developer’s Conference in Nuremberg, Eric and I chatted about how the DRI2 extension wasn’t really doing what we wanted. We came up with some fairly rough ideas and even held an informal “presentation” about it.
We didn’t have slides that day, having come up with the content for the presentation in the hours just before the conference started. This article is my attempt to capture both that discussion and further conversations held over roast pork dinners that week.
A brief overview of DRI2
Here’s a list of the three things that DRI2 currently offers.
Application authentication.
The current kernel DRM authentication mechanism restricts access to the GPU to applications connected to the DRM master. DRI2 implements this by having the application request the DRM cookie from the X server which can then be passed to the kernel to gain access to the device.
This is fairly important because once given access to the GPU, an application can access any flink’d global buffers in the system. Given that the application sends screen data to the X server using flink’d buffers, that means all screen data is visible to any GPU-accessing application. This bypasses any GPU hardware access controls.
Allocating buffers.
DRI2 defines a set of ‘attachment points’ for buffers which can be associated with an X drawable. An application needing a specific set of buffers for a particular rendering operation makes a request of the X server which allocates the buffers and passes back their flink names.
The server automatically allocates new buffers when window sizes change, sending an event to the application so that it knows to request the new buffers at some point in the future.
Presenting data to the user.
The original DRI2 protocol defined only the DRI2CopyRegion request which copied data between the allocated buffers. SwapBuffers was implemented by simply copying data from the back buffer to the front buffer. This didn’t provide any explicit control over frame synchronization, so a new request, DRI2SwapBuffers, was added to expose controls for that. This new request only deals with the front and back buffers, and either copies from back to front or exchanges those two buffers.
Along with DRI2SwapBuffers, there are new requests that wait for various frame counters and expose those to GL applications through the OML_sync_control extension.
What’s wrong with DRI2?
DRI2 fixed a lot of the problems present with the original DRI extension, and made reliable 3D graphics on the Linux desktop possible. However, in the four years since it was designed, we’ve learned a lot, and the graphics environment has become more complex. Here’s a short list of some DRI2 issues that we’d like to see fixed.
InvalidateBuffers events. When the X window size changes, the buffers created by the X server for rendering must change size to match. The problem is that the client is presumably drawing to the old buffers when the new ones are allocated. Delivering an event to the client is supposed to make it possible for the client to keep up, but the reality is that the event is delivered at some random time to some random thread within the application. This leads to general confusion within the application, and often results in a damaged frame on the screen. Fortunately, applications tend to draw their contents often, so the damaged frame only appears briefly.
No information about new back buffer contents. When a buffer swap happens and the client learns about the new back buffer, the back buffer contents are always undefined. For most applications, this isn’t a big deal as they’re going to draw the whole window. However, compositing managers really want to reduce rendering by only repainting small damaged areas of the window. Knowing what previous frame contents are present in the back buffer allows the compositing manager to repaint just the affected area.
Un-purgable stale buffers. Between the X server finishing with a buffer and the client picking it up for a future frame, we don’t need to save the buffer contents and should mark the buffer as purgable. With the current DRI2 protocols, this can’t be done, which leaves all of those buffers hanging around in memory.
Driver-specific buffers. The DRI2 buffer handles are device specific, and so we can’t use buffers from other devices on the screen. External video decoders, cameras and encoders can’t be used with the DRI2 extension.
GEM flink has lots of issues. The flink names are global, allowing anyone with access to the device to access the flink data contents. There is also no reference to the underlying object, so the X server and client must carefully hold references to GEM objects during various operations.
Proposed changes for DRI.Next
Given the three basic DRI2 operations (authentication, allocation, presentation), how can those be improved?
Eliminate DRI/DRM magic-cookie based authentication
Kristian Høgsberg, Martin Peres, Timothée Ravier & Daniel Vetter gave a talk on DRM2 authentication at XDC this year that outlined the problems with the current DRM access control model and proposed some fairly simple solutions, including using separate device nodes—one for access to the GPU execution environment and a separate, more tightly controlled one, for access to the display engine.
Combine that with the elimination of flink for communicating data between applications, and there isn’t a need for the current magic-cookie based authentication mechanism; simple file permissions should suffice to control access to the GPU.
Of course, this ignores the whole memory protection issue when running on a GPU that doesn’t provide access control, but we already have that problem today, and this doesn’t change that, other than to eliminate the global uncontrolled flink namespace.
Allocate all buffers in the application
DRI2 does buffer allocation in the X server. This ensures that multiple (presumably cooperating) applications drawing to the same window will see the same buffers, as is required by the GLX extension. We suspected that this wasn’t all that necessary, and it turns out to have been broken several years ago. This is the traditional way in X to phase out undesirable code, and provides an excellent opportunity to revisit the original design.
Doing buffer allocations within the client has several benefits:
No longer need DRI2 additions to manage new GL buffers. Adding HiZ to the intel driver required new DRI2 code in the X server, even though X wasn’t doing anything with those buffers at all.
Eliminate some X round trips currently required for GL buffer allocation.
Knowing what’s in each buffer. Because the client allocates each buffer, it can track the contents of them.
Size tracking is trivial. The application sends GL the size of the viewport, and the union of all viewports should be the same as the size of the window (or there will be undefined contents on the screen). The driver can use the viewport information to size the buffers and ensure that every frame on the screen is complete.
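As a rough illustration of the size tracking point, a driver could take the bounding box of the viewports it has been given and size its buffers from that. This is a sketch only; struct rect and viewport_bounds are hypothetical names, and a bounding box is a simplification of a true union.

```c
#include <assert.h>
#include <stddef.h>

struct rect {
    int x, y, width, height;
};

/* Return the smallest rectangle covering all of the given viewports. */
struct rect
viewport_bounds(const struct rect *viewports, int count)
{
    struct rect bounds;
    int         i, x2, y2;

    assert(viewports != NULL && count > 0);
    bounds = viewports[0];
    x2 = bounds.x + bounds.width;
    y2 = bounds.y + bounds.height;
    for (i = 1; i < count; i++) {
        const struct rect *v = &viewports[i];

        if (v->x < bounds.x)
            bounds.x = v->x;
        if (v->y < bounds.y)
            bounds.y = v->y;
        if (v->x + v->width > x2)
            x2 = v->x + v->width;
        if (v->y + v->height > y2)
            y2 = v->y + v->height;
    }
    bounds.width = x2 - bounds.x;
    bounds.height = y2 - bounds.y;
    return bounds;
}
```

Two side-by-side 400x600 viewports would give an 800x600 buffer, matching the window they are meant to fill.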
Present buffers through DMA-buf
The new DMA-buf infrastructure provides a cross-driver/cross-process mechanism for sharing blobs of data. DMA-buf provides a way to take a chunk of memory used by one driver and pass it to another. It also allows applications to create file descriptors that reference these objects.
For our purposes, it’s the file descriptor which is immediately useful. This provides a reliable and secure way to pass a reference to an underlying graphics buffer from the client to the X server by sending the file descriptor over the local X socket.
An additional benefit is that we get automatic integration of data from other devices in the system, like video decoders or non-primary GPUs. The ‘Prime’ support added in DRI version 2.8 hacks around this by sticking a driver identifier in the driverType value.
Once the buffer is available to the X server, we can create a request much like the current DRI2SwapBuffers request, except instead of implicitly naming the back and front buffers, we can pass an arbitrary buffer and have those contents copied or swapped to the drawable.
We also need a way to copy a region into the drawable. I don’t know if that needs the same level of swap control, but it seems like it would be nice. Perhaps the new SwapBuffers request could take a region and offset as well, copying data when swapping isn’t possible.
Managing buffer allocations
One trivial way to use this new buffer allocation mechanism would be to have applications allocate a buffer, pass it to the X server and then simply drop their reference to it. The X server would keep a reference until the buffer was no longer in use, at which point the buffer memory would be reclaimed.
However, this would eliminate a key optimization in current drivers: the ability to re-use buffers instead of freeing and allocating new ones. Re-using buffers takes advantage of the work necessary to set up the buffer, including constructing page tables, allocating GPU memory space and flushing caches.
Notifying the application of idle buffers
Once the X server is finished using a buffer, it needs to notify the application so that the buffer can be re-used. We could send these notifications in X events, but that ends up in the twisty mess of X client event handling which has already caused so much pain with Invalidate events. The obvious alternative is to send them back in a reply. That nicely controls where the data are delivered, but causes the application to block waiting for the X server to send the reply.
Fortunately, applications already want to block when swapping buffers so that they get throttled to the swap buffers rate. That is currently done by having them wait for the DRI2SwapBuffers reply. This provides a nice place to stick the idle buffer data. We can simply list buffers which have become idle since the last SwapBuffers reply was delivered.
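On the client side, handling that idle list could look something like the following sketch of a small buffer pool. Everything here is hypothetical (the pool size, the structure layout, the function names and the use of 32-bit ids to name buffers); it only shows the bookkeeping, not any real protocol or driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4

struct buffer {
    uint32_t    id;     /* hypothetical name for the buffer */
    bool        busy;   /* true while the X server may be using it */
};

struct buffer_pool {
    struct buffer bufs[POOL_SIZE];
};

/* Start with every buffer idle, named 1..POOL_SIZE. */
void
pool_init(struct buffer_pool *pool)
{
    size_t i;

    assert(pool != NULL);
    for (i = 0; i < POOL_SIZE; i++) {
        pool->bufs[i].id = (uint32_t) (i + 1);
        pool->bufs[i].busy = false;
    }
}

/* Pick an idle buffer to render into and hand to the server; NULL if none. */
struct buffer *
pool_acquire(struct buffer_pool *pool)
{
    size_t i;

    for (i = 0; i < POOL_SIZE; i++) {
        if (!pool->bufs[i].busy) {
            pool->bufs[i].busy = true;
            return &pool->bufs[i];
        }
    }
    return NULL;
}

/* Mark the buffers listed in a swap reply as idle so they can be re-used. */
void
pool_release_idle(struct buffer_pool *pool, const uint32_t *idle_ids,
                  size_t n_idle)
{
    size_t i, j;

    for (i = 0; i < n_idle; i++)
        for (j = 0; j < POOL_SIZE; j++)
            if (pool->bufs[j].id == idle_ids[i])
                pool->bufs[j].busy = false;
}
```

The application would call pool_release_idle with the list from each swap reply before asking pool_acquire for the next back buffer, so a buffer the server has finished with gets picked up again instead of a fresh one being allocated.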
Releasing buffer memory
Applications which update only infrequently end up with a back buffer allocated after their last frame which can’t be freed by the system. The fix for this is to mark the buffer purgable, but that can only be done after all users of the buffer are finished with it.
With this new buffer management model, the application effectively passes ownership of its buffers to the X server, and the X server knows when all uses of the buffer are finished. It could mark buffers as purgable at that point. When the buffer was sent back in the SwapBuffers reply, the application would be able to ask the kernel to mark it un-purgable again.
A new extension? Or just a new DRI2 version?
If we eliminate the authentication model and replace the buffer allocation and presentation interfaces, what of the existing DRI2 protocol remains useful? The only remaining bits are the other synchronization requests: DRI2GetMSC, DRI2WaitMSC, DRI2WaitSBC and DRI2SwapInterval.
Given this, does it make more sense to leave DRI2 as it is and plan on deprecating, and eventually eliminating, it?
Doing so would place a support burden on existing applications, as they’d need to have code to use the right extension for the common requests. They’ll already need to support two separate buffer management versions though, so perhaps this burden isn’t that onerous?
Getting Hotplugging working with a DisplayLink USB to DVI adapter
I merged Dave Airlie’s randr provider patches to the server and pushed them out in preparation for freezing the X server for the version 1.13 release. Before freezing, I figured I should at least test hotplugging my DisplayLink adapter. Well, that took all day…
Sources
For the impatient, here’s where all of the bits are. These work on my machine.
Upstream bits:
- randrproto. git://anongit.freedesktop.org/git/xorg/proto/randrproto master
Bits from Dave Airlie’s trees:
libdrm. git://people.freedesktop.org/~airlied/drm.git prime
libXrandr. git://people.freedesktop.org/~airlied/libXrandr.git prime
xrandr. git://people.freedesktop.org/~airlied/xrandr.git prime
xf86-video-intel. git://people.freedesktop.org/~airlied/xf86-video-intel prime
Bits from my trees:
xf86-video-modesetting. git://people.freedesktop.org/~keithp/xf86-video-modesetting prime
kernel. git://people.freedesktop.org/~keithp/linux prime
X server. git://people.freedesktop.org/~keithp/xserver master
Kernel adventures
The 3.5-rc6 bits have all of the necessary driver changes to support DisplayLink hot-plug. At least, they do if you’re not running 32-bit user space atop a 64-bit kernel. If you are, then most of the DRM drivers don’t work at all—they’re missing the ‘.compat_ioctl’ entry in the file_operations structure, which makes all ioctl calls return -ENOTTY. In particular, the udl driver is missing this entry, so when you plug in the device, the server tries to talk to it and nothing works.
X server adventures
Once I got the kernel working, I discovered some nice crashing behaviour in the X server. If I started the server with the DisplayLink device plugged in, as soon as I tried to enable the output, the server would segfault. Turns out this case requires special code in the server to ensure that the DisplayLink screen privates are adjusted correctly while the intel driver loads.
Modesetting driver fixup
The modesetting driver was missing a rename that happened to a structure member in the last couple of days. I’d bet Dave has the same patch sitting on a disk somewhere.
xrandr changes
The command line additions to xrandr are ‘sufficient’ to get this stuff working, but I think it could use some improvements to make the interface a bit friendlier. In particular, requiring you to use XIDs to identify providers is a bit harsh.
Schedules
I’ll be pushing the X server changes out this evening; review would be appreciated, but I don’t think any of the patches I made there are scary. I sent the kernel patches necessary to lkml, but with Dave in transit, I’m not sure who is minding the store.
As for the rest of the bits, they’re sitting in Dave’s repositories; presumably they’ll get pushed upstream soon.
But, does it actually work?
Almost. Everything says it’s working, but I’m not getting any signal to my DVI monitor. My DisplayLink device is ancient, and the kernel complains about it, saying
[drm:udl_parse_vendor_descriptor] *ERROR* Unrecognized vendor firmware descriptor
So close, and yet…
Fixing the Sandybridge MacBook Air display initialization
We left our hero with Debian installed on the MacBook Air, but with the display getting scrambled as soon as the i915 driver loaded.
As was reported to Matthew, the problem is as simple as a lack of the right mode for the eDP panel in the machine. This mode is supposed to come from the panel EDID data, but for some reason the driver wasn’t able to query the EDID data, and so it decided to try some random panel timings it dug out of the VBT tables, which are generally supposed to be used by LVDS panels. Apple helpfully stuck valid data there, but for some other panel — one that is 1280x800 pixels instead of the 1366x768 pixel panel in the MacBook Air.
I heard rumors that some machines would get a black screen when the i915 driver loaded. I was fortunate — my machine simply displayed a 1366x768 subset of the programmed 1280x800 mode. A bit of garbage on the right side, and a few scanlines missing at the bottom. Quite workable, especially after I ran fbset -yres 768 to keep the console in the visible portion of the screen.
DDC failure
Looking through the kernel logs, the Intel driver tries to access the EDID data and times out, as if DDC is just completely broken. This is rather unexpected; the eDP spec says that the panel is required to support DDC and provide EDID. Now, we’ve seen a lot of panels which don’t quite live up to the rigorous eDP specifications, but it’s a bit surprising from Apple, who generally do VESA stuff pretty well.
We’ve heard reports about panels reporting invalid EDID data, or EDID data which didn’t actually match the panel (causing us to prefer the VBT data on LVDS machines). But I’ve not heard of an eDP panel which didn’t have anything hanging off of the DDC channel.
But X works fine?
During early debugging, I happened to start X up. Much to my surprise, X came up at the native 1366x768 mode. Digging through the kernel logs after that, I discovered that EDID was successfully fetched from the eDP panel while X started up.
At this point, I knew it was all downhill — the EDID data was present, it just wasn’t getting picked up during the early part of the driver initialization when the console mode is initialized.
eDP power management
The CPU is given complete control over the power management of the eDP panel; sequencing through various power states and waiting appropriate amounts of time when things change. Given the goal of keeping power usage as low as possible, this makes a huge amount of sense.
The eDP spec is quite clear though: without power, the panel will not respond to anything over the aux channel, and that includes EDID data. The eDP panel power hardware in the Sandybridge chip has a special mode for dealing with this requirement. If the panel is not displaying data, you can supply power for the aux channel stuff by setting a magic bit in the panel power registers.
When initializing the frame buffer, the kernel driver turns off the panel completely so that it has all of the hardware in a known state (yeah, this is not optimal, but that’s another bug). When X started, the panel was already running with the console mode.
Given the difference between these two states (EDID querying with the panel off failed, while EDID querying with the panel on worked), it seemed pretty clear that the panel power wasn’t getting managed correctly: the magic ‘power the panel’ bit wasn’t getting turned on at the right times.
Getting the power turned on.
I stuck a check inside all of the aux channel communication functions to see where things were broken. This pointed out several places missing the panel power calls. This wasn’t quite sufficient to get EDID data flowing. The remaining problem was that the code wasn’t waiting long enough after turning the panel power on before starting the aux channel communication. A few msleep calls and huzzah! EDID at boot time and the console had the right mode.
Making it faster
However, it turns out that the driver does this a lot, and the msleeps required were fairly long — the eDP panel wants a 500ms delay from turning the panel power off before you can turn it back on.
I fixed this by simply delaying the panel power off until things were idle for a ‘long’ time. Now mode setting goes zipping through, and a few seconds later, the bit to force panel power on gets turned off.
Getting these bits for yourself
I’ve pushed out the code to my (temporary) kernel repository git://people.freedesktop.org/~keithp/linux in the fix-edp-vdd-power branch. I’d love to hear if you’ve tried this on either a MacBook Air or any other eDP machine from Ironlake onwards.
Installing Linux on Sandybridge MacBook Air
Matthew Garrett got a report from someone who bought a brand new Sandybridge MacBook Air and was trying to install Fedora on it. The screen went black as soon as the i915 driver tried to set the initial mode.
I bought one to try and help out. Just getting Linux installed turned out to be a minor adventure and I thought I’d write a few notes for the next person who comes along and tries to do this.
The obvious method
I downloaded a collection of Debian ISO images, live images, netboot images from testing and CD-1 of squeeze. I downloaded a Fedora 16 alpha XFCE live image. I burned all of these to actual CDs, and then stuck them in a CD drive and held the ‘c’ key while booting the machine.
Nothing worked. The CD would spin up, the screen would switch from white to a black text mode with a blinking cursor at the upper left, and then the CD would stop.
rEFIt attempts
rEFIt provides a boot selection menu and various configuration bits to provide for a multi-boot environment. This is necessary if you want to leave the OS X install on the disk and also install Linux.
It also lets you boot from removable media, or at least that was the story. rEFIt would list removable USB flash storage devices, but it would not list CD drives, even after the CD drive spun up and did stuff for a while. This makes me wonder if the ‘c’ key technique wasn’t actually trying to boot from the CD.
However, rEFIt wouldn’t boot from a grub2-enabled USB key.
ISO on USB flash key
What rEFIt will boot is a Fedora or Debian ISO image copied directly to the USB flash device. I also tried the Ubuntu 11.04 live image, and it did not work.
Fedora F16 Alpha
The live image came up and started X in frame buffer mode. However, when I went to install the image to the hard disk, it got stuck trying to discover devices. I waited about 20 minutes and it never finished, although it did chew up a lot of CPU time.
Debian Squeeze
I copied CD-1 of the 32-bit Squeeze distribution to my flash key. That booted up to text mode and went through the install. Nothing really unusual until it came to installing grub. I let it install grub to /dev/sda4, the new partition I had created for Linux. I figured installing grub to /dev/sda would be a bad idea (I think it would have actually worked just fine in retrospect).
A slight out-of-sync diversion here — I couldn’t get the 32-bit kernel to talk to any keyboard or USB networking device, so I eventually installed a 64-bit kernel and that works just fine. You can run a 32-bit user space with a 64-bit kernel without any trouble.
Making EFI boot
Naturally, I couldn’t get the machine to boot to Linux after this. rEFIt, the multi-boot tool for Mac OS X, appears only interested in booting EFI images, or at least was not interested in booting my BIOS-based Grub2 installation on /dev/sda4.
I used the ‘rescue’ mode on the USB flash drive to get back to my installed Linux image and installed grub-efi-amd64. The OS X EFI version does not support 32-bit EFI code, which seems entirely reasonable to me.
What took a while was figuring out that all of the grub files need to be readable through EFI, and that the Mac EFI code can only read FAT (and HFS?) partitions.
So, the trick appears to be to stick the grub files on the Mac OS X efi partition (/dev/sda1). I created a directory over there, /EFI/grub and then mounted that as /efi and made a symlink from /boot/grub to /efi/EFI/grub. Now, grub-install sticks the files in the right place, and update-grub will even place grub.cfg in the directory where it needs to go.
At this point, rEFIt will happily find the core.efi and grub.efi files and show them in the list of possible operating systems to boot. I don’t know why there are both core.efi and grub.efi files; I’ve only ever picked ‘grub.efi’ and that has worked nicely.
Broadcom WiFi
I haven’t gotten that working yet; the b43 supported chip list says it works in kernel 3.1. I’m busy building a 3.1-rc4 kernel as I type this, so perhaps that will ‘just work’ when I boot that.
i915 KMS
The whole reason for doing the install is to fix problems with the eDP EDID discovery which is currently preventing the machine from running X. I’ll be working on that next; a quick kludge to just read the current mode from the device should get something on the screen (we already do that for LVDS panels). Beyond that, I’ll be trying to figure out whether I can make EDID actually work, or whether it’s just not available in this machine.
Somehow the kernel frame buffer ended up as 1280x800, which was quite annoying. As a temporary kludge, I ran fbset -yres 768 to at least avoid having text disappear off the bottom of the screen. Once the panel size is discovered correctly, that should be fixed.
Status
This is sure not like installing Linux on a regular PC. However, I’m reasonably hopeful that the devices in the machine will be working pretty well under Debian within the next few days.