Reworking Intel Glamor
The original Intel driver Glamor support was based on the notion that it would be better to have the Intel driver capture any fallbacks and try to make them faster than Glamor could do internally. Now that Glamor has reasonably complete acceleration, and its fallbacks aren't terrible, this isn't as useful as it once was, and because it uses Glamor in a weird way, it makes the Glamor code harder to maintain.
Fixing the Intel driver to not use Glamor in this way took a bit of effort; the UXA support is all tied into the overall operation of the driver.
Separating out UXA functions
The first task was just to identify which functions were UXA-specific and mark them by adding "_uxa" to their names. A couple dozen sed runs later, a bunch of the driver was looking better.
Next, a pile of UXA-specific functions were actually inside the non-UXA parts of the code. Those got moved out, and a new "intel_uxa.h" file was created to hold all of the definitions.
Finally, a few non-UXA-specific functions were actually in the UXA files; those got moved over to the generic code.
Removing the Glamor paths in UXA
Each one of the UXA functions had a little piece of code at the top like:
    if (uxa_screen->info->flags & UXA_USE_GLAMOR) {
        int ok = 0;
        if (uxa_prepare_access(pDrawable, UXA_GLAMOR_ACCESS_RW)) {
            ok = glamor_fill_spans_nf(pDrawable,
                                      pGC, n, ppt, pwidth, fSorted);
            uxa_finish_access(pDrawable, UXA_GLAMOR_ACCESS_RW);
        }
        if (!ok)
            goto fallback;
        return;
    }
Pulling those out shrank the UXA code by quite a bit.
Selecting Acceleration (or not)
The intel driver only supported UXA before; Glamor was really just a slightly different mode for UXA. I switched the driver from using a bit in the UXA flags to having an 'accel' variable which could be one of three options:
- ACCEL_GLAMOR
- ACCEL_UXA
- ACCEL_NONE
I added ACCEL_NONE to give us a dumb frame buffer mode. That mode actually supports DRI3, so we can bring up Mesa and run it under X before we have any acceleration code ready, avoiding a dependency loop when bringing up new hardware. All it requires is a kernel that offers mode setting and buffer allocation.
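The shape of that change is simple. Here's a sketch rather than the driver's exact code; the enum tag and the option-parsing helper below are illustrative:

    #include <strings.h>    /* strcasecmp */

    /* The three acceleration methods (values from the text; the
     * enum tag and the parsing helper are illustrative). */
    enum intel_accel {
        ACCEL_GLAMOR,   /* GL-based acceleration via Glamor */
        ACCEL_UXA,      /* classic UXA acceleration */
        ACCEL_NONE,     /* dumb frame buffer; still supports DRI3 */
    };

    static enum intel_accel
    intel_parse_accel(const char *name)
    {
        if (name && strcasecmp(name, "glamor") == 0)
            return ACCEL_GLAMOR;
        if (name && strcasecmp(name, "uxa") == 0)
            return ACCEL_UXA;
        return ACCEL_NONE;
    }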
Initializing Glamor
With UXA no longer supporting Glamor, it was time to plug the Glamor support into the top of the driver. That meant changing a bunch of the entry points to select appropriate Glamor or UXA functionality, instead of just calling into UXA. So, now we've got lots of places that look like:
    switch (intel->accel) {
    #if USE_GLAMOR
    case ACCEL_GLAMOR:
        if (!intel_glamor_create_screen_resources(screen))
            return FALSE;
        break;
    #endif
    #if USE_UXA
    case ACCEL_UXA:
        if (!intel_uxa_create_screen_resources(screen))
            return FALSE;
        break;
    #endif
    case ACCEL_NONE:
        if (!intel_none_create_screen_resources(screen))
            return FALSE;
        break;
    }
Using a switch means that we can easily elide code that isn't wanted in a particular build. And since 'accel' is an enum, any switch that misses one of the required cases will draw a compiler warning.
It's not all perfectly clean yet; there are still piles of UXA-only paths.
Making It Build Without UXA
The final trick was to make the driver build without UXA turned on; that took several iterations before I had the symbols sorted out appropriately.
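One common pattern for sorting out symbols like that is to compile a backend's functions only when it's enabled and provide inline stubs otherwise. A hypothetical sketch (the guard macro matches the switch example above, but the stub itself is my illustration, not necessarily what the driver does):

    /* Hypothetical: when UXA is compiled out, turn its entry points
     * into inline stubs so the rest of the driver still links. */
    #if USE_UXA
    Bool intel_uxa_create_screen_resources(ScreenPtr screen);
    #else
    static inline Bool
    intel_uxa_create_screen_resources(ScreenPtr screen)
    {
        return FALSE;   /* never reached; ACCEL_UXA can't be selected */
    }
    #endif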
To get a rough sense of size, I built the driver with various acceleration options and counted the lines of source code, simply by listing the source files named in the driver binary itself. This skips all of the header files and the render program source code, and ignores the fact that there are a bunch of #ifdefs in the uxa directory selecting between UXA, Glamor and none.
Accel           Lines   Size(B)
-------------  ------  --------
none             7143     73039
glamor           7397     76540
uxa             25979    283777
sna            118832   1303904
none legacy     14449    152480
glamor legacy   14703    156125
uxa legacy      33285    350685
sna legacy     126138   1395231
The 'legacy' addition supports i810-class hardware, which is needed for a complete driver.
Along The Way, Enable Tiling for the Front Buffer
While hacking the code, I discovered that the initial frame buffer allocated for the screen was created without tiling, because a few parameters that depend on the GTT size were not initialized until after that frame buffer was allocated. Fixing the initialization order means the front buffer can now be allocated tiled; I haven't analyzed what effect this has on performance.
Page Flipping and Resize
Page flipping (or just flipping) means switching the entire display from one frame buffer to another. It's generally the fastest way of updating the screen as you don't have to copy any bits.
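At the KMS level, queuing a flip is a single libdrm call; a sketch (drm_fd, crtc_id, new_fb_id and flip_data are placeholders for state the driver tracks):

    /* Ask the kernel to flip this CRTC to new_fb_id; it completes
     * at the next vblank and delivers a DRM event we wait on. */
    if (drmModePageFlip(drm_fd, crtc_id, new_fb_id,
                        DRM_MODE_PAGE_FLIP_EVENT, flip_data) < 0) {
        /* flip refused; fall back to copying instead */
    }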
The trick with flipping is that a client hands you a random pixmap and you need to stuff that into the KMS API. With UXA, that's pretty easy as all pixmaps are managed through the UXA API, which knows which underlying kernel BO is tied to each pixmap. Using Glamor, only the underlying GL driver knows the mapping. Fortunately (?), we have the EGL Image extension, which lets us take a random GL texture and turn it into a file descriptor for a DMA-BUF kernel object. So, we have this cute little dance:
    fd = glamor_fd_from_pixmap(screen,
                               pixmap,
                               &stride,
                               &size);
    bo = drm_intel_bo_gem_create_from_prime(intel->bufmgr, fd, size);
    close(fd);
    intel_glamor_get_pixmap(pixmap)->bo = bo;
That last bit remembers the bo in some local memory so we don't have to do this more than once for each pixmap. glamor_fd_from_pixmap ends up calling eglCreateImageKHR, followed by gbm_bo_import, and then a kernel ioctl to convert a prime handle into an fd. It's all quite roundabout, but it does seem to work just fine.
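To make that chain concrete, here's a rough sketch of the three steps; it's simplified, and egl_display, egl_context, gbm, drm_fd and texture are placeholders for state the real code tracks:

    EGLImageKHR image;
    struct gbm_bo *gbo;
    int prime_fd;

    /* 1. Wrap the pixmap's GL texture in an EGLImage. */
    image = eglCreateImageKHR(egl_display, egl_context,
                              EGL_GL_TEXTURE_2D_KHR,
                              (EGLClientBuffer) (uintptr_t) texture, NULL);

    /* 2. Import the EGLImage as a GBM buffer object. */
    gbo = gbm_bo_import(gbm, GBM_BO_IMPORT_EGL_IMAGE,
                        image, GBM_BO_USE_RENDERING);

    /* 3. Convert the GEM handle into a DMA-BUF file descriptor. */
    drmPrimeHandleToFD(drm_fd, gbm_bo_get_handle(gbo).u32,
                       DRM_CLOEXEC, &prime_fd);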
After I'd gotten Glamor mostly working, I tried a few OpenGL applications and discovered that flipping wasn't working. That turned out to have an unexpected consequence -- all full-screen applications would run flat-out instead of being limited to the frame rate. Present 'recovers' from a failed flip queue operation by immediately performing a CopyArea, not waiting for vblank. This needs to get fixed in Present by having it re-queue the CopyArea for the right time. What I did in the intel driver was to add a bunch more checks for tiling mode, pixmap stride and other things to catch pixmaps that were going to fail before the operation was queued, forcing them to fall back to CopyArea at the right time.
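Those checks amount to a pre-flip validation step. Here's a hypothetical sketch of its shape; the function name and the exact set of checks are illustrative, not the driver's actual code:

    /* Hypothetical sketch: reject a flip candidate before it's
     * queued, so Present can fall back to CopyArea at the right
     * time instead of recovering from a failed flip. */
    static Bool
    intel_can_flip(PixmapPtr front, PixmapPtr back)
    {
        /* Scanout buffers must match exactly for the kernel to
         * accept the flip: dimensions, depth and stride. */
        if (back->drawable.width  != front->drawable.width ||
            back->drawable.height != front->drawable.height)
            return FALSE;
        if (back->drawable.bitsPerPixel != front->drawable.bitsPerPixel)
            return FALSE;
        if (back->devKind != front->devKind)    /* stride mismatch */
            return FALSE;
        /* tiling-mode checks against the current scanout buffer
         * would go here as well */
        return TRUE;
    }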
The second adventure was with XRandR. Glamor has an API to fix up the screen pixmap for a new frame buffer, but that pulls the size of the frame buffer out of the pixmap instead of out of the screen. XRandR leaves the pixmap size set to the old screen size during this call; fixing that just meant getting the pixmap size set correctly before calling into Glamor. I think Glamor should be fixed to use the screen size rather than the pixmap size.
Painting Root before Mode set
The X server has generally done initialization in one order:
- Create root pixmap
- Set video modes
- Paint root window
Recently, we've added a '-background none' option to the X server, which causes it to set the root window background to none and have the driver fill in that pixmap with whatever contents were on the screen before the X server started.
In a pre-Glamor world, that was done by hacking the video driver to copy the frame buffer console contents to the root pixmap as it was created. The trouble here is that the root pixmap is created long before the upper layers of the X server are ready for drawing, so you can't use the core rendering paths. Instead, UXA had kludges to call directly into the acceleration functions.
What we really want, though, is to change the order of operations:
- Create root pixmap
- Paint root window
- Set video mode
That way, the normal root window painting operation will take care of getting the image ready before that pixmap is ever used for scanout. I can use regular core X rendering to get the original frame buffer contents into the root window, and even if we're not using -background none and are instead painting the root with some other pattern (like the root weave), I get that presented without an intervening black flash.
That turned out to be really easy -- just delay the call to I830EnterVT (which sets the modes) until the server is actually running. That required one additional kludge: I needed to tell the DIX-level RandR functions about the new modes, because the mode setting operation used during server init doesn't call up into RandR. RandR normally lists the current configuration after the screen has been initialized, which is when the modes used to be set.
Calling xf86RandR12CreateScreenResources does the trick nicely. Getting the root window bits from fbcon, setting video modes and updating the RandR/Xinerama DIX info is now all done from the BlockHandler the first time it is called.
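A simplified sketch of the shape of that change follows; the one-shot flag and the intel_copy_fb helper are illustrative names, and the wrapping boilerplate and real entry-point signatures in the driver differ in detail:

    static void
    intel_block_handler(ScreenPtr screen, void *timeout, void *read_mask)
    {
        ScrnInfoPtr scrn = xf86ScreenToScrn(screen);
        static Bool modes_set;          /* illustrative one-shot flag */

        /* (unwrap, call the wrapped BlockHandler, re-wrap, as usual) */

        if (!modes_set) {
            modes_set = TRUE;

            /* 1. Paint the root window, via ordinary core rendering,
             *    with the old frame buffer contents. */
            intel_copy_fb(scrn);        /* hypothetical helper */

            /* 2. Set the video modes: the deferred I830EnterVT call. */
            I830EnterVT(scrn);          /* signature simplified */

            /* 3. Update the DIX RandR/Xinerama information. */
            xf86RandR12CreateScreenResources(screen);
        }
    }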
Performance
I ran the current Glamor version of the intel driver against the master branch of the X server; compared with my last Glamor performance evaluation, there weren't any huge differences aside from GetImage. The reason is that UXA/Glamor never called Glamor's image functions, and the UXA GetImage is pretty slow. Using Mesa's image download turns out to have a huge performance benefit:
1: UXA/Glamor from April
2: Glamor from today
1            2                         Operation
------------ ------------------------- -------------------------
     50700.0      56300.0 (  1.110)    ShmGetImage 10x10 square
     12600.0      26200.0 (  2.079)    ShmGetImage 100x100 square
      1840.0       4250.0 (  2.310)    ShmGetImage 500x500 square
      3290.0        202.0 (  0.061)    ShmGetImage XY 10x10 square
        36.5        170.0 (  4.658)    ShmGetImage XY 100x100 square
         1.5         56.4 ( 37.600)    ShmGetImage XY 500x500 square
     49800.0      50200.0 (  1.008)    GetImage 10x10 square
      5690.0      19300.0 (  3.392)    GetImage 100x100 square
       609.0       1360.0 (  2.233)    GetImage 500x500 square
      3100.0        206.0 (  0.066)    GetImage XY 10x10 square
        36.4        183.0 (  5.027)    GetImage XY 100x100 square
         1.5         55.4 ( 36.933)    GetImage XY 500x500 square
Running today's UXA, the situation is even more dire; I suspect that enabling tiling has made CPU reads through the GTT even slower than they were before:
1: UXA today
2: Glamor today
1            2                         Operation
------------ ------------------------- -------------------------
     43200.0      56300.0 (  1.303)    ShmGetImage 10x10 square
      2600.0      26200.0 ( 10.077)    ShmGetImage 100x100 square
       130.0       4250.0 ( 32.692)    ShmGetImage 500x500 square
      3260.0        202.0 (  0.062)    ShmGetImage XY 10x10 square
        36.7        170.0 (  4.632)    ShmGetImage XY 100x100 square
         1.5         56.4 ( 37.600)    ShmGetImage XY 500x500 square
     41700.0      50200.0 (  1.204)    GetImage 10x10 square
      2520.0      19300.0 (  7.659)    GetImage 100x100 square
       125.0       1360.0 ( 10.880)    GetImage 500x500 square
      3150.0        206.0 (  0.065)    GetImage XY 10x10 square
        36.1        183.0 (  5.069)    GetImage XY 100x100 square
         1.5         55.4 ( 36.933)    GetImage XY 500x500 square
Of course, this is all just x11perf, which doesn't represent real applications at all well. However, there are applications which end up doing more GetImage than would seem reasonable, and it's nice to have this kind of speed up.
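For what it's worth, the speed-up comes from Glamor doing the download through GL; in essence it bottoms out in a glReadPixels from the pixmap's texture-backed FBO, something like this sketch (format selection and clipping omitted; pixmap_fbo, x, y, w, h and dst are placeholders):

    /* Read a rectangle of a texture-backed pixmap back to memory. */
    glBindFramebuffer(GL_READ_FRAMEBUFFER, pixmap_fbo);
    glPixelStorei(GL_PACK_ALIGNMENT, 4);
    glReadPixels(x, y, w, h, GL_BGRA, GL_UNSIGNED_BYTE, dst);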
Status
I'm running this on my crash box to get some performance numbers and continue testing it. I'll switch my desktop over when I feel a bit more comfortable with how it's working. But I think it's feature complete at this point.
Where's the Code
As usual, the code is in my personal repository. It's on the 'glamor' branch.
git://people.freedesktop.org/~keithp/xf86-video-intel glamor