
A Forest of X Server Changes

We’ve got about another month left in the X server merge window for 1.17, and I’ve written a small set of fixes which haven’t yet been reviewed for merging. I thought I’d advertise them a bit and see if I couldn’t encourage a few of you to take a look and check whether they’re useful, correct and complete.

All of these are in my personal X server repository:

git://people.freedesktop.org/~keithp/xserver.git

Cleaning up the X Registry

Branch: registry-fixes

I’ll bet most of you don’t even know about this code. It serves as a database mapping various X enumerations to strings to aid in diagnostics. For the security extensions, SECURITY and XSELinux, it holds names for all of the requests, events and errors in the core protocol and all registered extensions. For X-Resource, it has the names of the registered resource types.

The X registry gets the request, event and error data from a file, “protocol.txt”, which is installed in /usr/lib/xorg/protocol.txt on my machine. It gets the resource names as a part of resource type allocation.

So, what’s wrong with this? Three basic things:

  1. A simple bug — protocol.txt is left open while the server runs. This consumes a file descriptor for no good reason.

  2. protocol.txt is read and parsed even if the security extensions aren’t available. This wastes time and memory.

  3. The resource names are kept even if X-Resource isn’t in use.

The fixes remove the configure options for including the registry code; these functions are only used by the above extensions, so we can tell whether to include the code based solely on whether the extensions are being built.
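
For the first two problems, the shape of the fix is easy to sketch; this illustration uses invented names (the actual patches are on the registry-fixes branch) and simply parses the file lazily, closing it as soon as it has been read:

#include <stdio.h>
#include <stdbool.h>

static bool registry_loaded;

/* Illustration only: parse protocol.txt on first use instead of at
 * startup, and don't keep the file open afterwards. */
static void
RegistryLoadIfNeeded(void)
{
    if (registry_loaded)
        return;
    registry_loaded = true;

    FILE *f = fopen("/usr/lib/xorg/protocol.txt", "r");
    if (!f)
        return;                 /* diagnostics just lose their names */

    char line[256];
    while (fgets(line, sizeof line, f)) {
        /* ... parse request/event/error names into the tables ... */
    }

    fclose(f);                  /* no descriptor held while the server runs */
}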

Getting rid of the TCP listener by default

Branch: listen-fixes

We’ve had the ‘-nolisten’ option for a while now to disable inbound TCP connections. It’s useful for security reasons, but we’ve never enabled this by default. This patch sequence provides configure options for each of the listen sockets (tcp, unix and local), leaves unix and local enabled by default and disables tcp by default.

A new option, ‘-listen’, is added which allows the user to override the -nolisten defaults in case they actually want to use TCP connections to X.

Glamor bug fixes

branch: glamor-fixes

This branch fixes two bugs:

  1. Scale a large pixmap down to a small pixmap. This happens when you display enormous images in a web page. Iceweasel sends the whole huge image to X and uses Render to scale it to the screen. If the image is larger than a single texture, the X server splits it up into tiles, but the code which tries to perform the merged scale is just broken. Five patches fix this.

  2. Shader-based trapezoids. This code uses area coverage to compute trapezoids. That violates the Render spec, which requires point sampling. Further, the performance of these trapezoids is lower than software (by a lot). This one patch removes the code.

Present bug fixes

branch: present-fixes

A selection of small bug fixes:

  1. Clear pending flips at CloseScreen. This removes a reference to any pending flip pixmap, allowing it to be freed. Otherwise, we’ll leak memory across server reset.

  2. Add support for PresentOptionCopy. This has been in the protocol spec for a while, and was completely trivial to implement. However, it never got done. One tiny little patch.

  3. Expose the Present API to drivers via sdksyms.sh. Until now, the present extension APIs have only been available inside the X server. This exposes them to drivers. This took a few cleanup patches first.

Use Present for Glamor XV

branch: glamor-present-xv

Painting XV to the screen should be done at vblank time to avoid tearing. Present offers vblank synchronized operations. Hooking those two together required a few new present APIs to expose the vblank functionality outside of the present code, then a bit of glamor code to hook up that new API to the XV bits.

Switching Glamor to a GL core profile context

branch: glamor-core-profile

This patch set is still in progress, but demonstrates how close we are. We’ll be requiring OpenGL 3.3 for this so that we get texture swizzling, which is required for our single channel objects.

The changes present on the branch are:

  1. Switch single channel surfaces from GL_ALPHA to GL_RED (sketched just after this list).

  2. Use vertex array objects.

  3. Switch ephyr over to using a core 3.3 profile.
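
For reference, the GL side of the first change looks roughly like this; the swizzle parameters are standard GL 3.3 (ARB_texture_swizzle), though the surrounding glamor code is elided:

/* Store single-channel data in a GL_RED texture, then swizzle reads so
 * they behave like the old GL_ALPHA format: RGB read as zero, alpha
 * reads the red channel. */
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, width, height, 0,
             GL_RED, GL_UNSIGNED_BYTE, bits);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SWIZZLE_R, GL_ZERO);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SWIZZLE_G, GL_ZERO);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SWIZZLE_B, GL_ZERO);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SWIZZLE_A, GL_RED);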

Still left to do:

  1. Switch Render code to VBOs

The core code uses VBOs everywhere, but the Render code doesn’t. This means that all Render drawing fails, which makes the resulting server not very useful.

My main objective for getting this done is to reduce memory usage by about 16MB, which is the space allocated for software rendering in Mesa in case someone does something which the hardware doesn’t handle; that can only happen with some legacy OpenGL APIs.

Please help out!

All of these friendly little patches are looking for a bit of review so that they can get merged before the 1.17 window closes.

Posted Mon Sep 15 10:14:16 2014

AltOS 1.5 — EasyMega support, features and bug fixes

Bdale and I are pleased to announce the release of AltOS version 1.5.

AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software.

This is a major release of AltOS, including support for our new EasyMega board and a host of new features and bug fixes.

AltOS Firmware — EasyMega added, new features and fixes

Our new flight computer, EasyMega, is a TeleMega without any radios:

  • 9 DoF IMU (3 axis accelerometer, 3 axis gyroscope, 3 axis compass).

  • Orientation tracking using the gyroscopes (and quaternions, which are lots of fun!)

  • Four fully-programmable pyro channels, in addition to the usual apogee and main channels.

AltOS Changes

We’ve made a few improvements in the firmware:

  • The APRS secondary station identifier (SSID) is now configurable by the user. By default, it is set to the last digit of the serial number.

  • Continuity of the four programmable pyro channels on EasyMega and TeleMega is now indicated via the beeper. Four tones are sent out after the continuity indication for the apogee and main channels with high tones indicating continuity and low tones indicating an open circuit.

  • Configurable telemetry data rates. You can now select among 38400 (the previous value, and still the default), 9600 or 2400 bps. To take advantage of this, you’ll need to reflash your TeleDongle or TeleBT.

AltOS Bug Fixes

We also fixed a few bugs in the firmware:

  • TeleGPS had separate flight logs, one for each time the unit was turned on. Turning the unit on to test stuff and turning it back off would consume one of the flight log ‘slots’ on the board; once all of the slots were full, no further logging would take place. Now, TeleGPS appends new data to an existing single log.

  • Increase the maximum computed altitude from 32767m to 2147483647m. Back when TeleMetrum v1.0 was designed, we never dreamed we’d be flying to 100k’ or more. Now that’s surprisingly common, and so we’ve increased the size of the altitude data values to fit modern rocketry needs.

  • Continuously evaluate pyro firing condition during delay period. The previous firmware would evaluate the pyro firing requirements and, once they were met, would delay by the indicated amount and then fire the channel, even if the conditions had changed state during the delay. Now the conditions are continuously evaluated during the delay period and, if they change state, the event is suppressed (see the sketch after this list).

  • Allow negative values in the pyro configuration. Now you can select a negative speed to indicate a descent rate or a negative acceleration value to indicate acceleration towards the ground.
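
In rough pseudo-C, the pyro change is from “check once, then fire after the delay” to the following (all names invented; the real firmware differs):

/* Re-check the firing condition on every tick of the delay; suppress
 * the event if it stops holding. Illustration only. */
if (pyro_condition_met(p)) {
    for (uint16_t t = 0; t < p->delay_ticks; t++) {
        delay_one_tick();
        if (!pyro_condition_met(p))
            return;             /* condition lapsed: suppress the event */
    }
    fire_pyro_channel(p->channel);
}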

AltosUI and TeleGPS — EasyMega support, OS integration and more

The AltosUI and TeleGPS applications have a few changes for this release:

  • EasyMega support. That was a simple matter of adapting the existing TeleMega support.

  • Added icons for our file types, and hooked up the file manager so that AltosUI, TeleGPS and/or MicroPeak are used to view any of our data files.

  • Configuration support for APRS SSIDs, and telemetry data rates.

Posted Sat Sep 13 11:47:42 2014

Reworking Intel Glamor

The original Intel driver Glamor support was based on the notion that it would be better to have the Intel driver capture any fall backs and try to make them faster than Glamor could do internally. Now that Glamor has reasonably complete acceleration and its fall backs aren’t terrible, this isn’t as useful as it once was; and because it uses Glamor in a weird way, it makes the Glamor code harder to maintain.

Fixing the Intel driver to not use Glamor in this way took a bit of effort; the UXA support is all tied into the overall operation of the driver.

Separating out UXA functions

The first task was to just identify which functions were UXA-specific by adding “_uxa” to their names. A couple dozen sed runs later, a bunch of the driver is looking better.

Next, a pile of UXA-specific functions were actually inside the non-UXA parts of the code. Those got moved out, and a new “intel_uxa.h” file was created to hold all of the definitions.

Finally, a few non-UXA-specific functions were actually in the uxa files; those got moved over to the generic code.

Removing the Glamor paths in UXA

Each one of the UXA functions had a little piece of code at the top like:

if (uxa_screen->info->flags & UXA_USE_GLAMOR) {
    int ok = 0;

    if (uxa_prepare_access(pDrawable, UXA_GLAMOR_ACCESS_RW)) {
        ok = glamor_fill_spans_nf(pDrawable,
                      pGC, n, ppt, pwidth, fSorted);
        uxa_finish_access(pDrawable, UXA_GLAMOR_ACCESS_RW);
    }

    if (!ok)
        goto fallback;

    return;
}

Pulling those out shrank the UXA code by quite a bit.

Selecting Acceleration (or not)

The intel driver only supported UXA before; Glamor was really just a slightly different mode for UXA. I switched the driver from using a bit in the UXA flags to having an ‘accel’ variable which could be one of three options:

  • ACCEL_GLAMOR
  • ACCEL_UXA
  • ACCEL_NONE

I added ACCEL_NONE to give us a dumb frame buffer mode. That actually supports DRI3, so we can bring up Mesa and run it under X before we have any acceleration code ready, avoiding a dependency loop when doing new hardware. All that it requires is a kernel that offers mode setting and buffer allocation.
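
In code form, that’s just a three-valued enum; the actual spelling in the driver may differ:

enum intel_accel_method {
    ACCEL_NONE,
    ACCEL_UXA,
    ACCEL_GLAMOR,
};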

Initializing Glamor

With UXA no longer supporting Glamor, it was time to plug the Glamor support into the top of the driver. That meant changing a bunch of the entry points to select appropriate Glamor or UXA functionality, instead of just calling into UXA. So, now we’ve got lots of places that look like:

        switch (intel->accel) {
#if USE_GLAMOR
        case ACCEL_GLAMOR:
                if (!intel_glamor_create_screen_resources(screen))
                        return FALSE;
                break;
#endif
#if USE_UXA
        case ACCEL_UXA:
                if (!intel_uxa_create_screen_resources(screen))
                        return FALSE;
                break;
#endif
        case ACCEL_NONE:
                if (!intel_none_create_screen_resources(screen))
                        return FALSE;
                break;
        }

Using a switch means that we can easily elide code that isn’t wanted in a particular build. Of course ‘accel’ is an enum, so places which are missing one of the required paths will cause a compiler warning.

It’s not all perfectly clean yet; there are piles of UXA-only paths still.

Making It Build Without UXA

The final trick was to make the driver build without UXA turned on; that took several iterations before I had the symbols sorted out appropriately.

I built the driver with various acceleration options and then tried to count the lines of source code. What I did was just list the source files named in the driver binary itself. This skips all of the header files and the render program source code, and ignores the fact that there are a bunch of #ifdef’s in the uxa directory selecting between uxa, glamor and none.

    Accel                    Lines          Size(B)
    -----------             ------          -------
    none                      7143            73039
    glamor                    7397            76540
    uxa                      25979           283777
    sna                     118832          1303904

    none legacy              14449           152480
    glamor legacy            14703           156125
    uxa legacy               33285           350685
    sna legacy              126138          1395231

The ‘legacy’ addition supports i810-class hardware, which is needed for a complete driver.

Along The Way, Enable Tiling for the Front Buffer

While hacking the code, I discovered that the initial frame buffer allocated for the screen was created without tiling (!) because a few parameters that depend on the GTT size were not initialized until after that frame buffer was allocated. I haven’t analyzed what effect this has on performance.

Page Flipping and Resize

Page flipping (or just flipping) means switching the entire display from one frame buffer to another. It’s generally the fastest way of updating the screen as you don’t have to copy any bits.

The trick with flipping is that a client hands you a random pixmap and you need to stuff that into the KMS API. With UXA, that’s pretty easy as all pixmaps are managed through the UXA API which knows which underlying kernel BO is tied with each pixmap. Using Glamor, only the underlying GL driver knows the mapping. Fortunately (?), we have the EGL Image extension, which lets us take a random GL texture and turn it into a file descriptor for a DMA-BUF kernel object. So, we have this cute little dance:

fd = glamor_fd_from_pixmap(screen, pixmap, &stride, &size);
bo = drm_intel_bo_gem_create_from_prime(intel->bufmgr, fd, size);
close(fd);
intel_glamor_get_pixmap(pixmap)->bo = bo;

That last bit remembers the bo in some local memory so we don’t have to do this more than once for each pixmap. glamor_fd_from_pixmap ends up calling eglCreateImageKHR followed by gbm_bo_import and then a kernel ioctl to convert a prime handle into an fd. It’s all quite round-about, but it does seem to work just fine.

After I’d gotten Glamor mostly working, I tried a few OpenGL applications and discovered flipping wasn’t working. That turned out to have an unexpected consequence: all full-screen applications would run flat-out instead of being limited to the frame rate. Present ‘recovers’ from a failed flip queue operation by immediately performing a CopyArea, not waiting for vblank. This needs to get fixed in Present by having it re-queue the CopyArea for the right time. What I did in the intel driver was to add a bunch more checks for tiling mode, pixmap stride and other things to catch pixmaps that were going to fail before the operation was queued, forcing them to fall back to CopyArea at the right time.
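
The shape of those checks, much simplified and with a couple of invented field names, is something like:

/* Illustration only: reject pixmaps the kernel would refuse to scan out
 * before queueing the flip, so the CopyArea fallback happens on time. */
static Bool
intel_can_flip(intel_screen_private *intel, PixmapPtr pixmap)
{
    struct intel_pixmap *priv = intel_glamor_get_pixmap(pixmap);

    if (!priv || !priv->bo)
        return FALSE;           /* no kernel buffer behind this pixmap */
    if (priv->tiling == I915_TILING_Y)
        return FALSE;           /* scanout can't use Y tiling */
    if (pixmap->devKind != intel->front_pitch)
        return FALSE;           /* stride must match the front buffer */
    return TRUE;
}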

The second adventure was with XRandR. Glamor has an API to fix up the screen pixmap for a new frame buffer, but that pulls the size of the frame buffer out of the pixmap instead of out of the screen. XRandR leaves the pixmap size set to the old screen size during this call; fixing that just meant getting the pixmap size set correctly before calling into glamor. I think glamor should get fixed to use the screen size rather than the pixmap size.

Painting Root before Mode set

The X server has generally done initialization in one order:

  1. Create root pixmap
  2. Set video modes
  3. Paint root window

Recently, we’ve added a ‘-background none’ option to the X server which causes it to set the root window background to none and have the driver fill in that pixmap with whatever contents were on the screen before the X server started.

In a pre-Glamor world, that was done by hacking the video driver to copy the frame buffer console contents to the root pixmap as it was created. The trouble here is that the root pixmap is created long before the upper layers of the X server are ready for drawing, so you can’t use the core rendering paths. Instead, UXA had kludges to call directly into the acceleration functions.

What we really want though is to change the order of operations:

  1. Create root pixmap
  2. Paint root window
  3. Set video mode

That way, the normal root window painting operation will take care of getting the image ready before that pixmap is ever used for scanout. I can use regular core X rendering to get the original frame buffer contents into the root window, and even if we’re not using -background none and are instead painting the root with some other pattern (like the root weave), I get that presented without an intervening black flash.

That turned out to be really easy: just delay the call to I830EnterVT (which sets the modes) until the server is actually running. That required one additional kludge: I needed to tell the DIX level RandR functions about the new modes, because the mode setting operation used during server init doesn’t call up into RandR; RandR collects the current configuration after the screen has been initialized, which is when the modes used to be set.

Calling xf86RandR12CreateScreenResources does the trick nicely. Getting the root window bits from fbcon, setting video modes and updating the RandR/Xinerama DIX info is now all done from the BlockHandler the first time it is called.

Performance

I ran the current glamor version of the intel driver with the master branch of the X server and, aside from GetImage, there were no huge differences since my last Glamor performance evaluation. The reason is that UXA/Glamor never called Glamor’s image functions, and the UXA GetImage is pretty slow. Using Mesa’s image download turns out to have a huge performance benefit:

1: UXA/Glamor from April
2: Glamor from today

       1                 2                 Operation
------------   -------------------------   -------------------------
     50700.0        56300.0 (     1.110)   ShmGetImage 10x10 square 
     12600.0        26200.0 (     2.079)   ShmGetImage 100x100 square 
      1840.0         4250.0 (     2.310)   ShmGetImage 500x500 square 
      3290.0          202.0 (     0.061)   ShmGetImage XY 10x10 square 
        36.5          170.0 (     4.658)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     49800.0        50200.0 (     1.008)   GetImage 10x10 square 
      5690.0        19300.0 (     3.392)   GetImage 100x100 square 
       609.0         1360.0 (     2.233)   GetImage 500x500 square 
      3100.0          206.0 (     0.066)   GetImage XY 10x10 square 
        36.4          183.0 (     5.027)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Running UXA from today, the situation is even more dire; I suspect that enabling tiling has made CPU reads through the GTT even worse than before.

1: UXA today
2: Glamor today

       1                 2                 Operation
------------   -------------------------   -------------------------
     43200.0        56300.0 (     1.303)   ShmGetImage 10x10 square 
      2600.0        26200.0 (    10.077)   ShmGetImage 100x100 square 
       130.0         4250.0 (    32.692)   ShmGetImage 500x500 square 
      3260.0          202.0 (     0.062)   ShmGetImage XY 10x10 square 
        36.7          170.0 (     4.632)   ShmGetImage XY 100x100 square 
         1.5           56.4 (    37.600)   ShmGetImage XY 500x500 square 
     41700.0        50200.0 (     1.204)   GetImage 10x10 square 
      2520.0        19300.0 (     7.659)   GetImage 100x100 square 
       125.0         1360.0 (    10.880)   GetImage 500x500 square 
      3150.0          206.0 (     0.065)   GetImage XY 10x10 square 
        36.1          183.0 (     5.069)   GetImage XY 100x100 square 
         1.5           55.4 (    36.933)   GetImage XY 500x500 square

Of course, this is all just x11perf, which doesn’t represent real applications at all well. However, there are applications which end up doing more GetImage than would seem reasonable, and it’s nice to have this kind of speed up.

Status

I’m running this on my crash box to get some performance numbers and continue testing it. I’ll switch my desktop over when I feel a bit more comfortable with how it’s working. But, I think it’s feature complete at this point.

Where’s the Code

As usual, the code is in my personal repository. It’s on the ‘glamor’ branch.

git://people.freedesktop.org/~keithp/xf86-video-intel  glamor
Posted Mon Jul 21 00:39:04 2014

AltOS 1.4.1 — Fix ups for 1.4

Bdale and I are pleased to announce the release of AltOS version 1.4.1.

AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software.

This is a minor release of AltOS, fixing a small handful of build and install issues. No new features have been added, and the only firmware change was to make sure that updated TeleMetrum v2.0 firmware is included in this release.

AltOS — TeleMetrum v2.0 firmware included

AltOS version 1.4 shipped without updated firmware for TeleMetrum v2.0. There are a couple of useful new features and bug fixes in that version, so if you have a TeleMetrum v2.0 board with older firmware, you should download this release and update it.

AltosUI and TeleGPS — Signed Windows Drivers, faster maps downloading

We finally figured out how to get our Windows drivers signed, making it easier for Windows 7 and 8 users to install our software and use our devices.

Also for Windows users, we’ve fixed the Java version detection so that if you have Java 8 already installed, AltOS and TeleGPS won’t try to download Java 7 and install that. We also fixed the Java download path so that if you have no Java installed, we’ll download a working version of Java 6 instead of using an invalid Java 7 download URL.

Finally, for everyone, we fixed maps downloading to use the authorized Google API key method for getting map tiles. This makes map downloading faster and more reliable.

Thanks for flying with Altus Metrum!

Posted Tue Jun 24 22:35:08 2014

TeleGPS Battery Life

I charged up one of the “160mAh” batteries that we sell. (The ones we’ve got now are labeled 200mAh; the 160mAh rating is something like a minimum that we expect to be able to ever get at that size.)

I connected the battery to a TeleGPS board, hooked up a telemetry monitoring setup on my laptop and set the device in the window of my office. This let me watch the battery voltage through the day without interrupting my other work. Of course, because the telemetry was logged to a file, I’ve now got a complete plot of the voltage data.

It looks like a pretty typical lithium polymer discharge graph: a slightly faster drop from the 4.1V full-charge voltage down to about 3.9V, then a gradual drop to 3.65V, at which point it starts to dive as the battery is nearly discharged.

Because we run the electronics at 3.3V, and the LDO has a dropout of about 100mV, it’s best if the battery stays above 3.4V. That occurred at around 21500 seconds of run time, or almost exactly six hours.

We also have an “850mAh” battery in the shop; I’d expect that to last a bit more than four times as long, or about a day. Maybe I’ll get bored enough at some point to hook one up and verify that guess.
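
(The arithmetic behind that guess: (850 mAh / 200 mAh) × 6 hours ≈ 25 hours.)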

Posted Tue Jun 17 19:23:49 2014

AltOS 1.4 — TeleGPS support, features and bug fixes

Bdale and I are pleased to announce the release of AltOS version 1.4.

AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software.

This is a major release of AltOS, including support for our new TeleGPS board and a host of new features and bug fixes.

AltOS Firmware — TeleGPS added, new features and fixes

Our new tracker, TeleGPS, works quite differently from a flight computer:

  • Starts tracking and logging at power-on.

  • Disables RF and logging only when connected to USB.

  • Stops logging position when it hasn’t moved for a long time.

TeleGPS transmits our digital telemetry protocol, APRS and radio direction finding beacons.

For TeleMega, we’ve made the firing time for the additional pyro channels (A-D) configurable, in case the default (50ms) isn’t long enough.

AltOS Beeping Changes

The three-beep startup tones have been replaced with a report of the current battery voltage. This is nice on all of the boards, but particularly useful with EasyMini, which doesn’t have the benefit of telemetry reporting its state.

We also changed the other state tones to “Farnsworth” spacing. This makes them all faster, and easier to distinguish from the numeric reports of voltage and altitude.

Finally, we’ve added the ability to change the frequency of the beeper tones. This is nice when you have two Altus Metrum flight computers in the same ebay and want to be able to tell the beeps apart.

AltOS Bug Fixes

Fixed a bug which prevented you from using TeleMega’s extra pyro channel ‘Flight State After’ configuration value.

AltOS 1.3.2 on TeleMetrum v2.0 and TeleMega would reset the flight number to 2 after erasing flights; that’s been fixed.

AltosUI — New Maps, igniter tab and a few fixes

With TeleGPS tracks now potentially ranging over a much wider area than a typical rocket flight, the Maps interface has been updated to include zooming and multiple map styles. It also now uses less memory, which should make it work on a wider range of systems.

For TeleMega, we’ve added an ‘Igniter’ tab to the flight monitor interface so you can check voltages on the extra pyro channels before pushing the button.

We’re hoping that the new Maps interface will load and run on machines with limited memory for Java applications; please let us know if this changes anything for you.

TeleGPS — All new application just for TeleGPS

While TeleGPS shares the same telemetry and data logging capabilities as all of the Altus Metrum flight computers, its use as a tracker is expected to be both broader and simpler than the rocketry-specific systems. We’ve built a custom TeleGPS application that incorporates the mapping and data visualization aspects of AltosUI, but eliminates all of the rocketry-specific flight state tracking.

Posted Sun Jun 15 19:48:15 2014

Current Glamor Performance

I finally managed to get a Gigabyte Brix set up running Debian so that I could do some more reasonable performance characterization of Glamor in its current state. I wanted to use this particular machine because it has enough cooling to keep from thermally throttling the CPU/GPU package.

This is running my glamor-server branch of the X server, which completes the core operation rework and then has some core X server performance improvements as well for filled and outlined arcs.

Changes in X11perf

First off, I did some analysis of what x11perf was doing and found that it wasn’t quite measuring what we thought. I’m not interested in competing on x11perf numbers absolutely, I’m only interested in doing relative measurements of useful operations, so fixing the tool to measure what I want seems reasonable to me.

When x11perf was first written, it drew 100x100 rectangles tight against one another without any gap. And, it filled the window with them, drawing a 6x6 grid of 100x100 rectangles in a 600x600 window. To better exercise the rectangle code and check edge conditions better, we added a one pixel gap between the rectangles. However, we didn’t reduce the number of rectangles drawn, so we ended up drawing 11 of the 36 rectangles on top of the first set of 25. Simple region computations would allow the X server to draw only 25 most of the time, skipping the redundant rectangles.

The vertical and horizontal line tests were added a while after the first set of tests, and were done without regard to how an X server might optimize for them. x11perf draws these lines packed tightly together, creating a single square of pixels for the result.

EXA, UXA and SNA all take vertical and horizontal lines and convert them to rectangles, then clip the rectangles against the window clip list by computing a region from them and intersecting that with the GC composite clip. It’s a completely reasonable plan; however, when you take what x11perf was drawing and run it through this code, you end up with a single solid rectangle, which is surprisingly fast compared with drawing individual lines.

I “fixed” the overlapping rectangle case by reducing the number of boxes drawn from 36 to 25, and I fixed the vertical and horizontal line case by spacing the lines a pixel apart.

I’ve pushed out these changes to my x11perf repository on freedesktop.org.

What’s Fast

Things that match GL’s capabilities are fast; things which don’t are slow. No surprises there. What’s interesting is precisely what matches GL.

Patterns For Free

Because GL makes it easy to program fill patterns into the GPU, there are essentially no performance differences between solid and patterned operations.

GL Lines

Glamor uses GL lines, which can be programmed to match X semantics, to quite good effect. The only trick required was to deal with cap styles. GL never draws the final pixel in a line, while X does unless the cap style is CapNotLast. The solution was to draw an extra line segment containing a single pixel at the end of every joined set of lines for this case.
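
A hedged sketch of that cap fix (the add_line helper is invented):

/* GL never draws a line's final pixel; X does unless CapNotLast.
 * Append a one-pixel segment at the end of each joined polyline. */
if (gc->capStyle != CapNotLast) {
    short x = pts[npt - 1].x, y = pts[npt - 1].y;
    add_line(v, x, y, x + 1, y);    /* lights exactly the last pixel */
}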

The other implicit requirement is that all zero width lines look the same. Right now, I’ve solved that for fill styles and raster ops as they’re all drawn with the same GL operations. However, for plane masks, we’re currently falling back to software, which may draw something different. Fixing that isn’t impossible, it’s just tedious.

Text

Pushing all of the work of drawing core text into glamor wasn’t terribly difficult, and the results are pretty spectacular.

What’s Slow

We’ve still got room for improvement in Glamor, but there aren’t any obvious show-stoppers to getting great performance for reasonable X applications anymore.

Wide Lines and Arcs

One of the speed-ups I’ve made in my glamor branch is to merge the drawing of multiple filled and zero-width arcs into a single giant GL request. That turned out both to improve performance and to save a bit of code. Right now, drawing wide lines and wide arcs doesn’t do this, so we suffer from submitting many smaller requests to GL. It’s hard to get excited about speeding any of this up, as all of the wide primitives are essentially unused these days.

Filled Polygons

Because X only lets applications draw a single polygon in each request, Glamor can’t really gain any efficiency from batching work unless we start looking ahead in the X protocol stream to see if the next request is another polygon. Alternatively, we could leave the span operation pending to see if more spans were coming before the X server went idle. Neither of these is all that exciting though; X polygons just aren’t that useful.

Render Operations

These are still not structured to best fit modern GL; some work here would help a bunch. We’ve got a GSoC student ready to go at this though, so I expect we’ll have much better numbers in a few months.

Window Operations

You wouldn’t think that moving and resizing windows would be so limited by drawing performance, but x11perf tests these with tiny little windows, and each operation draws or copies only a couple of little rectangles, which makes GL quite expensive. Working on speeding up GL for small numbers of operations would help a bunch here.

Unexpected Results

Solid rectangles are actually running slower than patterned rectangles, and I really have no idea why. The CPU is close to idle during the 500x500 solid rectangle test (as you’d expect, given the workload), the vertex and fragment shaders look correct out of the compiler, and yet solid rectangles run at only 0.80 of the performance of the patterned rectangles.

GL semantics for copying data essentially preclude overlapping blts of any form. There’s the NV_texture_barrier extension, which at least lets us do blts within the same object, but even that doesn’t define how overlapping blts work. So, we have to create a temporary copy for this operation to make it work right. I was worried that this would slow things down, but the Iris Pro 3D engine is enough faster than the 2D engine that even with the extra copy, large scrolls and copies within the same object are actually faster.

Results

Here’s a giant image showing the ratio of Glamor to both UXA and SNA running on the same machine, with all of the same software; the only change between runs was to switch the configured acceleration architecture.

Posted Sun Apr 27 01:46:50 2014

Java Sound on Linux

I’m often in the position of having my favorite Java program (AltosUI) unable to make any sounds. Here’s a history of the various adventures I’ve had.

Java and PulseAudio ALSA support

When we started playing with Java a few years ago, we discovered that if PulseAudio were enabled, Java wouldn’t make any sound. Presumably, that was because the ALSA emulation layer offered by PulseAudio wasn’t capable of supporting Java.

The fix for that was to make sure pulseaudio would never run. That’s harder than it seems; pulseaudio is like the living dead, rising from the grave every time you kill it. As it’s nearly impossible to install any desktop applications without gaining a bogus dependency on pulseaudio, the solution that works best is to make sure dpkg never manages to actually install the program, using dpkg-divert:

# dpkg-divert --rename /usr/bin/pulseaudio

With this in place, Java was a happy camper for a long time.

Java and PulseAudio Native support

More recently, Java has apparently gained some native PulseAudio support in some fashion. Of course, I couldn’t actually get it to work, even after running the PulseAudio daemon. Worse, some kind Debian developer decided that sound should be broken by default for all Java applications and selected the PulseAudio back-end in the Java audio configuration file.

Fixing that involved learning about said Java audio configuration file and then applying a patch to revert the Debian packaging damage.

$ cat /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/sound.properties
...
#javax.sound.sampled.Clip=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider
#javax.sound.sampled.Port=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider
#javax.sound.sampled.SourceDataLine=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider
#javax.sound.sampled.TargetDataLine=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider

javax.sound.sampled.Clip=com.sun.media.sound.DirectAudioDeviceProvider
javax.sound.sampled.Port=com.sun.media.sound.PortMixerProvider
javax.sound.sampled.SourceDataLine=com.sun.media.sound.DirectAudioDeviceProvider
javax.sound.sampled.TargetDataLine=com.sun.media.sound.DirectAudioDeviceProvider

You can see the PulseAudio mistakes at the top of that listing, with the corrected native interface settings at the bottom.

Java and single-open ALSA drivers

It used to be that ALSA drivers could support multiple applications having the device open at the same time. Those with hardware mixing would use that to merge the streams together; those without hardware mixing might do that in the kernel itself. While the latter is probably not a great plan, it did make ALSA a lot more friendly to users.

My new laptop is not friendly, and returns EBUSY when you try to open the PCM device more than once.

After downloading the jdk and alsa library sources, I figured out that Java was trying to open the PCM device multiple times when using the standard Java sound API in the simplest possible way. I thought I was going to have to fix Java, when I discovered that ALSA provides user-space mixing with the ‘dmix’ plugin. I enabled that on my machine and now all was well.

$ cat /etc/asound.conf
pcm.!default {
    type plug
    slave.pcm "dmixer"
}

pcm.dmixer  {
    type dmix
    ipc_key 1024
    slave {
        pcm "hw:1,0"
        period_time 0
        period_size 1024
        buffer_size 4096
        rate 44100
    }
    bindings {
        0 0
        1 1
    }
}

ctl.dmixer {
    type hw
    card 1
}

ctl.!default {
    type hw
    card 1
}

As you can see, my sound card is not number 0, it’s number 1, so if your card is a different number, you’ll have to adapt as necessary.

Posted Sat Apr 5 22:30:55 2014

Core Rendering with Glamor

I’ve hacked up the intel driver to bypass all of the UXA paths when Glamor is enabled so I’m currently running an X server that uses only Glamor for all rendering. There are still too many fall backs, and performance for some operations is not what I’d like, but it’s entirely usable. It supports DRI3, so I even have GL applications running.

Core Rendering Status

I’ve continued to focus on getting the core X protocol rendering operations complete and correct; those remain a critical part of many X applications and are a poor match for GL. At this point, I’ve got accelerated versions of the basic spans functions, filled rectangles, text and copies.

GL and Scrolling

OpenGL has been on a many-year vendetta against one of the most common 2D accelerated operations — copying data within the same object, even when that operation overlaps itself. This used to be the most performance-critical operation in X; it was used for scrolling your terminal windows and when moving windows around on the screen.

Reviewing the OpenGL 3.x spec, Eric and I both read the glCopyPixels specification as clearly requiring correct semantics for overlapping copy operations — it says that the operation must be equivalent to reading the pixels and then writing the pixels. My CopyArea acceleration thus uses this path for the self-copy case. However, the ARB decided that having a well defined blt operation was too nice to the users, so the current 4.4 specification adds explicit language to assert that this is not well defined anymore (even in the face of the existing language which is pretty darn unambiguous).

I suspect we’ll end up creating an extension that offers what we need here; applications are unlikely to stop scrolling stuff around, and GPUs (at least Intel) will continue to do what we want. This is the kind of thing that makes GL maddening for 2D graphics — the GPU does what we want, and the GL doesn’t let us get at it.

For implementations not capable of supporting the required semantic, someone will presumably need to write code that creates a temporary copy of the data.

PBOs for fall backs

For operations which Glamor can’t manage, we need to fall back to using a software solution. Direct-to-hardware acceleration architectures do this by simply mapping the underlying GPU object to the CPU. GL doesn’t provide this access, and it’s probably a good thing, as such access needs to be carefully synchronized with GPU access, and attempting to access tiled GPU objects with the CPU requires either piles of CPU code to ‘de-tile’ accesses (ala wfb) or special hardware detilers (like the Intel GTT).

However, GL does provide a fairly nice abstraction called pixel buffer objects (PBOs) which work to speed access to GPU data from the CPU.

The fallback code allocates a PBO for each relevant X drawable, asks GL to copy pixels in, and then calls fb, with the drawable now referencing the temporary buffer. On the way back out, any potentially modified pixels are copied back through GL and the PBOs are freed.

This turns out to be dramatically faster than malloc’ing temporary buffers as it allows the GL to allocate memory that it likes, and for it to manage the data upload and buffer destruction asynchronously.
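
In GL terms, the download half of that looks roughly like this (standard GL calls; glamor’s real code adds caching and handles formats and sub-rectangles):

/* Read a region of the drawable back through a pixel buffer object,
 * letting GL schedule the transfer. Simplified illustration. */
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, w * h * 4, NULL, GL_STREAM_READ);
glReadPixels(x, y, w, h, GL_BGRA, GL_UNSIGNED_BYTE, NULL);

void *cpu = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
/* ... point fb at 'cpu' and run the software rendering ... */
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
glDeleteBuffers(1, &pbo);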

Because X pixmaps can contain many X windows (the root pixmap being the most obvious example), they are often significantly larger than the actual rendering target area. As an optimization, the code only copies data from the relevant area of the pixmap, saving considerable time as a result. There’s even an interface which further restricts that to a subset of the target drawable which the Composite function uses.

Using Scissoring for Clipping

The GL scissor operation provides a single clipping rectangle. X provides a list of rectangles to clip to. There are two obvious ways to perform clipping here — either perform all clipping in software, or hand each X clipping rectangle in turn to GL and re-execute the entire rendering operation for each rectangle.

You’d think that the former plan would be the obvious choice; clearly re-executing the entire rendering operation potentially many times is going to take a lot of time in the GPU.

However, the reality is that most X drawing occurs under a single clipping rectangle. Accelerating this common case by using the hardware clipper provides enough benefit that we definitely want to use it when it works. We could duplicate all of the rendering paths and perform CPU-based clipping when the number of rectangles was above some threshold, but the additional code complexity isn’t obviously worth the effort, given how infrequently it will be used. So I haven’t bothered. Most operations look like this:

Allocate VBO space for data

Fill VBO with X primitives

loop over clip rects {
    glScissor()
    glDrawArrays()
}

This obviously out-sources as much of the problem as possible to the GL library, reducing the CPU time spent in glamor to a minimum.

A Peek at Some Code

With all of these changes in place, drawing something like a list of rectangles becomes a fairly simple piece of code:

First, make sure the program we want to use is available and can be used with our GC configuration:

prog = glamor_use_program_fill(pixmap, gc,
                               &glamor_priv->poly_fill_rect_program,
                               &glamor_facet_polyfillrect);

if (!prog)
    goto bail_ctx;

Next, allocate the VBO space and copy all of the X data into it. Note that the data transfer is simply ‘memcpy’ here — that’s because we break the X objects apart in the vertex shader using instancing, avoiding the CPU overhead of computing four corner coordinates.

/* Set up the vertex buffers for the points */

v = glamor_get_vbo_space(drawable->pScreen, nrect * (4 * sizeof (GLshort)), &vbo_offset);

glEnableVertexAttribArray(GLAMOR_VERTEX_POS);
glVertexAttribDivisor(GLAMOR_VERTEX_POS, 1);
glVertexAttribPointer(GLAMOR_VERTEX_POS, 4, GL_SHORT, GL_FALSE,
                      4 * sizeof (GLshort), vbo_offset);

memcpy(v, prect, nrect * sizeof (xRectangle));

glamor_put_vbo_space(screen);

Finally, loop over the pixmap tile fragments, and then over the clip list, selecting the drawing target and painting the rectangles:

glEnable(GL_SCISSOR_TEST);

glamor_pixmap_loop(pixmap_priv, box_x, box_y) {
    int nbox = RegionNumRects(gc->pCompositeClip);
    BoxPtr box = RegionRects(gc->pCompositeClip);

    glamor_set_destination_drawable(drawable, box_x, box_y, TRUE, FALSE, prog->matrix_uniform, &off_x, &off_y);

    while (nbox--) {
        glScissor(box->x1 + off_x,
                  box->y1 + off_y,
                  box->x2 - box->x1,
                  box->y2 - box->y1);
        box++;
        glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, nrect);
    }
}

GL texture size limits

X pixmaps use 16 bit dimensions for width and height, allowing them to be up to 65536 x 65536 pixels. Because the X coordinate space is signed, only a quarter of this space is actually useful, which makes the useful size of X pixmaps only 32767 x 32767. This is still larger than most GL implementations offer as a maximum texture size though, and while it would be nice to just say ‘we don’t allow pixmaps larger than GL textures’, the reality is that many applications expect to be able to allocate such pixmaps today, largely to hold the ever increasing size of digital photographs.

Glamor has always supported large X pixmaps; it does this by splitting them up into tiles, each of which is no larger than the largest texture supported by the driver. What I’ve added to Glamor is some simple macros that walk over the array of tiles, making it easy for the rendering code to support large pixmaps without needing any special case code.

Glamor also had some simple testing support — you can compile the code to ignore the system-provided maximum texture size and supply your own value. This code had gone stale, and couldn’t work as there were parts of the code for which tiling support just doesn’t make sense, like the glyph cache, or the X scanout buffer. I fixed things so that you could leave those special cases as solitary large tiles while breaking up all other pixmaps into tiles no larger than 32 pixels square.

I hope to remove the single-tile case and leave the code supporting only the multiple-tile case; we have to have the latter code, and having the single-tile code around simply increases our code size for no obvious benefit.

Getting accelerated copies between tiled pixmaps added a new coordinate system to the mix and took a few hours of fussing until it was working.

Rebasing Many (many) Times

I’m sure most of us remember the days before git; changes were often monolithic, and the notion of changing how the changes were made for the sake of clarity never occurred to anyone. It used to be that the final code was the only interesting artifact; how you got there didn’t matter to anyone. Things are different today; I probably spend a third of my development time communicating how the code should change with other developers by changing the sequence of patches that are to be applied.

In the case of Glamor, I’ve now got a set of 28 patches. The first few are fixes outside of the glamor tree that make the rest of the server work better. Then there are a few general glamor infrastructure additions. After that, each core operation is replaced, one at a time. Finally, a bit of stale code is removed. By sequencing things in a logical fashion, I hope to make review of the code easier, which should mean that people will spend less time trying to figure out what I did and more time figuring out whether what I did is reasonable and correct.

Supporting Older Versions of GL

All of the new code uses vertex instancing to move coordinate computation from the CPU to the GPU. I’m also pulling textures apart using integer operations. Right now, we should correctly fall back to software for older hardware, but it would probably be nicer to just fall back to simpler GL instead. Unless everyone decides to go buy hardware with new enough GL driver support, someone is going to need to write simplified code paths for glamor.

If you’ve got such hardware, and are interested in making it work well, please take this as an opportunity to help yourself and others.

Near-term Glamor Goals

I’m pretty sure we’ll have the code in good enough shape to merge before the window closes for X server 1.16. Eric is in charge of the glamor tree, so it’s up to him when stuff is pulled in. He and Markus Wick have also been generating code and reviewing stuff, but we could always use additional testing and review to make the code as good as possible before the merge window closes.

Markus has expressed an interest in working on Glamor as a part of the X.org summer of code this year; there’s clearly plenty of work to do here. Eric and I haven’t touched the render acceleration stuff at all, and that code could definitely use some updating to use more modern GL features.

If that works as well as the core rendering code changes, then we can look forward to a Glamor which offers GPU-limited performance for classic X applications, without requiring GPU-specific drivers for every generation of every chip.

Posted Sat Mar 22 01:20:34 2014

Brief Glamor Hacks

Eric Anholt started writing Glamor a few years ago. The goal was to provide credible 2D acceleration based solely on the OpenGL API; in particular, to implement the X drawing primitives, both core and Render extension, without any GPU-specific code. When he started, the thinking was that fixed-function devices were still relevant, so the original code didn’t insist upon “modern” OpenGL features like GLSL shaders. That made the code less efficient and harder to write.

Glamor used to be a side-project within the X world, seen as something that really wasn’t very useful, something that any credible 2D driver would replace with custom highly-optimized GPU-specific code. Eric and I both hoped that Glamor would turn into something credible and that we’d be able to eliminate all of the horror-show GPU-specific code in every driver for drawing X text, rectangles and composited images. That hadn’t happened though, until now.

Fast forward to the last six months. Eric has spent a bunch of time cleaning up Glamor internals, and in fact he’s had it merged into the core X server for version 1.16 which will be coming up this July. Within the Glamor code base, he’s been cleaning some internal structures up and making life more tolerable for Glamor developers.

Using libepoxy

A big part of the cleanup was transitioning all of the extension function calls to use his other new project, libepoxy, which provides a sane, consistent and performant API to OpenGL extensions for Linux, Mac OS and Windows. That library is awesome, and you should use it for everything you do with OpenGL because not using it is like pounding nails into your head. Or eating non-tasty pie.

Using VBOs in Glamor

One thing he recently cleaned up was how to deal with VBOs during X operations. VBOs are absolutely essential to modern OpenGL applications; they’re really the only way to efficiently pass vertex data from application to the GPU. And, every other mechanism is deprecated by the ARB as not a part of the blessed ‘core context’.

Glamor provides a simple way of getting some VBO space, dumping data into it, and then using it through two wrapping calls which you use along with glVertexAttribPointer as follows:

pointer = glamor_get_vbo_space(screen, size, &offset);
glVertexAttribPointer(attribute_location, count, type,
              GL_FALSE, stride, offset);
memcpy(pointer, data, size);
glamor_put_vbo_space(screen);

glamor_get_vbo_space allocates the specified amount of VBO space and returns a pointer to that along with an ‘offset’, which is suitable to pass to glVertexAttribPointer. You dump your data into the returned pointer, call glamor_put_vbo_space and you’re all done.

Actually Drawing Stuff

At the same time, Eric has been optimizing some of the existing rendering code. But, all of it is still frankly terrible. Our dream of credible 2D graphics through OpenGL just wasn’t being realized at all.

On Monday, I decided that I should go play in Glamor for a few days, both to hack up some simple rendering paths and to familiarize myself with the insides of Glamor as I’m getting asked to review piles of patches for it, and not understanding a code base is a great way to help introduce serious bugs during review.

I started with the core text operations. Not because they’re particularly relevant these days as most applications draw text with the Render extension to provide better looking results, but instead because they’re often one of the hardest things to do efficiently with a heavy weight GPU interface, and OpenGL can be amazingly heavy weight if you let it.

Eric spent a bunch of time optimizing the existing text code to try and make it faster, but at the bottom, it actually draws each lit pixel as a tiny GL_POINT object by sending a separate x/y vertex value to the GPU (using the above VBO interface). This code walks the array of bits in the font, checking each one to see if it is lit, then checking whether the lit pixel is within the clip region, and only then adding the coordinates of the lit pixel to the VBO. The amazing thing is that even with all of this CPU and GPU work, the venerable 6x13 font is drawn at an astonishing 3.2 million glyphs per second. Of course, pure software draws text at 9.3 million glyphs per second.

I suspected that a more efficient implementation might be able to draw text a bit faster, so I decided to just start from scratch with a new GL-based core X text drawing function. The plan was pretty simple:

  1. Dump all glyphs in the font into a texture. Store them in 1bpp format to minimize memory consumption.

  2. Place raw (integer) glyph coordinates into the VBO, four for each glyph, and draw a GL_QUAD per glyph (sketched just after this list).

  3. Transform the glyph coordinates into the usual GL range (-1..1) in the vertex shader.

  4. Fetch a suitable byte from the glyph texture, extract a single bit and then either draw a solid color or discard the fragment.
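
As a rough sketch of step 2 (the vertex layout and the emit_vertex helper are invented for illustration; the real code is in glamor_text.c):

/* Four vertices per glyph: destination position plus the glyph's cell
 * in the 1bpp atlas texture. Illustration only. */
for (int i = 0; i < count; i++) {
    emit_vertex(v, x,     y,     cx,     cy);
    emit_vertex(v, x + w, y,     cx + w, cy);
    emit_vertex(v, x + w, y + h, cx + w, cy + h);
    emit_vertex(v, x,     y + h, cx,     cy + h);
    x += charinfo[i]->metrics.characterWidth;
}
glDrawArrays(GL_QUADS, 0, 4 * count);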

This makes the X server code surprisingly simple; it computes integer coordinates for the glyph destination and glyph image source and writes those to the VBO. When all of the glyphs are in the VBO, it just calls glDrawArrays(GL_QUADS, 0, 4 * count). The results were “encouraging”:

1: fb-text.perf
2: glamor-text.perf
3: keith-text.perf

       1                 2                           3                 Operation
------------   -------------------------   -------------------------   -------------------------
   9300000.0      3160000.0 (     0.340)     18000000.0 (     1.935)   Char in 80-char line (6x13) 
   8700000.0      2850000.0 (     0.328)     16500000.0 (     1.897)   Char in 70-char line (8x13) 
   6560000.0      2380000.0 (     0.363)     11900000.0 (     1.814)   Char in 60-char line (9x15) 
   2150000.0       700000.0 (     0.326)      7710000.0 (     3.586)   Char16 in 40-char line (k14) 
    894000.0       283000.0 (     0.317)      4500000.0 (     5.034)   Char16 in 23-char line (k24) 
   9170000.0      4400000.0 (     0.480)     17300000.0 (     1.887)   Char in 80-char line (TR 10) 
   3080000.0      1090000.0 (     0.354)      7810000.0 (     2.536)   Char in 30-char line (TR 24) 
   6690000.0      2640000.0 (     0.395)      5180000.0 (     0.774)   Char in 20/40/20 line (6x13, TR 10) 
   1160000.0       351000.0 (     0.303)      2080000.0 (     1.793)   Char16 in 7/14/7 line (k14, k24) 
   8310000.0      2880000.0 (     0.347)     15600000.0 (     1.877)   Char in 80-char image line (6x13) 
   7510000.0      2550000.0 (     0.340)     12900000.0 (     1.718)   Char in 70-char image line (8x13) 
   5650000.0      2090000.0 (     0.370)     11400000.0 (     2.018)   Char in 60-char image line (9x15) 
   2000000.0       670000.0 (     0.335)      7780000.0 (     3.890)   Char16 in 40-char image line (k14) 
    823000.0       270000.0 (     0.328)      4470000.0 (     5.431)   Char16 in 23-char image line (k24) 
   8500000.0      3710000.0 (     0.436)      8250000.0 (     0.971)   Char in 80-char image line (TR 10) 
   2620000.0       983000.0 (     0.375)      3650000.0 (     1.393)   Char in 30-char image line (TR 24)

This is our old friend x11perfcomp, but slightly adjusted for a modern reality where you really do end up drawing billions of objects (hence the wider columns). This table lists the performance for drawing a range of different fonts in both poly text and image text variants. The first column is for Xephyr using software (fb) rendering, the second is for the existing Glamor GL_POINT based code and the third is the latest GL_QUAD based code.

As you can see, drawing points for every lit pixel in a glyph is surprisingly fast, but only about 1/3 the speed of software for essentially any size glyph. By minimizing the use of the CPU and pushing piles of work into the GPU, we manage to increase the speed of most of the operations, with larger glyphs improving significantly more than smaller glyphs.

Now, you ask how much code this involved. And, I can truthfully say that it was a very small amount to write:

 Makefile.am        |    2 
 glamor.c           |    5 
 glamor_core.c      |    8 
 glamor_font.c      |  181 ++++++++++++++++++++
 glamor_font.h      |   50 +++++
 glamor_priv.h      |   26 ++
 glamor_text.c      |  472 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.c |    2 
 8 files changed, 741 insertions(+), 5 deletions(-)

Let’s Start At The Very Beginning

The results of optimizing text encouraged me to start at the top of x11perf and see what progress I could make. In particular, looking at the current Glamor code, I noticed that it did all of the vertex transformation with the CPU. That makes no sense at all for any GPU built in the last decade; they’ve got massive numbers of transistors dedicated to performing precisely this kind of operation. So, I decided to see what I could do with PolyPoint.

PolyPoint is absolutely brutal on any GPU; you have to pass it two coordinates for each pixel, and so the very best you can do is send it 32 bits, or precisely the same amount of data needed to actually draw a pixel on the frame buffer. With this in mind, one expects that about the best you can do compared with software is tie. Of course, the CPU version is actually computing an address and clipping, but those are all easily buried in the cost of actually storing a pixel.

In any case, the results of this little exercise are pretty close to a tie — the CPU draws 190,000,000 dots per second and the GPU draws 189,000,000 dots per second. Looking at the vertex and fragment shaders generated by the compiler, it’s clear that there’s room for improvement.

The fragment shader is simply pulling the constant pixel color from a uniform and assigning it to the fragment color in this the simplest of all possible shaders:

uniform vec4 color;
void main()
{
       gl_FragColor = color;
}

This generates five instructions:

Native code for point fragment shader 7 (SIMD8 dispatch):
   START B0
   FB write target 0
0x00000000: mov(8)          g113<1>F        g2<0,1,0>F                      { align1 WE_normal 1Q };
0x00000010: mov(8)          g114<1>F        g2.1<0,1,0>F                    { align1 WE_normal 1Q };
0x00000020: mov(8)          g115<1>F        g2.2<0,1,0>F                    { align1 WE_normal 1Q };
0x00000030: mov(8)          g116<1>F        g2.3<0,1,0>F                    { align1 WE_normal 1Q };
0x00000040: sendc(8)        null            g113<8,8,1>F
                render ( RT write, 0, 4, 12) mlen 4 rlen 0      { align1 WE_normal 1Q EOT };
   END B0

As this pattern is actually pretty common, it turns out there’s a single instruction that can replace all four of the moves. That should actually make a significant difference in the run time of this shader, and this shader runs once for every single pixel.

The vertex shader has some similar optimization opportunities, but it only runs once for every 8 pixels — with the SIMD format flipped around, the vertex shader can compute 8 vertices in parallel, so it ends up executing 8 times less often. It’s got some redundant moves, which could be optimized by improving the copy propagation analysis code in the compiler.

Of course, improving the compiler to make these cases run faster will probably make a lot of other applications run faster too, so it’s probably worth doing at some point.

Again, the amount of code necessary to add this path was tiny:

 Makefile.am        |    1 
 glamor.c           |    2 
 glamor_polyops.c   |  116 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 glamor_priv.h      |    8 +++
 glamor_transform.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.h |   51 ++++++++++++++++++++++
 6 files changed, 292 insertions(+), 4 deletions(-)

Discussion of Results

These two cases, text and points, are probably the hardest operations to accelerate with a GPU and yet a small amount of OpenGL code was able to meet or beat software easily. The advantage of this work over traditional GPU 2D acceleration implementations should be pretty clear — this same code should work well on any GPU which offers a reasonable OpenGL implementation. That means everyone shares the benefits of this code, and everyone can contribute to making 2D operations faster.

All of these measurements were actually done using Xephyr, which offers a testing environment unlike any I’ve ever had — build and test hardware acceleration code within a nested X server, debugging it in a windowed environment on a single machine. Here’s how I’m running it:

$ ./Xephyr  -glamor :1 -schedMax 2000 -screen 1024x768 -retro

The one bit of magic here is the ‘-schedMax 2000’ flag, which causes Xephyr to update the screen less often when applications are very busy and serves to reduce the overhead of screen updates while running x11perf.

Future Work

Having managed to accelerate 17 of the 392 operations in x11perf, it’s pretty clear that I could spend a bunch of time just stepping through each of the remaining ones and working on them. Before doing that, we want to try and work out some general principles about how to handle core X fill styles. Moving all of the stipple and tile computation to the GPU will help reduce the amount of code necessary to fill rectangles and spans, along with improving performance, assuming the above exercise generalizes to other primitives.

Getting and Testing the Code

Most of the changes here are from Eric’s glamor-server branch:

git://people.freedesktop.org/~anholt/xserver glamor-server

The two patches shown above, along with a pair of simple clean up patches that I’ve written this week are available here:

git://people.freedesktop.org/~keithp/xserver glamor-server

Of course, as this now uses libepoxy, you’ll need to fetch, build and install that before trying to compile this X server.

Because you can try all of this out in Xephyr, it’s easy to download and build this X server and then run it right on top of your current driver stack inside of X. I’d really like to hear from people with Radeon or nVidia hardware to know whether the code works, and how it compares with fb on the same machine, which you get when you elide the ‘-glamor’ argument from the example Xephyr command line above.

Posted Fri Mar 7 01:12:28 2014
