Sunday, July 20, 2014

Super Hexagon: Measuring and optimizing PC rendering latency


Maintaining low input latency is crucial to delivering an enjoyable user experience for PC games. In this post, ‘Super Hexagon’ is used as an example to show how latency caused by the OS/graphics stack can be measured in games using GPUView and minimized through careful submission of rendering work.

Finally, a proof-of-concept patch is provided at the end that modifies the game’s Windows API usage to significantly reduce its latency.

Background

In September 2012, the game Super Hexagon was released for iOS. Its addictive twitch-reflex gameplay and awesome soundtrack won it widespread praise, and it remains on my phone to this day.

"Hardestestest" is accurate

A few months later, the same game was ported to the PC. Given the low price, I was happy to double-dip to be able to play it through again using keyboard input. As I started playing, however, the game didn’t seem as responsive as I remembered. Searching online, others seemed to have the same issue:


Disabling V-Sync fixes the issue, at the cost of introducing screen tearing.

Recently, using publicly available tools, I analyzed the game’s execution and found that the game’s usage of (and the inaccuracy of) the Win32 Sleep API adds up to 2 extra frames, or 33.3ms, of latency when V-Sync is enabled.

In this post, I’ll describe how to examine the behavior and timing of the Windows graphics stack, show why Super Hexagon's API usage results in additional latency, and offer a patch that changes its behavior to fix the issue. A follow-up post will go into detail on how the executable was modified without source access and cover the exact behavior the patch uses to achieve optimal latency.

The goal is to provide information for game developers on how to examine the interactions between their game and the OS (to ensure that the game's timing/synchronization behavior is intended). Similarly, the post will show end-users how to measure latency in their purchased games (and the impact of enabling/disabling V-Sync, for example).

Defining Input Latency
Input latency for games can be defined as the time between a user taking an action and the result of that action showing up on screen. If latency becomes too high, a game can feel unresponsive and become frustrating to play.

There are many factors that can add latency:

  1. External factors - e.g. LCD Monitor pixel response time, input peripheral update speed, how long before the OS processes the input and notifies the game, ...
  2. Internal factors - How long the game takes to process the input before updating the world state (e.g. time spent performing physics calculations)
  3. OS/Graphics Stack factors - Is there a delay between the game calling Direct3D/OpenGL APIs and the graphics hardware actually processing the request? What about presenting the finished frame to the screen?
  4. Anything else?

This post will focus on the OS/Graphics Stack factors, because games can influence them (unlike external factors) and because they are more of a black box and sometimes non-intuitive (unlike internal factors).


Measuring OS/Graphics Stack latency
A game draws a new frame first by calling Direct3D/OpenGL rendering APIs, and displays it by calling IDXGISwapChain::Present (or SwapBuffers for OpenGL). More explicitly, a high-level list of what occurs each frame is as follows:

For rendering the frame:
  1. Game calls Direct3D/OpenGL rendering APIs.
  2. The graphics user-mode driver (UMD) receives these calls (directly in OpenGL's case, or from the Direct3D runtime in Direct3D's case) and begins generating a command buffer containing the actual hardware-specific machine code that corresponds to the game's rendering API calls.
  3. At the UMD's election (or in response to a Present/SwapBuffers/Flush command by the game), it instructs the OS graphics scheduler to execute the command buffer.
  4. The OS graphics scheduler stores the command buffer in a queue for that particular API device/process.
  5. Once the following criteria are satisfied, the OS scheduler ensures all referenced resources are resident in GPU memory, patches the command buffer to replace any resource references with the actual physical addresses (only if needed; GPUs/drivers that support virtual addressing skip this step), and instructs the kernel-mode driver (KMD) to submit the command buffer to the actual GPU hardware queue. The criteria are (at least):
    1. The command buffer has reached the front of the queue for that particular API device/process.
    2. The swap chain back buffer becomes available (is not being read by the display).
    3. The OS scheduler chooses to allocate execution time to the game (command buffers from apps such as the Desktop Window Manager in Windows always have a higher priority)
  6. The command buffer sits in the GPU hardware queue (typically only two deep) until it reaches the front and is executed by the GPU. The outputted frame data is stored in a 'primary' surface, that is, a surface eligible (in terms of memory layout/location) to be scanned out directly by the display hardware.
For displaying the frame:
  1. Once the game instructs the OS to present the frame (via IDXGISwapChain::Present or SwapBuffers), the OS waits until the GPU hardware signals that the work associated with that frame has completed.
  2. When the OS is signaled, it either:
    1. If V-Sync is disabled, it instructs the KMD to program the display subsystem to immediately begin scanning out from the new frame's physical address in memory.
    2. If V-Sync is enabled, it instructs the KMD to enqueue a command to begin scanning out from the new frame's physical address at the next VBlank.
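The display step above can be sketched with a rough timing model (the function and constant names are mine, and the model ignores queue depth and scheduling delays):

```python
import math

VBLANK_MS = 1000 / 60  # refresh period at 60 Hz, ~16.67 ms

def scanout_start(submit_ms, gpu_ms, vsync=True):
    """Rough model of when a frame submitted at submit_ms, taking
    gpu_ms to execute on the GPU, begins scanning out to the display."""
    done_ms = submit_ms + gpu_ms
    if not vsync:
        # The flip is programmed immediately, at the cost of tearing.
        return done_ms
    # With V-Sync, the flip is queued and takes effect at the first
    # VBlank at or after the moment the frame completes.
    return math.ceil(done_ms / VBLANK_MS) * VBLANK_MS
```

For example, a frame that finishes 2ms into a refresh interval still waits roughly 14.7ms for the next VBlank when V-Sync is on.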

In the case of Super Hexagon, we want to measure the time taken for both of these steps, without access to the game’s source code. Fortunately, Windows provides a very powerful tool for analyzing the execution of an app in the context of the overall system: GPUView.

GPUView is really two things: 1) a script “log.cmd” which instructs XPerf to record certain events provided by the NT and Graphics kernels, and 2) a GUI (gpuview.exe) to interpret the recorded logs into a graphical representation of the system’s state at any time. Basic usage of GPUView (starting/stopping logging, opening traces) can be found here: http://msdn.microsoft.com/en-us/library/windows/desktop/jj585574(v=vs.85).aspx

For Super Hexagon, I took a quick 9 second trace (link at end of post). Here’s what that looks like in GPUView:

GPUView only shows threads that it thinks are 'interesting' by default. Press '+' to display more threads.

You can see game execution begins about a fifth of the way through the trace (where thread 1572 is created). Shortly thereafter, rendering work begins (where the OpenGL driver's background thread [5412] is created). Near the bottom part of the screen, you can see rendering work (command buffers generated by the OpenGL user-mode driver) from the game in the CPU Queue.

If you look closely, though, there appears to be some sort of periodic activity in the queue. By pressing F8 to display each VBlank interrupt and then zooming in, we can get a better picture of how an individual frame is rendered.

The blue vertical lines represent Vertical Blank (VBlank) interrupts, which occur every 16.7ms at 60 Hz.

In GPUView, there are three distinct areas of the interface:

  1. The GPU Hardware Queue shows what command buffers have been submitted to the GPU by the OS scheduler, and which, if any, the GPU is executing at any given time. Only the single command buffer at the front of the queue can be executed at any time.
  2. The Flip Queue shows what Present commands have been queued and are pending execution (e.g. being scanned out to the display). Solid orange means work associated with that Present still remains, while the cross-hatch pattern means all rendering has completed and the hardware has already been programmed to begin reading from the surface at the next VBlank.
  3. CPU Processes, where each thread in a process is displayed separately. A colored block (such as the pink blocks in SuperHexagon.exe) means that thread is currently active on the CPU (each color corresponds to a particular CPU core).
    1. The 'Context CPU Queue' shows all command buffers that have been generated by the app's calling of 3D rendering APIs, such as Direct3D and OpenGL. The command buffers contain hardware-specific rendering commands that correspond to the specific API usage, and are generated by the graphics user-mode driver (UMD).
    2. In Super Hexagon's case, there are two command buffers per frame for this particular driver. The orange block represents the Present command, which is executed once all rendering work for the frame (the two command buffers) has been completed.

As shown in the annotated diagram below, there is a general flow to Super Hexagon’s execution. The main game thread wakes up at a certain time (the reason behind the exact timing will be discussed later), updates the game state, and calls OpenGL rendering APIs; the graphics user-mode driver processes those commands on a background thread and generates a command buffer; the GPU executes that command buffer; and the result is eventually sent to the display.


At this point in the trace, we can see things aren’t quite ideal – there is a gap between when the app calls the OpenGL APIs and when the hardware actually executes them, and then another gap before the resulting rendered image is sent to the display. But, altogether, the start-to-finish time is no worse than about 1.5 frames (~25ms).

Looking elsewhere in the trace, we can see the worst-case situation:


Here, there is a 2.75 frame, or 46ms, gap between when the app reads the input, and when the result begins to be scanned out to the screen.

A couple things are wrong here: the app wakes up, but the graphics driver takes roughly a frame to generate the corresponding command buffer. Once the command buffer is generated, it isn't scheduled by the OS graphics scheduler until the beginning of the next frame, at which point it finally begins to be displayed after waiting nearly 16.7ms for the next VBlank to occur.

Problem
The basic problem is that the app is submitting rendering work faster than the GPU is able to execute it. The first screenshot shows the app only a frame ahead of the GPU, while the second shows it two frames ahead.

Particularly, the UMD appears to busy-wait to avoid getting too far ahead of the GPU (the exact reason is unclear; it spins while calling into gdi32.dll). Similarly, the OS scheduler then waits for the primary surface to finish being scanned out by the display before submitting the command buffer to the GPU.

What causes the app to get ahead of the GPU? As part of the log.cmd script, GPUView instructs the OS to record the call stack each time a thread's execution is resumed, which reveals what function caused the thread to be suspended by the NT kernel. Examining the call stacks in GPUView shows that the main app thread is waiting on the Win32 Sleep API inside of GLUT’s (an OpenGL framework the game uses) idleWait callback.
Click the small circles below the rectangular blocks to view the callstack at that time. These ETW events (and others) can also be manually browsed from Tools -> Event List.

However, measuring closely, the app is only sleeping for ~15.6ms each frame, causing the app to slowly run ahead of the GPU, which is only retiring frames every 16.7ms (60 Hz).

When debugging, as the app slowly runs further and further ahead of the GPU, the app calls the Sleep API with increasingly large values.

Unfortunately, Sleep is not a very precise API. As its MSDN documentation states, calling Sleep with a value below the system’s timer period can cause the function to wait up to a full period, while calling Sleep with a value slightly greater than the period can cause it to wait either one or two ticks.
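This rounding can be modeled with a couple of lines (a simplification; the real wakeup also depends on where within the current tick the Sleep call lands):

```python
import math

TICK_MS = 15.625  # default Windows timer period (64 Hz)

def approx_sleep_duration(requested_ms, tick_ms=TICK_MS):
    """Approximate actual duration of Sleep(requested_ms): the thread
    only becomes runnable on a timer tick, so the wait rounds up to a
    whole number of ticks."""
    if requested_ms <= 0:
        return 0.0
    return math.ceil(requested_ms / tick_ms) * tick_ms
```

Under this model, Sleep(1) through Sleep(15) all wait one full 15.625ms tick, while Sleep(16) waits two (31.25ms).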

Windows’ timer frequency defaults to 64Hz, or once every 15.625ms. Thus, calling Sleep with low values, as the app does, causes the thread to sleep for 15.625ms. The app itself only takes about 0.3ms to process a frame on my machine, meaning the app wakes up about once every 15.96ms. This causes the app to slowly run ahead of the GPU, which is retiring frames at a rate of 60 Hz, or once every 16.67ms.

Eventually, as the app gets further and further ahead of the GPU, the app uses larger and larger values for Sleep in an attempt to delay execution, until it finally calls Sleep(16). This is greater than the timer period, so the system sleeps for 15.625 * 2 = 31.25 ms, and the app goes back to being only moderately ahead of the GPU. The worst-case latency, shown earlier, occurs just before the app calls Sleep(16).
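The arithmetic behind this drift-and-correct cycle can be checked directly (numbers are approximate, taken from the trace above):

```python
FRAME_MS = 1000 / 60  # display retires one frame every ~16.67 ms
WAKE_MS = 15.96       # observed wakeup period: one 15.625 ms tick plus ~0.3 ms of work

# How much the app gains on the display each frame, and how many
# frames until it has drifted a whole frame ahead:
drift_per_frame = FRAME_MS - WAKE_MS                 # ~0.71 ms
frames_until_one_ahead = FRAME_MS / drift_per_frame  # ~24 frames

# The eventual Sleep(16) rounds up to two timer ticks, handing back
# almost a full frame in one go:
correction_ms = 2 * 15.625 - FRAME_MS                # ~14.6 ms
```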

Solution
Ideally, the app would wake up exactly once every monitor VBlank. Unfortunately, as shown, the Sleep API is not accurate enough by default for this.

One solution would be for the app to raise Windows’ timer frequency as high as possible (1000 Hz, i.e. a 1ms period, via the timeBeginPeriod API), and to use a high-resolution timestamp like QueryPerformanceCounter to measure (and then correct for) any remaining error.
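A minimal portable sketch of that hybrid approach, with time.sleep standing in for the Win32 Sleep and time.perf_counter for QueryPerformanceCounter (on Windows the app would also raise the timer resolution first):

```python
import time

def wait_until(deadline_s, spin_margin_s=0.002):
    """Sleep coarsely until close to the deadline, then busy-wait on a
    high-resolution clock to absorb the sleep's remaining error."""
    while True:
        remaining = deadline_s - time.perf_counter()
        if remaining <= 0:
            return
        if remaining > spin_margin_s:
            time.sleep(remaining - spin_margin_s)
        # else: spin until the deadline passes
```

A frame loop would then advance the deadline by 1/60 s each iteration and call wait_until before polling input and rendering.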

Another (potentially better) solution would be to rely on built-in Windows presentation APIs for throttling execution to once a frame, specifically DXGI. DXGI is a set of low-level APIs most commonly associated with Direct3D (an IDXGIDevice maps one-to-one to an ID3D11Device). However, DXGI contains a number of APIs that can be used independently of Direct3D, particularly IDXGIOutput::WaitForVBlank. WaitForVBlank causes the calling thread to block until the graphics adapter signals to the OS that the monitor has begun scanning out the next frame (on the next VBlank interrupt). This usage of interrupts is far more accurate than the timer-based Sleep.

To test this approach, the game’s executable was patched to call a combination of WaitForVBlank and Sleep instead of just Sleep. With that change, as shown below, the app wakes up precisely once each monitor VBlank, sleeps halfway through the frame, receives input events, and then submits rendering work, which is immediately executed and sent to the display.


Because the game's rendering workload is so quick, the additional Sleep causes the app to wait as long as possible in the frame before polling input messages. This reduces latency to the bare minimum, though this might not be appropriate for games with more demanding workloads.
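The benefit of polling input as late as possible in the frame can be estimated with the same sort of arithmetic (a rough model; poll_offset_ms is my name for how far into the refresh interval the app wakes to read input):

```python
import math

FRAME_MS = 1000 / 60

def input_to_scanout_ms(poll_offset_ms, work_ms):
    """Rough input-latency model with V-Sync on: input is read at
    poll_offset_ms into a refresh interval, the frame takes work_ms
    (CPU + GPU) to finish, and is displayed at the next VBlank."""
    done_ms = poll_offset_ms + work_ms
    scanout_ms = math.ceil(done_ms / FRAME_MS) * FRAME_MS
    return scanout_ms - poll_offset_ms
```

Polling right after the VBlank gives ~16.7ms of latency, while waking ~15ms into the interval cuts it below 2ms, as long as the frame's work reliably fits before the VBlank.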

Conclusion
The most important takeaway is that profiling is crucial for understanding game performance and behavior. Much of Windows’ presentation scheduling can be controlled by the app, but the exact effect of calling certain APIs isn’t always clear. Unfortunately, trial and error seems almost always necessary to achieve the desired result.

Please try the patch for yourself - I’ve posted the modifications to the executable as a dll, along with a patch to the executable that causes it to load the modified DLL. The patch can be applied with bspatch, linked below and also included in the zip. In the next post I’ll go into detail about how the DLL actually modifies the existing executable’s functionality without source access.

Please leave comments on anything I’ve missed or any errors in the post, the corrections would be appreciated! I’ve also posted the ETL files the screenshots are taken from below, which can be opened with GPUView.


...And of course a special thanks to Terry Cavanagh for making such a great game in the first place!


Links
Super Hexagon patch: https://s3.amazonaws.com/bennettsorbo-static/20140720/superhexagon_latencypatch_v100.zip
GPUView package: http://www.microsoft.com/en-eg/download/details.aspx?id=39982 (included in the Windows Performance Toolkit, which is part of the Windows Assessment and Deployment Kit).
