When searching for general optimization techniques on the net you’re often told to keep memory bandwidth usage at all stages of the rendering pipeline as low as possible. This post is specifically about reducing the bandwidth used when streaming vertex attributes to the vertex shader to make them available there. I go into detail for a hypothetical OpenGL 3.x+ (“Desktop GL”) implementation of the changes, but it should be analogous in DirectX or OpenGL ES as well.
Your vertex shader will require a number of input attributes that depend on what the shader is doing specifically. In my example, I’m using attributes that are probably common for vertex shaders that set up meshes for the fragment/pixel shader. It has the following inputs:
- Position (vec3)
- Normal (vec3)
- Texture coordinates (vec2)
- Tangent of normal for normal mapping (vec3)
- Bitangent of normal for normal mapping (vec3)
These will probably be different for you. For example, many implementations don’t hand over tangents and bitangents as attributes, as it’s also possible to calculate them on the fly in the shader. You may also have additional attributes, such as per-vertex material IDs or something else that is required by your rendering pipeline. But these will do for the example.
For each distinct vertex, then, your graphics device somehow has to stream all necessary data for that vertex from far away memory to some memory that’s nearer to the shader execution units (how exactly this is done and what types of memory there are depends on the hardware). This naturally consumes memory bandwidth, which is limited, and there’s also a small amount of latency associated with actually fetching data from main memory. Note that I said “for each distinct vertex”. If you use indexed drawing and use a vertex multiple times, your hardware implementation probably caches the required data somewhere so it doesn’t need to be fetched from far away memory multiple times.
The problem then, simply put, is to somehow reduce the required memory bandwidth and memory fetches to improve performance of rendering.
Strategy 1: Reduce size of stored attributes
One way to reduce memory bandwidth is to reduce the size of the actual data that is stored in memory for the vertex attributes. Remember: Getting data to the vertex shader is usually done like this:
- Create vertex buffers using glGenBuffers(…) and bind them with glBindBuffer(GL_ARRAY_BUFFER,…)
- Fill those buffers with data (e.g. glBufferData(…))
- Before rendering, tell OpenGL to use (and how to use) that data with glVertexAttrib[ / I /L]Pointer()
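As a rough sketch of what “how to use that data” means for the example attributes, the following C++ computes the component count, stride and byte offset that would be handed to glVertexAttribPointer for a naive all-float, interleaved vertex. The names VertexF32, AttribLayout, kLayouts and layoutOf are made up for this illustration, and the actual GL call is only shown as a comment because it needs a live context.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical interleaved vertex holding the five example attributes,
// all stored as 32 bit floats (the naive layout).
struct VertexF32 {
    float position[3];
    float normal[3];
    float texCoord[2];
    float tangent[3];
    float bitangent[3];
};

// The per-attribute values that glVertexAttribPointer expects.
struct AttribLayout {
    int         components; // "size" parameter: number of components (1 to 4)
    std::size_t stride;     // bytes from one vertex to the next
    std::size_t offset;     // byte offset of the attribute inside one vertex
};

// Order: position, normal, texture coordinates, tangent, bitangent.
constexpr AttribLayout kLayouts[5] = {
    {3, sizeof(VertexF32), offsetof(VertexF32, position)},
    {3, sizeof(VertexF32), offsetof(VertexF32, normal)},
    {2, sizeof(VertexF32), offsetof(VertexF32, texCoord)},
    {3, sizeof(VertexF32), offsetof(VertexF32, tangent)},
    {3, sizeof(VertexF32), offsetof(VertexF32, bitangent)},
};

inline const AttribLayout& layoutOf(std::size_t attributeIndex) {
    assert(attributeIndex < 5);
    return kLayouts[attributeIndex];
}

// With a bound GL_ARRAY_BUFFER, attribute i would then be described as:
//   const AttribLayout& l = layoutOf(i);
//   glVertexAttribPointer(i, l.components, GL_FLOAT, GL_FALSE,
//                         (GLsizei)l.stride, (const void*)l.offset);
```

In this layout every vertex occupies 56 bytes, which is the number the rest of the post tries to shrink.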
You supply the data type of the stored vertex attribute to glVertexAttribPointer, and logically you could reduce the amount of memory bandwidth if you somehow managed to send, e.g., a GL_HALF_FLOAT instead of a GL_FLOAT to the shader. The memory bandwidth for that specific attribute would be cut in half, and the number of memory fetches would also go down, since a fetch doesn’t get a single attribute at a time, but a fixed number of bytes (e.g. one fetch could get 64 bytes out of memory, but the size of this depends on your hardware), and now more attributes fit “into one fetch”.
It’s important to note how the stuff specified on the client-language side (C++ etc.) actually arrives in the shader. Vertex attributes, as of OpenGL 4.4, can only be of these data types:
- 32 bit floats
- 64 bit floats (double)
- 32 bit int and uint
- vectors of the above types (e.g. vec3 for 3 32 bit floats, uvec2 for 2 unsigned integers, etc.)
- matrices of 32 bit floats
This means that even if you hand over shorts or bytes with the respective glVertexAttribPointer() calls, they will be converted to 32 bit floats before the execution of the shader (unnormalized, unless you set the normalized parameter), or to 32 bit integers if you use glVertexAttribIPointer(). glVertexAttribLPointer() is the exception: it only accepts GL_DOUBLE data and hands it to the shader as 64 bit doubles. As far as I know, this conversion is basically free in terms of performance.
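To make the difference between unnormalized and normalized conversion concrete, here is a small CPU-side sketch of the value a shader would see for an integer attribute. The function names are invented for this example, and the signed rule is the one used by OpenGL 4.2 and later (older versions used a slightly different mapping).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// normalized == GL_FALSE: the integer is simply converted, e.g. 200 -> 200.0f.
inline float unnormalizedToFloat(std::int32_t c) {
    return static_cast<float>(c);
}

// normalized == GL_TRUE, unsigned type with `bits` bits:
// maps [0, 2^bits - 1] to [0.0, 1.0].
inline float unormToFloat(std::uint32_t c, int bits) {
    assert(bits >= 1 && bits <= 16);
    const float maxValue = static_cast<float>((1u << bits) - 1u);
    return static_cast<float>(c) / maxValue;
}

// normalized == GL_TRUE, signed type with `bits` bits (OpenGL 4.2+ rule):
// maps [-(2^(bits-1) - 1), 2^(bits-1) - 1] to [-1.0, 1.0];
// the most negative value is clamped to -1.0.
inline float snormToFloat(std::int32_t c, int bits) {
    assert(bits >= 2 && bits <= 16);
    const float maxValue = static_cast<float>((1 << (bits - 1)) - 1);
    return std::max(static_cast<float>(c) / maxValue, -1.0f);
}
```

For example, an unsigned short of 65535 arrives as 65535.0 when unnormalized and as 1.0 when normalized.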
Solution 1: Use the smallest data type that fits your data
In my example application, all values of all attributes are 32 bit floats, because they were naively copied to vertex buffers from the memory that my model loader provides me with.
This is a waste, however, because not all data requires 32 bits of precision to carry its information. For example, for texture coordinates, the number of bits that you need only depends on the size of the biggest texture in your scene. A 16 bit (unsigned) integer can represent all numbers from 0 to 65535, which means that two 16 bit integers can correctly supply texture coordinates for textures of a maximum resolution of 65536×65536. For the vast majority of rendering applications (maybe with exceptions such as megatextures), this is more than enough. In your typical game, the largest texture you’re going to find is probably no bigger than 4096×4096.
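One possible way to act on the 16 bit integer argument is to store each texture coordinate as a normalized unsigned short. The sketch below assumes coordinates in the range [0, 1] (so tiling coordinates outside that range would be lost), and packUnorm16 is a made-up helper name. The packed values would be passed with type GL_UNSIGNED_SHORT and the normalized parameter set to GL_TRUE, so the shader still receives floats between 0 and 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize a texture coordinate from [0.0, 1.0] to a 16 bit unsigned integer.
// Values outside the range are clamped.
inline std::uint16_t packUnorm16(float v) {
    assert(!std::isnan(v));
    const float clamped = std::fmin(std::fmax(v, 0.0f), 1.0f);
    return static_cast<std::uint16_t>(std::lround(clamped * 65535.0f));
}
```

The worst-case rounding error of this scheme is half a step, i.e. 0.5/65535 of the texture size, which is far below one texel for a 4096×4096 texture.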
So for texture coordinates, instead of saving a lot of 32 bit floats into my vertex buffer, I just used 16 bit half floats and changed the type parameter in glVertexAttribPointer to GL_HALF_FLOAT instead of GL_FLOAT. I didn’t have to change anything in the vertex shader, as OpenGL handles the conversion for you.
Note: I use GLM, which has a “half” data type that represents 16 bit floating-point numbers. Usually these will not be a native part of your language (such as C++).
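If no half type is available, the conversion can also be done by hand. The following is a minimal sketch of a 32 bit float to 16 bit half conversion with round-to-nearest-even, assuming IEEE 754 floats; floatToHalf is a hypothetical helper, not GLM’s implementation. The resulting 16 bit values are what would be written into the vertex buffer for a GL_HALF_FLOAT attribute.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32 bit floats");

// Convert an IEEE 754 binary32 value to binary16 bits (round to nearest even).
inline std::uint16_t floatToHalf(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);

    const std::uint32_t sign     = (bits >> 16) & 0x8000u;
    const std::uint32_t exponent = (bits >> 23) & 0xFFu;
    std::uint32_t       mantissa = bits & 0x007FFFFFu;

    if (exponent == 0xFFu) {  // infinity or NaN
        return static_cast<std::uint16_t>(sign | 0x7C00u | (mantissa ? 0x0200u : 0u));
    }

    const int e = static_cast<int>(exponent) - 127 + 15;  // re-biased exponent
    if (e >= 0x1F) {  // too large for a half: becomes infinity
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    }

    if (e <= 0) {  // result is a subnormal half or zero
        if (e < -10) {
            return static_cast<std::uint16_t>(sign);  // underflows to +/- 0
        }
        mantissa |= 0x00800000u;  // make the implicit leading 1 explicit
        const int shift = 14 - e;  // between 14 and 24
        std::uint32_t half = mantissa >> shift;
        const std::uint32_t remainder = mantissa & ((1u << shift) - 1u);
        const std::uint32_t halfway   = 1u << (shift - 1);
        if (remainder > halfway || (remainder == halfway && (half & 1u))) {
            ++half;
        }
        return static_cast<std::uint16_t>(sign | half);
    }

    // Normal half: keep the top 10 mantissa bits and round on the dropped 13.
    std::uint32_t half = (static_cast<std::uint32_t>(e) << 10) | (mantissa >> 13);
    const std::uint32_t remainder = mantissa & 0x1FFFu;
    if (remainder > 0x1000u || (remainder == 0x1000u && (half & 1u))) {
        ++half;  // a carry into the exponent field is the correct result
    }
    return static_cast<std::uint16_t>(sign | half);
}
```

A half only has 10 explicit mantissa bits, which is the reason some attributes survive the conversion visually unchanged while others do not.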
Fiddling around with different combinations, I found that 16 bit data types were enough (as in: No difference in the output images) for texture coordinates, normals, tangents and bitangents. Changing vertex positions to 16 bit caused positioning artifacts, as some positions in the used model (Crytek Sponza) shifted a little due to lower precision. Depending on your other attributes, some may do fine with 8 bit data types as well (for example, vertex colors or material IDs).
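To put numbers on that combination, this small sketch computes the per-vertex size for the five example attributes, once with everything as 32 bit floats and once with only the position kept at 32 bits. The helper name vertexBytes is made up, and alignment padding (which some drivers may prefer, e.g. rounding each attribute up to a multiple of 4 bytes) is deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kFloatBytes = 4;  // GL_FLOAT
constexpr std::size_t kHalfBytes  = 2;  // GL_HALF_FLOAT

// Bytes per vertex for: position (3 components) stored with `positionBytes`
// per component, plus normal (3), texture coordinates (2), tangent (3) and
// bitangent (3) stored with `otherBytes` per component.
constexpr std::size_t vertexBytes(std::size_t positionBytes, std::size_t otherBytes) {
    return 3 * positionBytes + (3 + 2 + 3 + 3) * otherBytes;
}

constexpr std::size_t kAllFloatVertex = vertexBytes(kFloatBytes, kFloatBytes);
constexpr std::size_t kMixedVertex    = vertexBytes(kFloatBytes, kHalfBytes);
```

Under these assumptions a vertex shrinks from 56 to 34 bytes, so roughly 40% less data has to be streamed per vertex.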
Testing this on my laptop with a GTX 555m, using the Sponza scene (which has about 200k vertices) and a straightforward Blinn-Phong lighting model plus standard shadow mapping, the frame rate went from about 50 to about 80 FPS, so it was definitely worth it.
Strategy/Solution 2: Don’t use interleaved vertex attributes
There are essentially two ways to store vertex data: Interleaved and planar. Interleaved means that different vertex attributes in the vertex buffer are not separated by their type, but are instead mixed to follow in a repeating pattern one after the other. For example, for positions (P), normals (N), and texture coordinates (T), interleaved would be
P1 N1 T1 P2 N2 T2 P3 N3 T3 …
And planar would be
P1 P2 P3 … N1 N2 N3 … T1 T2 T3 …
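As a rough sketch of how little code the switch requires, the following C++ splits an interleaved vertex array into planar arrays. The types Vertex and PlanarMesh and the function toPlanar are invented for this illustration and use float attributes only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved: one struct per vertex, attributes follow each other in memory.
struct Vertex {
    float position[3];
    float normal[3];
    float texCoord[2];
};

// Planar: one tightly packed array per attribute.
struct PlanarMesh {
    std::vector<float> positions;  // 3 floats per vertex
    std::vector<float> normals;    // 3 floats per vertex
    std::vector<float> texCoords;  // 2 floats per vertex
};

inline PlanarMesh toPlanar(const std::vector<Vertex>& interleaved) {
    PlanarMesh out;
    out.positions.reserve(interleaved.size() * 3);
    out.normals.reserve(interleaved.size() * 3);
    out.texCoords.reserve(interleaved.size() * 2);
    for (const Vertex& v : interleaved) {
        out.positions.insert(out.positions.end(), v.position, v.position + 3);
        out.normals.insert(out.normals.end(), v.normal, v.normal + 3);
        out.texCoords.insert(out.texCoords.end(), v.texCoord, v.texCoord + 2);
    }
    assert(out.positions.size() == interleaved.size() * 3);
    return out;
}
```

Each planar array can then go into its own buffer (or its own region of one buffer), and the stride passed to glVertexAttribPointer becomes 0, which OpenGL interprets as tightly packed.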
When using the two different versions with shaders that used all attributes (such as mesh rendering), the performance of both versions was identical. However, there was a slight speedup on my PC (GTX 770) when used with a shader that only required a subset of the data. Specifically, my shadow map renderer only needs positions and normals, and the performance there went from 210 to 220 FPS. The difference would probably be larger in an application that was memory bandwidth bound (which mine wasn’t with the Crytek Sponza scene). But even with that tiny change in speed (210 to 220 FPS is about 0.2 ms per frame, or in other terms, not even enough to go from 60 to 61 FPS), you have to consider that the change is practically free (not a lot of code changes required).
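Since FPS differences are hard to compare at different frame rates, it helps to convert them to time per frame. A minimal sketch, with made-up helper names:

```cpp
#include <cassert>

// Time spent on one frame, in milliseconds, at a given frame rate.
inline double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Milliseconds saved per frame when going from fpsBefore to fpsAfter.
inline double savedMs(double fpsBefore, double fpsAfter) {
    return frameTimeMs(fpsBefore) - frameTimeMs(fpsAfter);
}
```

Calling savedMs(210.0, 220.0) and savedMs(60.0, 61.0) shows how the same absolute time saving translates into very different FPS gains depending on the starting frame rate.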
The question is: Why would planar format be faster when used with shaders that don’t require all attributes?
Well, I have no proof, but I assume it’s because of the way data is fetched from memory. As already mentioned, when your graphics hardware accesses memory, it doesn’t just pull out single floats/integers, or even worse, single bytes. Instead it operates in units of so-called “cache lines”, and pulls a large number of bytes out of memory with a single fetch.
So for example, if that cache line size was 64 bytes and you used only positions out of positions, normals and texture coordinates (3×4 bytes, 3×2 bytes and 2×2 bytes = 22 bytes per vertex) in an interleaved format, the GPU would pull roughly 3 positions in a single fetch and discard the rest of the data (since you don’t need it). If you used a planar format, where all position attributes are next to each other in memory, the GPU would fetch about 5 position attributes per memory access. Because a decent chunk of memory bandwidth is wasted in the first case (and there’s probably also more memory latency due to more fetches, even in spite of the latency hiding mechanisms that GPUs use), the planar layout performs a little better.
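The arithmetic behind those numbers can be sketched as follows. This is a deliberately simplified model: the 64 byte line size is the assumed value from the example, alignment and real cache behavior are ignored, and the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Average number of vertices whose wanted attribute is covered by one fetch,
// when consecutive copies of that attribute are `strideBytes` apart.
inline double elementsPerFetch(std::size_t lineBytes, std::size_t strideBytes) {
    assert(strideBytes > 0);
    return static_cast<double>(lineBytes) / static_cast<double>(strideBytes);
}

// Fraction of the fetched bytes that the shader actually uses.
inline double usefulFraction(std::size_t attributeBytes, std::size_t strideBytes) {
    assert(strideBytes > 0 && attributeBytes <= strideBytes);
    return static_cast<double>(attributeBytes) / static_cast<double>(strideBytes);
}
```

With a 12 byte position, the interleaved case uses a stride of 22 bytes and the planar case a stride of 12 bytes, so elementsPerFetch(64, 22) gives about 2.9 and elementsPerFetch(64, 12) about 5.3, while usefulFraction(12, 22) shows that only about 55% of the fetched bytes are used in the interleaved layout.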
This might depend on the used rendering hardware, but on my PC (GTX 770) and laptop (GTX 555m) the results were the same. The results might differ drastically for different architectures, such as consoles or mobile hardware, but somehow I doubt it (can’t think of a reason why it would be different there).
There are other strategies to further reduce memory and bandwidth usage for vertex attributes. For example, as mentioned, it is possible to compute normal tangents and bitangents in the shader as opposed to calculating them “offline” and streaming them to the shader. This might be faster since, usually, GPU ALUs are obscenely fast, while memory bandwidth and size haven’t grown at nearly the same rate. This would free up memory bandwidth and size for other rendering passes which dearly need them (such as lighting passes in deferred shading). If you don’t need the extra normal precision, normals may be stored in an 8 bit format. Texture coordinates can be stored in an 8 bit format as well for meshes which only use textures of a size smaller than 256×256.
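For the 8 bit normal idea, a minimal sketch could look like this. It assumes the components of a unit normal lie in [-1, 1] and would be uploaded as GL_BYTE with the normalized parameter set to GL_TRUE. packSnorm8 and unpackSnorm8 are made-up names, and unpackSnorm8 only mimics on the CPU what the shader would receive under the OpenGL 4.2+ conversion rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize one normal component from [-1.0, 1.0] to a signed 8 bit integer.
inline std::int8_t packSnorm8(float v) {
    assert(!std::isnan(v));
    const float clamped = std::fmin(std::fmax(v, -1.0f), 1.0f);
    return static_cast<std::int8_t>(std::lround(clamped * 127.0f));
}

// CPU-side equivalent of the normalized signed conversion.
inline float unpackSnorm8(std::int8_t c) {
    return std::max(static_cast<float>(c) / 127.0f, -1.0f);
}
```

The round trip loses at most half a quantization step (0.5/127, roughly 0.004) per component, and the shader should re-normalize the vector afterwards. Whether that precision is enough has to be checked visually for the scene at hand.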
But alas, I have not found the time to implement and test further changes. Those might be the subject of future blog posts.
Using smaller data types for appropriate data improved rendering performance on my test hardware (GTX 555m) with the test scene (Crytek Sponza) from 50 to about 80 FPS (~60% increase). No difference was measured on the beefy machine (GTX 770), because the bottlenecks there were in way different places.
Using a non-interleaved (planar) vertex attribute layout in memory gave a small improvement for shaders that only read a subset of the attributes: from 210 to 220 FPS on the GTX 770, and about 0.6 ms per frame on the GTX 555m.