Memory Management Part 4: Allocator Overhead, Waste and Fragmentation
As seen in my previous post on allocation tracking, tracking memory allocations is an essential tool for any medium to large-scale game engine. Every AAA game I’ve worked on over the last 15 years has used similar techniques to track and report memory allocations.
But tracking memory allocations alone is not enough. Under the hood, most allocators work by allocating memory pages (usually 4KB or 64KB pages, sometimes 1MB pages) from the OS and then splitting them into smaller chunks, based on the allocator’s rules and policies. Managing those pages and chunks requires some memory for the allocator itself, resulting in a certain amount of memory not reported by our tracking system.
Let’s take a look at what types of memory a memory allocator can use.
Overhead
Most allocation schemes require a per-chunk header (usually between 8 and 24 bytes) to track the size of the allocation and/or some flags and pointers to other chunks. These headers are usually placed at the beginning of the memory chunk, so the allocator can easily find them when a pointer is passed to free or realloc by stepping back a few bytes.
Some allocators, especially fixed-size memory allocators, use a per-page header (instead of a per-chunk one), usually at the beginning or the end of the page, reducing overhead. Given a particular pointer, they simply align the address down to the OS page size and get the header for that page, which contains the chunk size, the free-list pointer, etc.
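The per-chunk header scheme can be sketched as follows. This is a hypothetical 16-byte header for illustration, not any particular allocator’s layout; the key point is how free/realloc recover the header by stepping back from the user pointer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical 16-byte per-chunk header stored just before the user pointer.
struct ChunkHeader {
    uint32_t size;      // usable size of the chunk
    uint32_t flags;     // e.g. in-use bit, bin index
    ChunkHeader* next;  // link to the next chunk (e.g. free-list)
};

// Given the pointer handed back to the user, step back to find the header,
// exactly as free()/realloc() would.
inline ChunkHeader* header_from_user_ptr(void* user) {
    return reinterpret_cast<ChunkHeader*>(
        static_cast<std::byte*>(user) - sizeof(ChunkHeader));
}

// Allocate one chunk with its header in front (illustration only; a real
// allocator would carve this out of an OS page, not call malloc).
inline void* alloc_chunk(uint32_t usable_size) {
    auto* h = static_cast<ChunkHeader*>(
        std::malloc(sizeof(ChunkHeader) + usable_size));
    h->size = usable_size;
    h->flags = 1;        // mark in use
    h->next = nullptr;
    return h + 1;        // user memory starts right after the header
}
```

Note that every allocation pays the `sizeof(ChunkHeader)` tax, which is exactly the overhead discussed above.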
The overhead grows with the number of allocations. For example, if a subsystem is doing a lot of small allocations (especially when the allocation size is near or smaller than the per-allocation overhead), the overhead can account for a huge percentage of the subsystem’s memory usage. Ideally, such subsystems should allocate bigger chunks of memory and split them themselves.
Waste
Some space may also be wasted in order to satisfy the requirements of the allocator. Here are some examples:
- Depending on the required alignment of the returned chunk, some space may be wasted at the beginning of the chunk to satisfy that alignment.
- The allocator’s realloc policy may decide to ignore a reallocation entirely if the new size is smaller than the currently allocated chunk, for optimization reasons, based on some heuristics (if the new size is not smaller than half the current size, for instance).
- Fixed-size allocators will return blocks larger than requested. For example, an allocation of 18 bytes could return a 32-byte chunk, if that’s the smallest size that satisfies the request, resulting in 14 bytes of waste.
As with overhead, doing a lot of small allocations can result in a lot of waste, especially when allocating less than the allocator’s minimum chunk size.
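The fixed-size case is easy to quantify. Here is a small sketch assuming hypothetical power-of-two size classes with a 16-byte minimum (real allocators often use finer-grained class tables):

```cpp
#include <cstddef>

// Hypothetical power-of-two size classes (16-byte minimum), as a
// fixed-size allocator might use. Returns the chunk size actually
// handed out for a given request.
inline std::size_t size_class(std::size_t requested) {
    std::size_t c = 16;               // assumed minimum chunk size
    while (c < requested) c <<= 1;
    return c;
}

// Bytes wasted inside the chunk for a given request.
inline std::size_t internal_waste(std::size_t requested) {
    return size_class(requested) - requested;
}
```

With these classes, the 18-byte request from the example above lands in the 32-byte class and wastes 14 bytes, a 78% overhead on that allocation.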
Fragmentation
Inevitably, splitting the OS pages into smaller chunks results in memory fragmentation. Let’s take a look at this memory report for a single 64KB page:
Each block represents a 256-byte chunk. Depending on the allocator, chunks can be split and coalesced on demand to provide smaller or bigger chunks of memory. As we can see, there are 133 free blocks (for a total of ~33KB free), but the largest contiguous block of free memory is only ~5KB (5376 bytes).
As a result, any allocation larger than 5376 bytes (minus the allocator’s required overhead) will fail in this page, even though there is approximately 33KB free. This is what we call fragmentation.
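A scan like the one behind this report can be sketched as follows. This is a hypothetical helper (not the engine’s actual report code) that walks an occupancy map of fixed-size chunks:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct FreeSpace {
    std::size_t total_free;     // sum of all free chunks, in bytes
    std::size_t largest_block;  // largest contiguous free run, in bytes
};

// Scan an occupancy map of fixed-size chunks (true = in use) and report
// total free memory versus the largest allocation that could actually
// succeed; the gap between the two numbers is the fragmentation.
inline FreeSpace scan_page(const std::vector<bool>& used,
                           std::size_t chunk_size = 256) {
    std::size_t free_chunks = 0, run = 0, best = 0;
    for (bool in_use : used) {
        if (in_use) {
            run = 0;
        } else {
            ++free_chunks;
            best = std::max(best, ++run);
        }
    }
    return { free_chunks * chunk_size, best * chunk_size };
}
```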
Here are some real examples that I came across during the development of the last Xbox 360 game I worked on.
Page Size Allocations
One of our subsystems was temporarily allocating a lot of 64KB blocks for disk streaming. At first, you might ask why this causes fragmentation, since each allocation should use exactly one OS page. The answer depends on the underlying allocator; in our case, the allocator was adding a 24-byte header to every allocation, with the result that every allocation was in fact taking two OS pages! The first page was completely used, while the second one had only 24 bytes used.
Between the time these pages were allocated and freed, the empty space in each of those second pages was used by other allocations. Then, when the temporary buffers were released by the subsystem, those second pages couldn’t be freed since other allocations still lived in them. The result was sometimes between 1 and 2MB of fragmented free memory inside these pages.
The solution was to directly allocate OS pages using VirtualAlloc instead, without going through our game engine’s allocators. As a side note, VirtualAlloc was wrapped through a global function to allow tracking.
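The wrapper can be sketched as follows. On the Xbox 360 (and Windows) the body would call VirtualAlloc/VirtualFree; here, std::aligned_alloc stands in so the sketch stays portable, and the counter shows where the tracking hook goes:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Global wrapper around OS page allocation so the tracking system still
// sees these allocations. On Windows/Xbox this would call VirtualAlloc;
// std::aligned_alloc stands in here as a portable sketch.
static std::atomic<std::size_t> g_page_bytes{0};
static const std::size_t kPageSize = 64 * 1024;

inline void* AllocPages(std::size_t bytes) {
    // Round the request up to a whole number of pages.
    std::size_t rounded = (bytes + kPageSize - 1) / kPageSize * kPageSize;
    void* p = std::aligned_alloc(kPageSize, rounded);
    if (p) g_page_bytes += rounded;   // hook for the tracking system
    return p;
}

inline void FreePages(void* p, std::size_t bytes) {
    std::size_t rounded = (bytes + kPageSize - 1) / kPageSize * kPageSize;
    g_page_bytes -= rounded;
    std::free(p);
}
```

Because the streaming buffers now map one-to-one onto OS pages, no header shares a page with them and the two-pages-per-allocation problem disappears.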
Dynamic Arrays
Depending on the dynamic array implementation, adding elements can result in a lot of reallocations. For example, if during the loading process you need to populate a list of all active lights, one might simply add a call to Level->ActiveLights.Add(this) to the light’s loading function. If you have thousands of lights, this means the dynamic array may need to reallocate itself hundreds of times during the process, leaving bigger and bigger holes scattered all over the OS pages.
A simple solution is to defer this step to the end of the loading process: count the number of active lights, resize your dynamic array once, then add all the active lights to it. This can greatly reduce fragmentation.
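With std::vector, the count-then-reserve pattern looks like this (the Light type and field names are made up for the example):

```cpp
#include <cstddef>
#include <vector>

struct Light { int id; bool active; };   // hypothetical light record

// Deferred version: count once, reserve once, then fill. The single
// reserve() replaces the hundreds of incremental regrowths that
// per-element Add() calls would trigger during loading.
inline std::vector<const Light*>
GatherActiveLights(const std::vector<Light>& lights) {
    std::size_t count = 0;
    for (const Light& l : lights)
        if (l.active) ++count;

    std::vector<const Light*> active;
    active.reserve(count);               // exactly one allocation
    for (const Light& l : lights)
        if (l.active) active.push_back(&l);
    return active;
}
```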
External Libraries
During the loading process, I noticed that fragmentation was growing seemingly at random around certain parts of the code. After several failed attempts to identify the source, I decided to randomly break into the debugger around the time it was happening. I was lucky enough to find the problem within a couple of tries.
The problem was that an external library was doing a lot of small allocations and deallocations in another thread. After some investigation, I figured out that this library never allocated more than 256KB at any one time. Since it was heavily allocating and freeing blocks while the main thread was doing more “permanent” allocations, it was generating a huge amount of fragmentation (around ~2MB) in blocks of approximately 8 to 64 bytes scattered all over the OS pages!
The fix was quite easy. Since this library routed all of its allocations through an interface (as described in the first post), I simply routed them through a custom free-list that allocated 64KB OS pages when empty (using VirtualAlloc). This completely fixed the problem, since the library was now working on its own set of OS pages.
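A minimal free-list of that kind can be sketched as follows. This is not the engine’s actual implementation; malloc stands in for VirtualAlloc, and each free chunk stores the next-pointer inside itself, so there is no per-chunk overhead at all:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal fixed-size free-list that carves chunks out of its own 64KB
// pages, keeping a chatty library's allocations off the shared heap.
// Portable sketch: std::malloc stands in for VirtualAlloc here.
class PageFreeList {
public:
    explicit PageFreeList(std::size_t chunk_size)
        : chunk_size_(chunk_size < sizeof(void*) ? sizeof(void*) : chunk_size) {}

    ~PageFreeList() { for (void* p : pages_) std::free(p); }

    void* Allocate() {
        if (!head_) RefillFromNewPage();
        void* chunk = head_;
        head_ = *static_cast<void**>(head_);   // pop the free-list
        return chunk;
    }

    void Free(void* chunk) {
        *static_cast<void**>(chunk) = head_;   // push back on the free-list
        head_ = chunk;
    }

private:
    void RefillFromNewPage() {
        const std::size_t kPageSize = 64 * 1024;
        char* page = static_cast<char*>(std::malloc(kPageSize));
        pages_.push_back(page);
        // Thread every chunk of the new page onto the free-list.
        for (std::size_t off = 0; off + chunk_size_ <= kPageSize; off += chunk_size_)
            Free(page + off);
    }

    std::size_t chunk_size_;
    void* head_ = nullptr;
    std::vector<void*> pages_;
};
```

However much the library churns, all of its chunks now come from pages it owns exclusively, so the holes it leaves never interleave with the main thread’s long-lived allocations.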
Temporary Allocations
Another big source of fragmentation was temporary allocations. The main culprits were allocations performed during the loading of a level that were freed at the end of the loading process. Since these temporary allocations were scattered across the same OS pages as more “permanent” allocations, the fragmentation was quite big (between 10 and 30MB). Sometimes this meant that, even with ~30MB free, we couldn’t allocate 32KB of contiguous memory!
The solution to this problem was to provide several allocators:
- permanent: allocations that will never be freed.
- loading: allocations that will be freed after the loading of the level.
- level: allocations that will be freed when unloading the level.
- frame: allocations that will be freed at the end of the frame.
Implementing this solution can be a huge task (adding a parameter to all allocation methods), but it’s worth the effort if your engine suffers from fragmentation.
When I implemented this solution, I discovered a wonderful side effect: it helped track memory leaks. Since each allocator was completely destroyed when its lifetime ended, any allocations still inside it could be reported (using the tracking system described in my previous post on allocation tracking), along with the function name of each allocation. For example, between levels, any allocations remaining in the level allocator were reported as leaks. This provided a very efficient way to track memory leaks!
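The lifetime-tagged scheme can be sketched like this. The tag names come straight from the list above; the bucket bookkeeping is a simplified stand-in for the real allocators and tracking system:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Lifetime tag passed to every allocation call.
enum class Lifetime { Permanent, Loading, Level, Frame };

// Sketch: one bucket per lifetime; destroying a bucket frees everything
// still allocated with that tag, and anything still live at that point
// is a leak the tracking system can report.
class LifetimeAllocators {
public:
    void* Alloc(std::size_t size, Lifetime life) {
        void* p = std::malloc(size);
        buckets_[static_cast<int>(life)].push_back(p);
        return p;
    }

    void Free(void* p, Lifetime life) {
        auto& b = buckets_[static_cast<int>(life)];
        b.erase(std::remove(b.begin(), b.end(), p), b.end());
        std::free(p);
    }

    // Called when the lifetime ends, e.g. for Lifetime::Level between
    // levels. Returns the number of allocations that were still live:
    // each one would be reported as a leak, with its call site.
    std::size_t DestroyBucket(Lifetime life) {
        auto& b = buckets_[static_cast<int>(life)];
        std::size_t leaked = b.size();
        for (void* p : b) std::free(p);
        b.clear();
        return leaked;
    }

private:
    std::vector<void*> buckets_[4];
};
```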
Reporting
Reporting overhead, waste and fragmentation can be tricky, as the best way to get that information is from the allocator itself. If all the allocators in a game engine are custom-made, this is relatively easy to compute, maintain and report.
On the other hand, when using 3rd-party allocators like Doug Lea’s dlmalloc, this can be hard to achieve without deep knowledge of the allocator itself. Let’s take a look at alternatives for keeping track of these values without the help of the underlying allocator.
Reporting approximate overhead can be achieved by estimating the overhead per allocation and manually keeping track of it. For example, if you estimate that your 3rd-party allocator adds 24 bytes of overhead per allocation, increase the overhead counter by 24 for every allocation. This may not be 100% accurate, but it gives you an idea of whether you have too many small allocations.
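A sketch of that counter, wrapped around the underlying allocator (the 24-byte figure is an assumption, not a measured value):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>

// Estimated per-allocation overhead of the 3rd-party allocator.
// 24 bytes is a guess to be tuned per platform/allocator.
constexpr std::size_t kEstimatedHeaderSize = 24;

static std::atomic<std::size_t> g_alloc_count{0};
static std::atomic<std::size_t> g_estimated_overhead{0};

inline void* TrackedAlloc(std::size_t size) {
    ++g_alloc_count;
    g_estimated_overhead += kEstimatedHeaderSize;
    return std::malloc(size);
}

inline void TrackedFree(void* p) {
    --g_alloc_count;
    g_estimated_overhead -= kEstimatedHeaderSize;
    std::free(p);
}
```

When g_estimated_overhead becomes a noticeable fraction of a subsystem’s budget, that subsystem is making too many small allocations.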
Reporting accurate results for waste can be achieved if the underlying allocator provides a function that returns the real size of an allocation. For example, if you ask to allocate 18 bytes and the allocator’s GetAllocSize() function returns 32, your memory tracking system can compute that 14 bytes are wasted there. This may or may not include the padding due to alignment, though.
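With glibc’s malloc, for instance, malloc_usable_size plays the role of GetAllocSize (this is a glibc-specific, non-portable call used purely for illustration):

```cpp
#include <cstddef>
#include <cstdlib>
#include <malloc.h>   // malloc_usable_size (glibc-specific)

// If the allocator can report an allocation's real size, the waste is
// just the difference from what was requested.
inline std::size_t measured_waste(void* p, std::size_t requested) {
    return malloc_usable_size(p) - requested;
}
```

The tracking system can accumulate this value per subsystem at allocation time to report total waste alongside total usage.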
Reporting approximate fragmentation without the allocator’s help can be achieved by subtracting all known values from what the OS reports as the process memory usage. To achieve high accuracy, though, you also need to take into account the stack size of every thread, the executable size, etc.
This gives us the amount of memory currently available to the allocators from all the allocated OS pages, but doesn’t tell you the largest contiguous block available. The solution I used when generating memory reports was to start by trying to allocate the amount of memory left, then try smaller and smaller sizes until one succeeded. This is not the ideal solution, but it worked quite well.
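The probing loop can be sketched as follows. The step size and the use of malloc are illustrative; the real report would probe through the engine’s own allocator, and this should only run while generating offline reports, since it briefly claims the block it finds:

```cpp
#include <cstddef>
#include <cstdlib>

// Probe for the largest allocatable contiguous block by trying smaller
// and smaller sizes until one succeeds. Report-time only: it briefly
// allocates (then frees) the block it finds.
inline std::size_t ProbeLargestBlock(std::size_t upper_bound,
                                     std::size_t step = 64 * 1024) {
    for (std::size_t size = upper_bound; size >= step; size -= step) {
        if (void* p = std::malloc(size)) {
            std::free(p);
            return size;
        }
    }
    return 0;   // not even one step's worth of contiguous memory left
}
```

Comparing this number against the total free memory computed above gives a direct measure of fragmentation: the bigger the gap, the worse the scattering.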
Managing memory in large-scale game engines can be a full-time job, especially when dealing with problems like overhead, waste and fragmentation. The key is to keep an eye on it at all times, by generating accurate reports and reporting/fixing problems as soon as possible.
Hope this post was helpful to someone!