
Memory Management Part 4: Allocators Overhead, Waste and Fragmentation

January 13, 2012

Introduction

As discussed in my previous post on allocation tracking, tracking memory allocations is an essential capability for any medium to large-scale game engine. Every AAA game I've worked on over the last 15 years has used similar techniques to track and report memory allocations.

But tracking memory allocations alone is not enough. Under the hood, most allocators work by allocating memory pages (usually 4KB or 64KB, sometimes 1MB) from the OS and then splitting them into smaller chunks, based on the allocator's rules and policies. Managing those pages and chunks requires some memory for the allocator itself, resulting in a certain amount of memory that is not reported by our tracking system.

Let’s take a look at what types of memory a memory allocator can use.

Overhead

Most allocation schemes require a per-chunk header (usually 8 to 24 bytes) to track the size of the allocation and/or some flags and pointers to other chunks. These headers usually sit at the beginning of the memory chunk, so the allocator can easily find them when a pointer is passed to free or realloc by stepping back a few bytes.
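
To make the idea concrete, here is a minimal sketch of the header-before-the-pointer scheme, using a hypothetical 16-byte header and standard types (the names `ChunkHeader`, `HeaderMalloc`, etc. are mine, not from any real allocator):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical per-chunk header stored just before the user pointer.
struct ChunkHeader
{
    uint32_t Size;       // size of the user-visible chunk
    uint32_t Flags;      // allocator-specific flags
    ChunkHeader* Next;   // link to another chunk (free-list, bin, etc.)
};

void* HeaderMalloc(uint32_t Size)
{
    // Allocate room for the header plus the requested bytes
    ChunkHeader* Header = (ChunkHeader*)malloc(sizeof(ChunkHeader) + Size);
    Header->Size = Size;
    Header->Flags = 0;
    Header->Next = nullptr;
    return Header + 1; // user pointer starts right after the header
}

uint32_t HeaderGetSize(void* Ptr)
{
    // Step back over the header, exactly as free/realloc would
    return ((ChunkHeader*)Ptr - 1)->Size;
}

void HeaderFree(void* Ptr)
{
    free((ChunkHeader*)Ptr - 1);
}
```

Note that for a 16-byte header, an 8-byte allocation carries 200% overhead, which is why small allocations are singled out below.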

Some allocators, especially fixed-size allocators, use a per-page header (instead of per-chunk), usually at the beginning or the end of the page, reducing overhead. Given a pointer, they simply align the address down to the OS page size and read the header for that page, which contains the chunk size, the free-list pointer, etc.

Problematic Case
The overhead grows with the number of allocations. For example, if a subsystem makes a lot of small allocations (especially allocations near or smaller than the per-allocation overhead), the overhead can account for a huge percentage of the subsystem's memory usage. Ideally, such subsystems should allocate bigger chunks of memory and split them themselves.

Waste

Some space may be wasted in order to satisfy the requirements of the allocator.

Here are some examples:

  • Depending on the required alignment of the returned chunks, some space may be wasted at the beginning of the chunks to satisfy that alignment.
  • The allocator’s realloc policy may decide to ignore the reallocation entirely if the new size is smaller than the currently allocated chunk, for optimization reasons, based on some heuristic (if the new size is not smaller than half of the current size, for instance).
  • Fixed-size allocators will return blocks larger than requested. For example, an allocation of 18 bytes could return a 32-byte chunk, if that’s the smallest size that satisfies the request, resulting in 14 bytes of waste.
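
The alignment case in the first bullet is simple arithmetic; here is a small sketch (the function name is mine) computing how many bytes an allocator would have to pad to reach a given power-of-two alignment:

```cpp
#include <cassert>
#include <cstdint>

// Bytes of padding needed to bring an address up to a power-of-two alignment.
uint32_t AlignmentPadding(uintptr_t Address, uint32_t Alignment)
{
    // Round up to the next multiple of Alignment, then measure the distance
    uintptr_t Aligned = (Address + (Alignment - 1)) & ~(uintptr_t)(Alignment - 1);
    return (uint32_t)(Aligned - Address);
}
```

An address like 0x1004 requested with 16-byte alignment wastes 12 bytes of padding; an already-aligned address wastes none.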

Problematic Case
Like overhead, doing a lot of small allocations can result in a lot of waste, especially when allocating chunks smaller than the allocator's minimum chunk size.

Fragmentation

Inevitably, splitting the OS pages into smaller chunks will result in memory fragmentation. Let’s take a look at this memory report for a single 64KB page:

Each block represents a 256-byte chunk. Depending on the allocator, chunks can be split and coalesced on demand to provide smaller or bigger chunks of memory. As we can see, there are 133 free blocks (for a total of ~33KB free), but the largest contiguous block of free memory is only ~5KB (5376 bytes).

As a result, any allocation larger than 5376 bytes (minus the allocator’s required overhead) will fail in this page, even though approximately 33KB is free. This is what we call fragmentation.
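
The two numbers in that report (total free vs. largest contiguous run) can be computed with a single pass over a per-chunk free map; a minimal sketch, assuming a simple boolean map per page:

```cpp
#include <cassert>
#include <cstdint>

// Per-page scan: total free chunks and the largest contiguous free span.
struct PageStats
{
    uint32_t FreeChunks;  // total number of free chunks
    uint32_t LargestRun;  // longest run of adjacent free chunks
};

PageStats ScanPage(const bool* ChunkIsFree, uint32_t NumChunks)
{
    PageStats Stats = {0, 0};
    uint32_t Run = 0;
    for(uint32_t i = 0; i < NumChunks; i++)
    {
        if(ChunkIsFree[i])
        {
            Stats.FreeChunks++;
            Run++;
            if(Run > Stats.LargestRun)
                Stats.LargestRun = Run;
        }
        else
        {
            Run = 0; // a used chunk breaks the contiguous run
        }
    }
    return Stats;
}
```

A page can report plenty of free chunks while its largest run stays small; that gap is exactly the fragmentation described above.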

Here are some real examples that I came across during the development of the last Xbox 360 game I worked on.

Page Size Allocations
One of our subsystems was temporarily allocating a lot of 64KB blocks for disk streaming. At first, you might ask why this causes fragmentation, since each allocation should use exactly one OS page. The answer depends on the underlying allocator; in our case, the allocator was adding a 24-byte header to every allocation, with the result that each allocation was in fact consuming two OS pages! The first page was completely used, while the second one had only 24 bytes used.

Between the time these pages were allocated and freed, the empty space in every other page was used by other allocations. Then, when the subsystem released its temporary buffers, every other page couldn’t be freed since it still contained other allocations. The result was sometimes between 1 and 2MB of fragmented free memory inside these pages.

The solution was to directly allocate OS pages using VirtualAlloc instead, without going through our game engine’s allocators. As a side note, VirtualAlloc was wrapped through a global function to allow tracking.

Dynamic Arrays
Depending on the dynamic array implementation, adding elements can result in a lot of reallocations. For example, if during the loading process you need to populate a list of all active lights, one might simply add a call to Level->ActiveLights.Add(this) to the light’s loading function. With thousands of lights, the dynamic array may need to reallocate itself hundreds of times during the process, leaving bigger and bigger holes scattered all over the OS pages.

A simple solution is to defer this step to the end of the loading process: count the number of active lights, resize the dynamic array once, then add all the active lights to it. This can greatly reduce fragmentation.
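
The count-then-reserve pattern can be sketched like this, using std::vector as a stand-in for the engine's dynamic array (the `Light` type and function name are illustrative, not from the article's engine):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Light { bool Active = false; };

// Deferred population: count first, reserve once, then fill. With a typical
// 2x growth factor, the add-as-you-load pattern can trigger many
// reallocations for thousands of lights; this pattern triggers exactly one.
std::vector<Light*> GatherActiveLights(std::vector<Light>& AllLights)
{
    size_t NumActive = 0;
    for(Light& L : AllLights)
        if(L.Active)
            NumActive++;

    std::vector<Light*> ActiveLights;
    ActiveLights.reserve(NumActive); // single allocation up front
    for(Light& L : AllLights)
        if(L.Active)
            ActiveLights.push_back(&L); // never reallocates below capacity
    return ActiveLights;
}
```

The two passes over the lights are far cheaper than the repeated reallocations (and resulting holes) they avoid.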

Threaded Allocations
During the loading process, I noticed that fragmentation was growing unpredictably around certain parts of the code. After several failed attempts to identify the source, I decided to break randomly into the debugger around the time it was happening. I was lucky enough to find the problem in a couple of tries.

The problem was an external library doing a lot of small allocations and deallocations on another thread. After some investigation, I figured out that this library never allocated more than 256KB at any one time. But since it was heavily allocating and freeing blocks while the main thread was making more “permanent” allocations, it generated a huge amount of fragmentation (around ~2MB) in the form of 8- to 64-byte holes scattered all over the OS pages!

The fix was quite easy. Since this library routed all of its allocations through an interface (as described in the first post), I simply routed them through a custom free-list that allocated 64KB OS pages (using VirtualAlloc) when empty. This completely fixed the problem, since the library was now working on its own set of OS pages.

Temporary Allocations
Another big source of fragmentation was temporary allocations. The main culprits were allocations performed during level loading that were freed at the end of the loading process. Since these temporary allocations were scattered across the same OS pages as more “permanent” allocations, the fragmentation was substantial (between 10 and 30MB). Sometimes, even with ~30MB free, we couldn’t allocate 32KB of contiguous memory!

The solution to this problem was to provide several allocators:

  • permanent: allocations that will never be freed.
  • loading: allocations that will be freed after the loading of the level.
  • level: allocations that will be freed when unloading the level.
  • frame: allocations that will be freed at the end of the frame.

Implementing this solution can be a huge task (adding a parameter to all allocation functions), but it’s worth the effort if your engine suffers from fragmentation.
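
The lifetime list above can be sketched as an extra routing parameter; everything here (the enum values, the counting stand-in heap) is illustrative, not the engine's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical lifetime hint passed with every allocation; each value routes
// to a dedicated allocator that is destroyed when its lifetime ends.
enum AllocLifetime
{
    ALLOC_PERMANENT,     // never freed
    ALLOC_LOADING,       // freed after loading the level
    ALLOC_LEVEL,         // freed when unloading the level
    ALLOC_FRAME,         // freed at the end of the frame
    ALLOC_LIFETIME_COUNT
};

// Minimal stand-in allocator that just counts its outstanding bytes, so a
// non-zero count at destruction time reveals a leak for that lifetime.
struct LifetimeHeap
{
    uint32_t AllocatedBytes = 0;
    void* Malloc(uint32_t Size)
    {
        AllocatedBytes += Size;
        return malloc(Size);
    }
};

static LifetimeHeap GHeaps[ALLOC_LIFETIME_COUNT];

void* LifetimeMalloc(uint32_t Size, AllocLifetime Lifetime)
{
    return GHeaps[Lifetime].Malloc(Size);
}
```

Checking `AllocatedBytes` of the level heap between levels is exactly the leak-detection side effect described below.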

When I implemented that solution, I discovered a wonderful side effect: it helped track memory leaks. Since each allocator was completely destroyed when its lifetime ended, any allocations still inside it at that point could be reported (using the tracking system described in my previous post on allocation tracking), along with the function name of each allocation. For example, between levels, any allocations remaining in the level allocator were reported as leaks. This provided a very efficient way to track memory leaks!

Reporting

Reporting overhead, waste and fragmentation can be tricky, as the best way to get that information is from the allocator itself. If all the allocators in a game engine are custom-made, this is relatively easy to compute, maintain and report.

On the other hand, when using third-party allocators like Doug Lea’s dlmalloc, this can be hard to achieve without deep knowledge of the allocator itself. Let’s look at alternatives for tracking these values without the help of the underlying allocators.

Overhead
Approximate overhead can be reported by estimating the overhead per allocation and manually keeping track of it. For example, if you estimate that your third-party allocator adds 24 bytes of overhead per allocation, increase an overhead counter by 24 on every allocation. This won’t be 100% accurate, but it tells you whether you have too many small allocations.
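
The bookkeeping amounts to a pair of counters bumped from the allocation hooks; a minimal sketch, assuming a 24-byte estimate (the names are mine):

```cpp
#include <cassert>
#include <cstdint>

// Estimated per-allocation overhead of the underlying allocator. 24 bytes is
// an assumption; measure it or check the allocator's documentation.
static const uint32_t ESTIMATED_OVERHEAD = 24;

static uint32_t GTotalOverhead = 0; // estimated bytes lost to headers
static uint32_t GNumAllocs = 0;     // live allocation count

// Called from the engine's _Malloc / _Free hooks.
void TrackOverheadOnAlloc() { GTotalOverhead += ESTIMATED_OVERHEAD; GNumAllocs++; }
void TrackOverheadOnFree()  { GTotalOverhead -= ESTIMATED_OVERHEAD; GNumAllocs--; }
```

A report line like "12,345 live allocations, ~289KB estimated overhead" is often enough to spot a subsystem drowning in small allocations.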

Waste
Accurate waste reporting is possible if the underlying allocator provides a function that returns the real size of an allocation. For example, if you request 18 bytes and the allocator’s GetAllocSize() function returns 32, your memory tracking system can record 14 bytes of waste there. This may or may not account for padding due to alignment, though.
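
A sketch of that computation, standing in for the real GetAllocSize() with the power-of-two rounding a fixed-size allocator with power-of-two bins would apply (the policy here is an assumption, not dlmalloc's actual behavior):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the allocator's real-size query: round the request up to the
// next power-of-two bin, with an assumed 8-byte minimum chunk.
uint32_t GetAllocSize(uint32_t RequestedSize)
{
    uint32_t Real = 8;
    while(Real < RequestedSize)
        Real <<= 1;
    return Real;
}

static uint32_t GTotalWaste = 0;

// Called from the allocation hook: record the rounding waste.
void TrackWasteOnAlloc(uint32_t RequestedSize)
{
    GTotalWaste += GetAllocSize(RequestedSize) - RequestedSize;
}
```

With the 18-byte example from the text, GetAllocSize returns 32 and 14 bytes of waste are recorded.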

Fragmentation
Approximate fragmentation can be reported without the allocator’s help by subtracting all known values from what the OS reports as the process memory usage. For high accuracy, though, you also need to account for each thread’s stack size, the executable size, etc.

This gives us the amount of memory currently available to the allocators across all the allocated OS pages, but doesn’t tell us the largest contiguous block available. The solution I used when generating memory reports was to start by trying to allocate the amount of memory left, then try smaller and smaller sizes until one succeeded. Not the ideal solution, but it worked quite well.
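
The probing loop can be sketched like this; halving on each failure is my choice of step (the article doesn't specify one), and malloc stands in for the engine allocator:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Estimate the largest contiguous block by trying to allocate the amount of
// memory known to be free, then smaller and smaller sizes until one works.
// Only for offline reports: each probe briefly takes real memory.
uint32_t ProbeLargestBlock(uint32_t FreeBytes)
{
    for(uint32_t Try = FreeBytes; Try >= 4096; Try /= 2)
    {
        void* Ptr = malloc(Try);
        if(Ptr)
        {
            free(Ptr); // probe only; release immediately
            return Try;
        }
    }
    return 0; // couldn't even get 4KB contiguous
}
```

On a fragmented heap this returns a value well below the total free amount, which is exactly the signal the report needs.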

Conclusion

Managing memory in large-scale game engines can be a full-time job, especially when dealing with problems like overhead, waste and fragmentation. The key is to keep a constant eye on it by generating accurate reports, and to fix problems as soon as possible.

Hope this post was helpful to someone!

Memory Management Part 3: Memory Allocators Design

October 22, 2011

Introduction
Writing a simple memory allocator is quite easy. Writing an efficient allocator that tries to minimize fragmentation, overhead and locking while maximizing locality of reference and performance is an extremely hard task, even for experienced programmers.

In this post, I’ll cover the basics of designing a simple memory allocator. The technique used here is called heap layers, described in the excellent paper “Composing High-Performance Memory Allocators”.

Essentially, the authors propose to create a set of elementary layers that perform simple operations on the memory they allocate, and to compose those layers into complex allocators. Instead of using virtual calls to forward calls to upper layers, they use C++ mixins, described in the paper “Mixin-Based Programming in C++”. Mixins are templated classes that can have any parent class, making it possible to combine layers while still inlining calls to parent functions.

For example, Doug Lea’s dlmalloc is a very popular allocator, widely used in all kinds of software. Understanding its 2000+ lines of code well enough to modify it is hard and error-prone. The authors of the heap layers paper reproduced dlmalloc’s behavior using a handful of layers, in around 500 lines of code, with similar performance and memory usage. The main advantage of this technique is that it makes it really easy to test changes to the allocator.

The allocator that will be described in this post will look like this:

typedef StrictSegHeap<
    10, 
    StrictSegHeapTraits,
    ThreadSafeHeap<PagedFreeList>,
    ThreadSafeHeap<CRTMallocHeap>
> GlobalHeap;

A strict segregated heap of 10 fixed-size paged free-lists for small allocations, falling back to CRT malloc for large allocations, with a lock per size range instead of a single global lock to minimize contention. The layers provided here are for the purposes of this post, so readability and ease of understanding were favored over performance in most cases.

Let’s get started!

Basic Allocator
We start by writing a simple memory allocator that gets its memory from CRT malloc:

class CRTMallocHeap
{
public:
    inline void* Malloc(tU32 Size)
    {
        return malloc(Size);
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        return realloc(Ptr, Size);
    }

    inline void Free(void* Ptr)
    {
        return free(Ptr);
    }
};

Then, we plug this into the 3 memory functions described in previous posts:

typedef CRTMallocHeap GlobalHeap;

GlobalHeap GHeap;

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Malloc(Size);
    RegisterAlloc(Result, Size, AllocType, Desc, File, Line);
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    return GHeap.Free(Ptr);
}

We are now ready to implement our first memory allocator layer.

Free List Allocator
Remember the FreeList class described in the second post? It essentially allocates memory pages from the OS using VirtualAlloc and splits them into fixed-size chunks. All allocations that fit into that chunk size can then be served from the free-list extremely fast.

Let’s write a heap layer for that:

class PagedFreeList
{
    tU8* FirstFree;

public:
    PagedFreeList()
    : FirstFree(NULL)
    {}

    inline void* Malloc(tU32 Size)
    {
        if(!FirstFree)
        {
            const tU32 PageSize = 65536;
            const tU32 NumAllocPerBatch = PageSize / Size;

            // Allocate a 64K page from the OS
            tU8* AllocBatch = (tU8*)VirtualAlloc(NULL, PageSize, MEM_COMMIT, PAGE_READWRITE);

            for(tU32 i=0; i<NumAllocPerBatch; i++)
            {
                Free(AllocBatch);
                AllocBatch += Size;
            }
        }

        tU8* Result = FirstFree;
        FirstFree = *((tU8**)FirstFree);
        return Result;
    }

    inline void Free(void* Ptr)
    {
        *(tU8**)Ptr = FirstFree;
        FirstFree = (tU8*)Ptr;
    }
};

Size Heap
Some heaps need to know the size of an allocation in order to route it to the correct sub-heap. A quick way to achieve that is to write a simple layer that allocates a small header with each allocation and stores the allocation size in that header:

template<class SuperHeap> class SizeHeap: public SuperHeap
{
    struct FreeObject
    {
        tU32 Size;
    };

public:
    inline void* Malloc(tU32 Size)
    {
        FreeObject* Ptr = (FreeObject*)SuperHeap::Malloc(Size + sizeof(FreeObject));
        if(Ptr)
        {
            Ptr->Size = Size;
            return (void*)(Ptr + 1);
        }
        return NULL;
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        FreeObject* NewPtr = (FreeObject*)SuperHeap::Realloc(Ptr ? (FreeObject*)Ptr - 1 : NULL, Size + sizeof(FreeObject));
        if(NewPtr)
        {
            NewPtr->Size = Size;
            return (void*)(NewPtr + 1);
        }
        return NULL;
    }

    inline void Free(void* Ptr)
    {
        SuperHeap::Free((FreeObject*)Ptr - 1);
    }

    inline tU32 GetSize(void* Ptr)
    {
        return ((FreeObject*)Ptr - 1)->Size;
    }
};

This heap adds 4 bytes of overhead per allocation. Other schemes could be used to minimize this; for example, the PagedFreeList could implement GetSize() by keeping track of each page’s allocation size and returning that value, since a page always contains same-size allocations.
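
That per-page alternative can be sketched like this, using standard types; the `PageHeader` layout and function name are illustrative, not the article's actual PagedFreeList code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

const uint32_t PAGE_SIZE = 65536;

// Hypothetical per-page header kept at the start of each 64K page; every
// chunk in the page has the same size, so one field serves them all.
struct PageHeader
{
    uint32_t ChunkSize;
};

// Recover the page header from any pointer inside the page by masking the
// address down to the page boundary, then read the shared chunk size.
uint32_t PageGetSize(void* Ptr)
{
    PageHeader* Page = (PageHeader*)((uintptr_t)Ptr & ~(uintptr_t)(PAGE_SIZE - 1));
    return Page->ChunkSize;
}
```

This only works because VirtualAlloc returns page-aligned memory, so the masking always lands on the start of the owning page.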

ThreadSafe Heap
If we want to add thread safety to our layers, we simply implement a ThreadSafeHeap layer that provides Lock() and Unlock() functions, like this:

template<class SuperHeap> class ThreadSafeHeap: public SuperHeap
{
    volatile tS32 LockSem;

    inline void Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                MemoryBarrier();
                return;
            }

            YieldProcessor();
        }
    }

    inline void Unlock()
    {
        MemoryBarrier();
        LockSem = 0;
    }

public:
    ThreadSafeHeap()
    : LockSem(0)
    {}

    inline void* Malloc(tU32 Size)
    {
        Lock();
        void* Result = SuperHeap::Malloc(Size);
        Unlock();
        return Result;
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        Lock();
        void* Result = SuperHeap::Realloc(Ptr, Size);
        Unlock();
        return Result;
    }

    inline void Free(void* Ptr)
    {
        Lock();
        SuperHeap::Free(Ptr);
        Unlock();
    }
};

Adding this layer on top of another one makes it thread-safe. Simply rearranging where we place this layer can change the design from a slow, single lock:

typedef ThreadSafeHeap<
    StrictSegHeap<
        10,
        StrictSegHeapTraits,
        PagedFreeList,
        CRTMallocHeap
    >
> GlobalHeap;

to a lock per-layer:

typedef StrictSegHeap<
    10, 
    StrictSegHeapTraits,
    ThreadSafeHeap<PagedFreeList>,
    ThreadSafeHeap<CRTMallocHeap>
> GlobalHeap;

Using a single lock will block whenever two or more threads use the allocator at the same time. Using a lock per sub-heap, as shown in the last code sample, only blocks when two or more threads allocate in the same size range at the same time.

Strict Segregated Heap
The last layer shown here is a strict segregated heap: a layer that separates allocations by size using sub-heaps. This layer is strict because it doesn’t perform block splitting or coalescing (as dlmalloc does, for instance).

First, we need to define how the sizes are separated. For this example, I chose simple powers of two between 8 and 4096:

1..8
9..16
17..32
33..64
65..128
129..256
257..512
513..1024
1025..2048
2049..4096

Which translates into code like this:

struct StrictSegHeapTraits
{
    // Return the bin that this size falls into
    static inline tU32 GetSizeClass(tU32 Size)
    {
        tU32 c = 0;
        Size--;
        while(Size > 7)
        {
            Size >>= 1;
            c++;
        }
        return c;
    }

    // Return the size of allocations in this bin
    static inline tU32 GetClassMaxSize(tU32 Bin)
    {
        return 1 << (Bin + 3);
    }
};

And the strict segregated heap looks like this:

template<tU32 NumBins, class Traits, class LittleHeap, class BigHeap> class StrictSegHeap: public BigHeap
{
    // The array of little heaps (bins), one for each allocation size
    LittleHeap LittleHeaps[NumBins];

    // Return the bin that this allocation falls into
    inline tU32 InnerGetSizeClass(tU32 Size)
    {
        if(Size > Traits::GetClassMaxSize(NumBins - 1))
        {
            return NumBins;
        }
        return Traits::GetSizeClass(Size);
    }

public:
    StrictSegHeap()
    {}

    inline void* Malloc(tU32 Size)
    {
        tU32 SizeClass = InnerGetSizeClass(Size);

        if(SizeClass >= NumBins)
        {
            // Allocation is too big, route to big heap
            return BigHeap::Malloc(Size);
        }

        tU32 MaxSizeClass = Traits::GetClassMaxSize(SizeClass);
        return LittleHeaps[SizeClass].Malloc(MaxSizeClass);
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        if(Ptr)
        {
            if(Size)
            {
                tU32 ObjectSize = GetSize(Ptr);
                tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);

                if(ObjectSizeClass >= NumBins)
                {
                    return BigHeap::Realloc(Ptr, Size);
                }

                // Loose reallocation: only realloc if bigger or at least twice smaller
                if((Size > ObjectSize) || (Size < ObjectSize / 2))
                {
                    void* NewPtr = Malloc(Size);

                    if(NewPtr)
                    {
                        tU32 CopySize = Min(ObjectSize, Size);
                        memcpy(NewPtr, Ptr, CopySize);

                        // Only release the old chunk once the copy succeeded
                        LittleHeaps[ObjectSizeClass].Free(Ptr);
                    }

                    return NewPtr;
                }

                return Ptr;
            }

            Free(Ptr);
            return NULL;
        }

        return Malloc(Size);
    }

    inline void Free(void* Ptr)
    {
        if(Ptr)
        {
            tU32 ObjectSize = GetSize(Ptr);
            tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);

            if(ObjectSizeClass >= NumBins)
            {
                BigHeap::Free(Ptr);
            }
            else
            {
                LittleHeaps[ObjectSizeClass].Free(Ptr);
            }
        }
    }
};

This heap layer routes small allocations to their corresponding sub-heaps and big allocations to the big heap. As we can see, the code relies on the sub-heaps providing a GetSize() function to obtain the size of incoming allocations for Realloc and Free, so that function has to be provided or the code won’t compile.

Conclusion
Using the heap layers technique, we can now implement and test new layers quite easily and compose them into complex allocators. There are a lot of utility layers that could be written; for example, a debug layer that keeps a header and a footer for each allocation to detect memory overwrites. Even the memory tracking system described in the second post could be written as a layer. Make sure to read the papers mentioned above if you’ve read this far!

Memory Management Part 2: Allocations Tracking

October 6, 2011

Introduction
In this post, I’ll cover the basics of memory allocation tracking. If you haven’t read my previous post on allocation strategy, make sure to read it before continuing.

Here’s what we want to achieve:

  • Keep track of every single allocation in the game with its allocation source (file, line, etc).
  • Categorize every allocation into high-level categories, such as Textures, Meshes, AI, etc.
  • Track memory leaks and generate categorized memory usage reports.
  • Minimize the impacts on the game’s performance, memory usage and fragmentation.

Minimizing Impacts
Keeping track of all allocations in a game engine can use a large amount of memory by itself. When tracking memory for usage reports or leaks, this may not be a problem; but when using the system to gather information on fragmentation, it is crucial that it has minimal impact on memory usage, or it may alter the results.

How to achieve that? The obvious answer is that the tracking system should not allocate its memory from the game engine’s allocators. Here we introduce the FreeList class:

template<class T> class FreeList
{
    T* Free;

public:
    FreeList()
    : Free(NULL)
    {}

    inline T* New()
    {
        if(!Free)
        {
            const tU32 AllocSize = 65536;
            const tU32 NumAllocPerBatch = AllocSize / sizeof(T);

            T* AllocBatch = (T*)VirtualAlloc(NULL, AllocSize, MEM_COMMIT, PAGE_READWRITE);

            for(tU32 i=0; i<NumAllocPerBatch; i++)
            {
                Delete(AllocBatch++);
            }
        }

        T* Result = Free;
        Free = *((T**)Free);
        return Result;
    }

    inline void Delete(T* Ptr)
    {
        *(T**)Ptr = Free;
        Free = Ptr;
    }
};

A free-list is a very important concept in memory allocator design. The particularity of this implementation is that it allocates 64KB pages directly from the OS whenever it is empty. It then uses the first bytes of each free element that fits in the page to link to the next free element.

Allocating directly from the OS ensures that the tracking system doesn’t interfere with normal memory allocations in the engine, and minimizes fragmentation since it allocates contiguous elements within OS pages.

In heavy usage scenarios, a FreeList will link free elements across multiple pages, but it still shouldn’t affect fragmentation, since it doesn’t share OS pages with the game engine’s allocators.
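
Here is a portable mini version of the FreeList above that you can try anywhere: the same intrusive linking through the first bytes of each free element, but batching from malloc instead of VirtualAlloc so the sketch builds off Windows (that substitution, and the class name, are mine):

```cpp
#include <cstdlib>

struct Node { void* Pad[4]; }; // sample element; must be at least pointer-sized

// Portable sketch of the article's FreeList: batch-allocate a 64KB block and
// thread a free-list through it. Blocks are intentionally never returned to
// the OS, like the original.
template<class T> class MiniFreeList
{
    T* Free = nullptr;

public:
    T* New()
    {
        if(!Free)
        {
            const unsigned BatchBytes = 65536;
            const unsigned Count = BatchBytes / sizeof(T);
            T* Batch = (T*)malloc(BatchBytes);
            for(unsigned i = 0; i < Count; i++)
                Delete(Batch++); // push every element onto the free-list
        }
        T* Result = Free;
        Free = *((T**)Free); // first bytes of a free element link to the next
        return Result;
    }

    void Delete(T* Ptr)
    {
        *(T**)Ptr = Free; // push onto the head: freed elements are reused LIFO
        Free = Ptr;
    }
};
```

The LIFO reuse is deliberate: the most recently freed element is the most likely to still be in cache.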

Tracking Allocations Hook
Now we’re ready to track allocations. The first step is to “hijack” the 3 memory functions we defined in the first part:

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = malloc(Size);
    RegisterAlloc(Result, Size, AllocType, Desc, File, Line);
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    return free(Ptr);
}

Implementation
Next, we need to implement RegisterAlloc, UpdateAlloc and UnregisterAlloc. The first step is to define the allocation structure AllocDesc, which holds information on a single allocation, and the category structure TagDesc, which holds the allocation category’s information, such as the category name (the Desc string provided with each allocation call), the total allocated size, etc.:

struct TagDesc
{
    tU32 CRC;         // Category description string CRC
    tU32 Size;        // Total size of all allocations in this category
    tChar Tag[128];   // Category name
    TagDesc* Next;    // Next category in the hashtable

    static TagDesc* Register(const tChar* Desc, tU32 Size);
};

struct AllocDesc
{
    void* Ptr;          // Allocation address
    tU32 Size;          // Allocation size
    const tChar* File;  // Source file of the allocation
    tU32 Line;          // Source line of the allocation
    TagDesc* Tag;       // Pointer to the allocation category
    AllocDesc* Next;    // Next allocation in the hashtable
};

The steps for registering an allocation look like this:

  1. Allocate a new AllocDesc from a FreeList<AllocDesc>.
  2. Register the allocation into its category.
  3. Register the new AllocDesc into the hashtable.
void RegisterAlloc(void* Ptr, tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    // Allocate a new AllocDesc from the freelist
    AllocDesc* NewAllocDesc = FreeAllocs.New();

    // Set the allocation info
    NewAllocDesc->Ptr = Ptr;
    NewAllocDesc->Size = Size;
    NewAllocDesc->File = File;
    NewAllocDesc->Line = Line;

    // Register the allocation into its category (this updates the category total size)
    NewAllocDesc->Tag = TagDesc::Register(Desc, Size);

    // Register the allocation in the fixed-size hashtable
    tU32 PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);
    NewAllocDesc->Next = Hash[PtrHash];
    Hash[PtrHash] = NewAllocDesc;
}

Registering a new category goes like this:

  1. Look up the category in the hashtable.
  2. If it already exists, increment the category’s size by the allocation size.
  3. If it doesn’t exist, create a new one from a FreeList<TagDesc>.
  4. Register the new TagDesc into the hashtable.
TagDesc* TagDesc::Register(const tChar* Tag, tU32 Size)
{
    tU32 TagCRC = Strhash(Tag);
    tU32 TagHash = TagCRC & (ALLOC_DESC_SIZE - 1);
    TagDesc* CurTag = Hash[TagHash];

    while(CurTag)
    {
        if(CurTag->CRC == TagCRC)
        {
            // This category already exist, update its size and return it
            CurTag->Size += Size;
            return CurTag;
        }

        CurTag = CurTag->Next;
    }

    // Allocate a new TagDesc from the freelist
    CurTag = FreeTags.New();

    // Set the category info
    CurTag->CRC = TagCRC;
    CurTag->Size = Size;

    tU32 TagLen = Min(strlen(Tag), 127);
    memcpy(CurTag->Tag, Tag, TagLen);
    CurTag->Tag[TagLen] = 0;

    // Register the category in the fixed-size hashtable
    CurTag->Next = Hash[TagHash];
    Hash[TagHash] = CurTag;

    return CurTag;
}

UpdateAlloc and UnregisterAlloc are straightforward: simply update the allocation and its category accordingly.
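
For completeness, here is a self-contained sketch of what UnregisterAlloc can look like; the structures are trimmed copies of the ones above, the pointer hash is my own choice, and the freelist return is left to the caller:

```cpp
#include <cstdint>

const uint32_t ALLOC_DESC_SIZE = 1024; // must be a power of two

struct TagDesc { uint32_t Size; };
struct AllocDesc
{
    void* Ptr;
    uint32_t Size;
    TagDesc* Tag;
    AllocDesc* Next;
};

static AllocDesc* GHash[ALLOC_DESC_SIZE];

uint32_t GetPtrHash(void* Ptr)
{
    // Cheap pointer hash: discard the low alignment bits
    return (uint32_t)((uintptr_t)Ptr >> 4);
}

AllocDesc* UnregisterAlloc(void* Ptr)
{
    uint32_t PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);

    // Walk the bucket, keeping the previous link so we can unlink in place
    AllocDesc** Prev = &GHash[PtrHash];
    for(AllocDesc* Cur = *Prev; Cur; Prev = &Cur->Next, Cur = Cur->Next)
    {
        if(Cur->Ptr == Ptr)
        {
            Cur->Tag->Size -= Cur->Size; // update the category total
            *Prev = Cur->Next;           // unlink from the bucket
            return Cur;                  // caller returns it to the freelist
        }
    }
    return nullptr; // untracked pointer: double free or external allocation
}
```

Returning nullptr for an unknown pointer is also a handy hook for catching double frees and allocations that bypassed the tracking macros.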

Analyzing the Data
Then comes the fun part: dumping and analyzing the gathered information. I usually dump the data in a comma-separated values format, since I tend to use Excel to analyze it:

void DumpAllocs()
{
    printf("Address,Category,CategorySize,AllocSize\n");
    for(tU32 i=0; i<ALLOC_DESC_SIZE; i++)
    {
        AllocDesc* Alloc = Hash[i];
        while(Alloc)
        {
            printf("%p,%s,%u,%u\n", Alloc->Ptr, Alloc->Tag->Tag, Alloc->Tag->Size, Alloc->Size);
            Alloc = Alloc->Next;
        }
    }
}

Which looks like this when turned into a quick report in Excel:

  Meshes 11336K
  Sounds 41553K
Textures 7867K

Tracking memory leaks
With this memory tracking up and running, it’s easy to set up a quick memory leak detection system: dump all allocations at a certain point in the game, then dump another snapshot later on. By comparing the 2 snapshots, we can quickly show the deltas for each category and even for individual allocations. The tricky part is analyzing the data correctly, and deciding what is normal and what is a real leak.
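
The snapshot comparison can be sketched as a per-category diff; std::map is my stand-in container here, acceptable because this runs in an offline report path, not inside the allocator:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Per-category byte totals captured at one point in time.
typedef std::map<std::string, uint32_t> Snapshot;

// Return the categories that grew between the two snapshots; a steady growth
// across repeated diffs is a strong leak candidate.
Snapshot DiffSnapshots(const Snapshot& Before, const Snapshot& After)
{
    Snapshot Deltas;
    for(const auto& Entry : After)
    {
        uint32_t Old = 0;
        auto It = Before.find(Entry.first);
        if(It != Before.end())
            Old = It->second;
        if(Entry.second > Old)
            Deltas[Entry.first] = Entry.second - Old; // grew: potential leak
    }
    return Deltas;
}
```

Diffing at matching points (say, the same spot in the same level, twice) filters out most of the "normal" growth and leaves the suspicious categories.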

Conclusion
This essentially covers the basics of memory allocation tracking. A lot of advanced features can be added, like capturing call-stacks for each allocation, or on-screen and even over-the-network memory visualization. Simply implement what suits your needs best!

Memory Management Part 1: Introduction

October 3, 2011

This is the first of a series of posts on memory management in a game engine.

Series Outline
The topics I’d like to cover are numerous and may change depending on the response to previous posts. Among others, I will cover (in no particular order):

  • Allocation strategy
  • Usage tracking
  • Leaks detection
  • Allocators design
  • Multithread issues
  • Fragmentation
  • False sharing
  • Contention
  • Waste/overhead

Motivation
The memory management strategy has important effects on any game engine. The performance of the allocator(s) is important; memory leaks, overhead and fragmentation are hard to report and even harder to fix. The need for good tools is critical.

Allocation Strategy
In order to track memory allocations effectively and provide services like memory leak detection and memory usage reports, it is crucial that all allocations go through a single point in the engine. This ensures that everything is taken care of, and reduces unknown memory usage to a minimum.

To do so, we define the only 3 functions that should be used when allocating memory:

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line);
void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line);
void _Free(void* Ptr);

That’s a good start, but the new and delete operators should also be overloaded:

inline void* operator new(size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void* operator new[](size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void operator delete(void* Ptr) { _Free(Ptr); }
inline void operator delete[](void* Ptr) { _Free(Ptr); }

Initially, routing these functions to the CRT’s malloc, realloc and free is fine. In future posts, we’ll implement them using our own allocators.

Notice that every allocation has to provide a description string, file and line number. This will be used to track allocations, generate reports on memory usage/fragmentation and detect memory leaks.

Also notice the AllocType parameter: it provides a hint on which allocator should perform the allocation. This will be covered in future posts, for example when discussing memory fragmentation reduction.

It can be tempting to allow allocating without this information (by using default parameters), but I highly recommend against it: being rigorous here pays off in the end. Letting people allocate without providing accurate information rapidly becomes a mess, and would render subsystems such as memory usage/fragmentation/leak tracking unusable.

To reduce the code overhead, we can define some convenient helper macros that automatically provide source file and line numbers:

#define Malloc(SIZE, TYPE, DESC) _Malloc(SIZE, TYPE, DESC, __FILE__, __LINE__)
#define Realloc(PTR, SIZE) _Realloc(PTR, SIZE, __FILE__, __LINE__)
#define Free(PTR) _Free(PTR)
#define New(CLASS, TYPE, DESC) new(TYPE, DESC, __FILE__, __LINE__) CLASS
#define Delete delete
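
A typical call site then looks like this; the stub _Malloc below only exists so the sketch compiles standalone (the real one routes through the engine allocator and registers the allocation), and the function name and "Meshes" category are illustrative:

```cpp
#include <cstdint>
#include <cstdlib>

typedef uint32_t tU32;
typedef char tChar;

// Standalone stub of the engine's _Malloc; the tracking parameters are
// accepted but ignored here.
void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    (void)AllocType; (void)Desc; (void)File; (void)Line;
    return malloc(Size);
}

#define Malloc(SIZE, TYPE, DESC) _Malloc(SIZE, TYPE, DESC, __FILE__, __LINE__)

// Typical call site: size, allocator-type hint, category description.
// __FILE__ and __LINE__ are filled in automatically by the macro.
void* AllocateVertexBuffer(tU32 Bytes)
{
    return Malloc(Bytes, 0, "Meshes");
}
```

Every call site thus carries its category and source location without any extra typing beyond the description string.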

At this point, all allocations throughout the engine should be converted to the new wrappers we just wrote. This might not be an easy task depending on the size of the codebase, but it’s worth it.

Third-party external libraries
What about third-party external libraries? Any professional library should come with a way to route its internal memory allocations. If you’re using a library that doesn’t allow that, stop using it, or contact its support so they provide such functionality.

Usually, third-party libraries provide either function definitions that the user needs to implement (such as Box2D’s b2Alloc/b2Free) or an interface similar to this:

class MemoryInterface
{
public:
    virtual void* malloc(int Size)=0;
    virtual void* realloc(void* Ptr, int Size)=0;
    virtual void free(void* Ptr)=0;
};

Implementing this interface should look like this:

class EngineMemoryInterface: public MemoryInterface
{
public:
    void* malloc(int Size) { return Malloc(Size, 0, "ExternalLibraryName"); }
    void* realloc(void* Ptr, int Size) { return Realloc(Ptr, Size); }
    void free(void* Ptr) { return Free(Ptr); }
};

If the library also provides internal file and line number of the allocations, simply route them too.

Conclusion
Congratulations, you’ve wrapped all allocations with your own memory allocator stubs. This is only the first step, but a major one. To make sure everything is still under control, I usually put a breakpoint in the CRT’s malloc and run the game once in a while: simply open the breakpoints window, type ‘malloc’ in the function name field, and play the game. The breakpoint should never be hit.

Now that we’re all set up, we’re ready to add functionality such as memory leak tracking. See you in the next post!
