Archive for October, 2011

Memory Management Part 3: Memory Allocators Design

October 22, 2011

Introduction
Writing a simple memory allocator is quite easy. Writing an efficient allocator that minimizes fragmentation, overhead and locking while maximizing locality of reference and performance is extremely hard, even for experienced programmers.

In this post, I’ll cover the basics of designing a simple memory allocator. The technique that will be used is called heap layers, described in the excellent paper “Composing High-Performance Memory Allocators”.

Essentially, what the authors propose is to create a set of elementary layers that perform simple operations on the memory they allocate, and compose those layers together to create complex allocators. Instead of using virtual calls to forward requests to parent layers, they use C++ mixins, described in the “Mixin-Based Programming in C++” paper. Mixins are templated classes that can have any parent class, making it possible to combine layers and have inlining when calling parent functions.
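
For illustration, here's a minimal sketch of such a mixin layer (the ExampleHeap name and the final typedef are mine, not from the paper; CRTMallocHeap is defined below). Each layer takes its parent heap as a template parameter, so calls to the super-heap resolve statically and can be inlined:

template<class SuperHeap> class ExampleHeap: public SuperHeap
{
public:
    inline void* Malloc(tU32 Size)
    {
        // A layer can do work before and/or after forwarding the request;
        // the call below is a regular member call, not a virtual one, so
        // the compiler is free to inline the whole chain
        return SuperHeap::Malloc(Size);
    }
};

// Layers compose by nesting template parameters:
typedef ExampleHeap<CRTMallocHeap> ExampleComposedHeap;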

For example, Doug Lea’s dlmalloc is a very popular allocator, widely used in various software. Understanding its 2000+ lines of code well enough to modify it is hard and error-prone. On the other hand, the authors of the heap layers paper reproduced dlmalloc‘s behavior with a couple of layers, in around 500 lines of code, with similar performance and memory usage. The main advantage of this technique is that it makes testing changes to the allocator really easy.

The allocator that will be described in this post will look like this:

typedef StrictSegHeap<
    10, 
    StrictSegHeapTraits,
    ThreadSafeHeap<PagedFreeList>,
    ThreadSafeHeap<CRTMallocHeap>
> GlobalHeap;

A strict segregated heap of 10 fixed-size paged free-lists for small allocations, using the CRT malloc for large allocations, with a lock per size class instead of a single global lock to minimize contention. The layers provided here are for the purpose of the post, so in most cases performance was traded for readability and ease of understanding.

Let’s get started!

Basic Allocator
We start by writing a simple memory allocator that gets its memory from the CRT’s malloc:

class CRTMallocHeap
{
public:
    inline void* Malloc(tU32 Size)
    {
        return malloc(Size);
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        return realloc(Ptr, Size);
    }

    inline void Free(void* Ptr)
    {
        return free(Ptr);
    }
};

Then, we plug this into the 3 memory functions described in the previous posts:

typedef CRTMallocHeap GlobalHeap;

GlobalHeap GHeap;

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Malloc(Size);
    RegisterAlloc(Result, Size, AllocType, Desc, File, Line);
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    return GHeap.Free(Ptr);
}

We are now ready to implement our first memory allocator layer.

Free List Allocator
Remember the FreeList class described in the second post? It essentially allocates memory pages from the OS using VirtualAlloc and splits them into fixed-size chunks. All allocations that fit into that chunk size can then be served by the free-list extremely fast.

Let’s write a heap layer for that:

class PagedFreeList
{
    tU8* FirstFree;

public:
    PagedFreeList()
    : FirstFree(NULL)
    {}

    inline void* Malloc(tU32 Size)
    {
        if(!FirstFree)
        {
            // Size is assumed to be at least sizeof(void*) and at most PageSize
            const tU32 PageSize = 65536;
            const tU32 NumAllocPerBatch = PageSize / Size;

            // Allocate a 64K page from the OS
            tU8* AllocBatch = (tU8*)VirtualAlloc(NULL, PageSize, MEM_COMMIT, PAGE_READWRITE);

            for(tU32 i=0; i<NumAllocPerBatch; i++)
            {
                Free(AllocBatch);
                AllocBatch += Size;
            }
        }

        tU8* Result = FirstFree;
        FirstFree = *((tU8**)FirstFree);
        return Result;
    }

    inline void Free(void* Ptr)
    {
        *(tU8**)Ptr = FirstFree;
        FirstFree = (tU8*)Ptr;
    }
};

Size Heap
Some heaps need to know the size of an allocation in order to route it to the proper heap. A quick way to achieve that is to write a simple layer that allocates a small header along with each allocation and stores the size of the allocation in that header:

template<class SuperHeap> class SizeHeap: public SuperHeap
{
    struct FreeObject
    {
        tU32 Size;
    };

public:
    inline void* Malloc(tU32 Size)
    {
        FreeObject* Ptr = (FreeObject*)SuperHeap::Malloc(Size + sizeof(FreeObject));
        if(Ptr)
        {
            Ptr->Size = Size;
            return (void*)(Ptr + 1);
        }
        return NULL;
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        FreeObject* NewPtr = (FreeObject*)SuperHeap::Realloc(Ptr ? (FreeObject*)Ptr - 1 : NULL, Size + sizeof(FreeObject));
        if(NewPtr)
        {
            NewPtr->Size = Size;
            return (void*)(NewPtr + 1);
        }
        return NULL;
    }

    inline void Free(void* Ptr)
    {
        if(Ptr)
        {
            SuperHeap::Free((FreeObject*)Ptr - 1);
        }
    }

    inline tU32 GetSize(void* Ptr)
    {
        return ((FreeObject*)Ptr - 1)->Size;
    }
};

This header costs 4 bytes per allocation. Other schemes could be used to minimize the overhead; for example, PagedFreeList could implement GetSize() by keeping track of each page’s chunk size and returning that value, since a page always contains allocations of the same size.
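
As a sketch of that scheme (my own illustration, not from the post’s code: it assumes 32-bit pointers as elsewhere in this post, chunk sizes of at least sizeof(void*), and relies on VirtualAlloc returning addresses aligned on the 64K allocation granularity), a variant of PagedFreeList could store the chunk size at the start of each page and recover it by masking the pointer:

class SizedPagedFreeList
{
    tU8* FirstFree;

public:
    SizedPagedFreeList()
    : FirstFree(NULL)
    {}

    inline void* Malloc(tU32 Size)
    {
        if(!FirstFree)
        {
            const tU32 PageSize = 65536;

            // Allocate a 64K page from the OS; the returned address is
            // aligned on the 64K allocation granularity
            tU8* Page = (tU8*)VirtualAlloc(NULL, PageSize, MEM_COMMIT, PAGE_READWRITE);

            // Store the chunk size at the start of the page and sacrifice
            // the first chunk slot to hold it
            *(tU32*)Page = Size;

            for(tU8* Chunk = Page + Size; Chunk + Size <= Page + PageSize; Chunk += Size)
            {
                Free(Chunk);
            }
        }

        tU8* Result = FirstFree;
        FirstFree = *((tU8**)FirstFree);
        return Result;
    }

    inline void Free(void* Ptr)
    {
        *(tU8**)Ptr = FirstFree;
        FirstFree = (tU8*)Ptr;
    }

    inline tU32 GetSize(void* Ptr)
    {
        // Mask the low 16 bits to find the owning page, whose header
        // holds the size of every chunk in that page
        return *(tU32*)((tU32)Ptr & ~(tU32)0xFFFF);
    }
};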

ThreadSafe Heap
If we want to add thread-safety to our layers, we simply need to implement a ThreadSafeHeap layer that provides Lock() and Unlock() functions, like this:

template<class SuperHeap> class ThreadSafeHeap: public SuperHeap
{
    volatile tS32 LockSem;

    inline void Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                MemoryBarrier();
                return;
            }

            YieldProcessor();
        }
    }

    inline void Unlock()
    {
        MemoryBarrier();
        LockSem = 0;
    }

public:
    ThreadSafeHeap()
    : LockSem(0)
    {}

    inline void* Malloc(tU32 Size)
    {
        Lock();
        void* Result = SuperHeap::Malloc(Size);
        Unlock();
        return Result;
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        Lock();
        void* Result = SuperHeap::Realloc(Ptr, Size);
        Unlock();
        return Result;
    }

    inline void Free(void* Ptr)
    {
        Lock();
        SuperHeap::Free(Ptr);
        Unlock();
    }
};

Then, adding this layer on top of another one makes it thread-safe. Simply rearranging where we place this layer changes the design from a slow, single lock:

typedef ThreadSafeHeap<
    StrictSegHeap<
        10,
        StrictSegHeapTraits,
        PagedFreeList,
        CRTMallocHeap
    >
> GlobalHeap;

to a lock per-layer:

typedef StrictSegHeap<
    10, 
    StrictSegHeapTraits,
    ThreadSafeHeap<PagedFreeList>,
    ThreadSafeHeap<CRTMallocHeap>
> GlobalHeap;

Using a single lock will block whenever two or more threads use the allocator at the same time. On the other hand, with a lock per sub-heap as shown in the last code sample, threads will contend only when two or more of them allocate in the same size range at the same time.

Strict Segregated Heap
The last layer that will be shown is a strict segregated layer: a layer that separates allocations by size using sub-heaps. This layer is strict because it doesn’t perform block splitting or coalescing (as dlmalloc does, for instance).

First, we need to define how the sizes will be separated. For this example, I chose simple powers of two between 8 and 4096:

1..8
9..16
17..32
33..64
65..128
129..256
257..512
513..1024
1025..2048
2049..4096

Which translates into code like this:

struct StrictSegHeapTraits
{
    // Return the bin that this size falls into
    static inline tU32 GetSizeClass(tU32 Size)
    {
        tU32 c = 0;
        Size--;
        while(Size > 7)
        {
            Size >>= 1;
            c++;
        }
        return c;
    }

    // Return the size of allocations in this bin
    static inline tU32 GetClassMaxSize(tU32 Bin)
    {
        return 1 << (Bin + 3);
    }
};
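
A quick way to convince ourselves that the traits match the table above is a small sanity check (hypothetical test code; Assert is assumed to be the engine’s assertion macro):

void TestStrictSegHeapTraits()
{
    for(tU32 Size = 1; Size <= 4096; Size++)
    {
        tU32 Bin = StrictSegHeapTraits::GetSizeClass(Size);

        // The bin must be large enough to hold the size...
        Assert(Size <= StrictSegHeapTraits::GetClassMaxSize(Bin));

        // ...and the previous bin must be too small for it
        Assert(Bin == 0 || Size > StrictSegHeapTraits::GetClassMaxSize(Bin - 1));
    }
}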

And the strict segregated heap looks like this:

template<tU32 NumBins, class Traits, class LittleHeap, class BigHeap> class StrictSegHeap: public BigHeap
{
    // The array of little heaps (bins), one for each allocation size class
    // (named LittleHeaps, since a member can't reuse the template parameter's name)
    LittleHeap LittleHeaps[NumBins];

    // Return the bin that this allocation falls into
    inline tU32 InnerGetSizeClass(tU32 Size)
    {
        if(Size > Traits::GetClassMaxSize(NumBins - 1))
        {
            return NumBins;
        }
        return Traits::GetSizeClass(Size);
    }

public:
    StrictSegHeap()
    {}

    inline void* Malloc(tU32 Size)
    {
        tU32 SizeClass = InnerGetSizeClass(Size);

        if(SizeClass >= NumBins)
        {
            // Allocation is too big, route to big heap
            return BigHeap::Malloc(Size);
        }

        tU32 MaxSizeClass = Traits::GetClassMaxSize(SizeClass);
        return LittleHeaps[SizeClass].Malloc(MaxSizeClass);
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        if(Ptr)
        {
            if(Size)
            {
                tU32 ObjectSize = GetSize(Ptr);
                tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);

                if(ObjectSizeClass >= NumBins)
                {
                    return BigHeap::Realloc(Ptr, Size);
                }

                // Loose reallocation: only realloc if bigger or at least twice smaller
                if((Size > ObjectSize) || (Size < ObjectSize / 2))
                {
                    void* NewPtr = Malloc(Size);

                    if(NewPtr)
                    {
                        tU32 CopySize = Min(ObjectSize, Size);
                        memcpy(NewPtr, Ptr, CopySize);

                        // Free the old block only once the copy succeeded
                        LittleHeaps[ObjectSizeClass].Free(Ptr);
                    }

                    return NewPtr;
                }

                return Ptr;
            }

            Free(Ptr);
            return NULL;
        }

        return Malloc(Size);
    }

    inline void Free(void* Ptr)
    {
        if(Ptr)
        {
            tU32 ObjectSize = GetSize(Ptr);
            tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);

            if(ObjectSizeClass >= NumBins)
            {
                BigHeap::Free(Ptr);
            }
            else
            {
                LittleHeaps[ObjectSizeClass].Free(Ptr);
            }
        }
    }
};

This heap layer routes small allocations to their corresponding sub-heaps and big allocations to the big heap. As we can see, the code relies on the sub-heaps providing a GetSize() function so that Realloc and Free can recover the size of incoming allocations; it has to be provided or the code won’t compile.

Conclusion
Using the heap layers technique, we can now implement and test new layers quite easily and compose them into complex allocators. There are a lot of utility layers that could be written; for example, a debug layer that keeps a header and a footer for each allocation to detect memory overwrites, as sketched below. Even the memory tracking system described in the second post could be written as a layer. Make sure to read the provided papers if you’ve read this far!
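
Here’s a minimal sketch of such a debug layer (my own illustration; Assert is assumed to be the engine’s assertion macro, and Realloc is omitted for brevity):

template<class SuperHeap> class DebugHeap: public SuperHeap
{
    static const tU32 GuardValue = 0xDEADC0DE;

    struct Header
    {
        tU32 Size;   // Requested allocation size
        tU32 Guard;  // Guard value to detect underruns
    };

public:
    inline void* Malloc(tU32 Size)
    {
        // Allocate room for a header in front and a footer guard behind
        tU8* Base = (tU8*)SuperHeap::Malloc(Size + sizeof(Header) + sizeof(tU32));
        if(!Base)
        {
            return NULL;
        }

        Header* Head = (Header*)Base;
        Head->Size = Size;
        Head->Guard = GuardValue;

        // The footer may be unaligned, so write it byte by byte
        tU32 Footer = GuardValue;
        memcpy(Base + sizeof(Header) + Size, &Footer, sizeof(Footer));

        return Base + sizeof(Header);
    }

    inline void Free(void* Ptr)
    {
        if(!Ptr)
        {
            return;
        }

        tU8* Base = (tU8*)Ptr - sizeof(Header);
        Header* Head = (Header*)Base;

        tU32 Footer;
        memcpy(&Footer, Base + sizeof(Header) + Head->Size, sizeof(Footer));

        Assert(Head->Guard == GuardValue);  // Detects memory underruns
        Assert(Footer == GuardValue);       // Detects memory overruns

        SuperHeap::Free(Base);
    }
};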

Using the Moving Average Filter for Profiling Stats

October 19, 2011

Introduction
When displaying profiling stats, FPS counters or any value that varies a lot from frame to frame, you don’t want to show the raw values, since they vary too much and make the stats unreadable.

Displaying an average over a certain period of time is a good idea. But how to compute that average?

Simple Moving Average
A common solution is to keep the last n values, then compute and display their average. This is called a simple moving average (SMA) and is computed like this:

// RecentValues holds the n most recent samples
tFloat MeanValue = 0.0f;
for(tU32 i=0; i<n; i++)
    MeanValue += RecentValues[i];
MeanValue /= n;

It works fine in some scenarios, like displaying an FPS counter. On the other hand, it’s not very appropriate when many values need to be tracked, as for profiling stats: the amount of data can become too large, and the time spent updating the values too high.
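
For reference, the n last values are typically kept in a ring buffer; a hypothetical sketch, assuming n is a compile-time constant:

tFloat RecentValues[n];
tU32 NextIndex = 0;

void AddValue(tFloat CurrentValue)
{
    // Overwrite the oldest sample
    RecentValues[NextIndex] = CurrentValue;
    NextIndex = (NextIndex + 1) % n;
}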

Exponential Moving Average
Another solution, which I’ve been using for years now, is the exponential moving average (EMA). Essentially, it moves the current mean value toward the actual value by a certain percentage every iteration. It’s computed like this:

MeanValue = MeanValue * 0.99f + CurrentValue * 0.01f;

The advantage of this solution is that you don’t need to keep the n last values, as opposed to the simple moving average.

Comparison
Here’s a graph comparing the two solutions:

[Graph: simple vs. exponential moving average on the same source data]

As you can see at the beginning of the graph, the EMA takes some time to react to big changes. This is easily fixed by initializing the mean value to the first sample; it wasn’t done here in order to show how the EMA reacts to big changes in the source data.
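
A minimal sketch of an EMA accumulator with that fix (the MovingAverage name is mine):

class MovingAverage
{
    tFloat Mean;
    bool HasValue;

public:
    MovingAverage()
    : Mean(0.0f)
    , HasValue(false)
    {}

    inline void Update(tFloat Value)
    {
        if(!HasValue)
        {
            // Seed with the first sample to avoid the slow ramp-up
            Mean = Value;
            HasValue = true;
        }
        else
        {
            Mean = Mean * 0.99f + Value * 0.01f;
        }
    }

    inline tFloat GetMean() const
    {
        return Mean;
    }
};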

Conclusion
Compared to the SMA, the EMA stays closer to the real averaged value, shows fewer variations and uses less memory and CPU, which makes it a good function for computing and displaying averaged values in a profiling system.


Hashing Strings and Pointers – Avoiding Common Pitfalls

October 12, 2011

Introduction
In a previous post, I suggested using fixed-size hash tables to map strings and pointers. Since this code really needs to be fast, reducing the probability of hash collisions is very important to maximize the hash table’s performance. But there are other pitfalls that need to be addressed too.

Hashing Strings
The most common error when hashing strings is to perform a full string comparison (using strcmp) on every node when traversing a hash bin’s list. Even if the hash function is quite bad and has a very high probability of generating the same hash value for different strings, this is not required at all.

For my tests, I hashed the complete dictionary found in /usr/dict/words, which contained 212710 unique words. Here are some of the results:

hairbrush hash_value=0x7f3ef2bb hash_bin=0x32bb
encomiums hash_value=0xa758b2bb hash_bin=0x32bb
engrained hash_value=0x32abb2bb hash_bin=0x32bb
   seavir hash_value=0x32abb2bb hash_bin=0x32bb

As you can see, all of these strings ended up in the same hash bin 0x32bb but have different hash values (except for ‘engrained’ and ‘seavir’, which share the same hash value).

In the hash table, it looks like this:

[0x32ba] phrenologically(0x888fb2ba)->Alfreda(0xee9432ba)->...
[0x32bb] encomiums(0xa758b2bb)->engrained(0x32abb2bb)->seavir(0x32abb2bb)->...
[0x32bc] centillion(0xb44232bc)->cyclometer(0xc5a172bc)->...

As an optimization, we can compare only the original hash value (i.e. the value before it was reduced to fit the hash table size) until we find a match. Only then do we need a string comparison to make sure the strings really match, as sketched below. This greatly improves the hash table’s performance and can be used with all non-POD key types.
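
Here’s a minimal sketch of that traversal (my own illustration; Strhash stands for the string hash function, such as the Bob Jenkins one mentioned below, and the table size is assumed to be a power of two):

struct HashNode
{
    tU32 HashValue;    // Full 32-bit hash, before masking to the table size
    const tChar* Str;
    HashNode* Next;
};

HashNode* Find(HashNode** Table, tU32 TableSize, const tChar* Str)
{
    tU32 HashValue = Strhash(Str);
    HashNode* Node = Table[HashValue & (TableSize - 1)];

    while(Node)
    {
        // Compare the cheap 32-bit value first; fall back to the expensive
        // string comparison only when the full hash values match
        if(Node->HashValue == HashValue && !strcmp(Node->Str, Str))
        {
            return Node;
        }

        Node = Node->Next;
    }

    return NULL;
}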

Attentive readers might have noticed that, in my previous post on memory tracking, I totally skipped the string comparison when hashing the allocation tags. I decided to trade possible errors due to hash value collisions for execution speed, simply because the hash function is so good that I never hit a single collision.

The hash function I’m using was published in Dr. Dobb’s Algorithm Alley by Bob Jenkins, in September 1997. I’ve been using this function in several games over the years and never hit a single hash collision, sometimes hashing hundreds of thousands of unique strings.

Hashing Pointers
When hashing pointers, special care must be taken because of alignment. It is tempting to simply use the pointer itself to index into the hash table, but this is a very bad idea, as pointers are usually aligned on 4, 8, 16 bytes or even more. Alignment leaves the lowest bits set to 0 and, depending on the hash table size, can lead to a non-uniform distribution, reducing the hash table to a simple linked-list in the worst case.

Over the years, I’ve been using Thomas Wang’s hash function to hash pointers on several platforms and it has worked very well in all cases:

tU32 HashPointer(void* Ptr)
{
    tU32 Value = (tU32)Ptr;
    Value = ~Value + (Value << 15);
    Value = Value ^ (Value >> 12);
    Value = Value + (Value << 2);
    Value = Value ^ (Value >> 4);
    Value = Value * 2057;
    Value = Value ^ (Value >> 16);
    return Value;
}

Conclusion
Always test your hash functions to make sure the distribution is uniform, especially for complex types (strings, GUIDs, etc). Pay special attention to types that may vary between platforms (pointers, etc). Also, whenever possible, prefer a fixed-size hash table that fits the data you’re hashing over a templated, varying-size one that may use more memory and won’t allow custom traversal comparison code.


Memory Management Part 2: Allocations Tracking

October 6, 2011

Introduction
In this post, I’ll cover the basics of memory allocations tracking. If you haven’t read my previous post on allocation strategy, make sure to read it before continuing.

Here’s what we want to achieve:

  • Keep track of every single allocation in the game with its allocation source (file, line, etc).
  • Categorize every allocation into high-level categories, such as Textures, Meshes, AI, etc.
  • Track memory leaks and generate categorized memory usage reports.
  • Minimize the impacts on the game’s performance, memory usage and fragmentation.

Minimizing Impacts
Keeping track of all allocations in a game engine can use a large amount of memory by itself. When tracking memory for usage reports or leaks, this may not be a problem; on the other hand, when using the system to gather information on fragmentation, it is crucial that it has minimal impact on memory usage, or it may alter the results.

How to achieve that? The obvious answer is that the tracking system should not allocate its memory from the game engine’s allocators. Here we introduce the FreeList class:

template<class T> class FreeList
{
    T* Free;

public:
    FreeList()
    : Free(NULL)
    {}

    inline T* New()
    {
        if(!Free)
        {
            const tU32 AllocSize = 65536;
            const tU32 NumAllocPerBatch = AllocSize / sizeof(T);

            T* AllocBatch = (T*)VirtualAlloc(NULL, AllocSize, MEM_COMMIT, PAGE_READWRITE);

            for(tU32 i=0; i<NumAllocPerBatch; i++)
            {
                Delete(AllocBatch++);
            }
        }

        T* Result = Free;
        Free = *((T**)Free);
        return Result;
    }

    inline void Delete(T* Ptr)
    {
        *(T**)Ptr = Free;
        Free = Ptr;
    }
};

A free-list is a very important concept in memory allocator design. The particularity of this implementation is that it allocates 64KB pages directly from the OS whenever it is empty. It then uses the first bytes of each free element that fits in the page to link to the other free elements.

Allocating directly from the OS ensures that the free-list doesn’t interfere with the engine’s normal memory allocations, and minimizes fragmentation since it allocates contiguous elements in OS pages.

In heavy usage scenarios, a FreeList will link free elements across multiple pages, but it still shouldn’t affect fragmentation, since it doesn’t share OS pages with the game engine’s allocators.

Tracking Allocations Hook
Now we’re ready to track allocations. The first step is to “hijack” the 3 memory functions we defined in the first part:

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = malloc(Size);
    RegisterAlloc(Result, Size, AllocType, Desc, File, Line);
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    return free(Ptr);
}

Implementation
Next, we need to implement RegisterAlloc, UpdateAlloc and UnregisterAlloc. The first step is to define the allocation structure AllocDesc, which holds information on a single allocation, and the category structure TagDesc, which holds the allocation category’s information, such as its name (the Desc string provided with each allocation call) and the total allocated size:

struct TagDesc;

struct AllocDesc
{
    void* Ptr;         // Allocation address
    tU32 Size;         // Allocation size
    const tChar* File; // Source file of the allocation
    tU32 Line;         // Source line of the allocation
    TagDesc* Tag;      // Pointer to the allocation category
    AllocDesc* Next;   // Next allocation in the hashtable
};

struct TagDesc
{
    tU32 CRC;          // Category description string CRC
    tU32 Size;         // Total size of all allocations in this category
    tChar Tag[128];    // Category name
    TagDesc* Next;     // Next category in the hashtable
};

The steps for registering an allocation look like this:

  1. Allocate a new AllocDesc from a FreeList<AllocDesc>.
  2. Register the allocation into its category.
  3. Register the new AllocDesc into the fixed-size hashtable.

void RegisterAlloc(void* Ptr, tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    // Allocate a new AllocDesc from the freelist
    AllocDesc* NewAllocDesc = FreeAllocs.New();

    // Set the allocation info
    NewAllocDesc->Ptr = Ptr;
    NewAllocDesc->Size = Size;
    NewAllocDesc->File = File;
    NewAllocDesc->Line = Line;

    // Register the allocation into its category (this updates the category total size)
    NewAllocDesc->Tag = TagDesc::Register(Desc, Size);

    // Register the allocation in the fixed-size hashtable
    tU32 PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);
    NewAllocDesc->Next = Hash[PtrHash];
    Hash[PtrHash] = NewAllocDesc;
}

Registering a new category goes like this:

  1. Look the category up in the hashtable.
  2. If it already exists, increment the category’s size by the allocation size.
  3. If it doesn’t exist, create a new one from a FreeList<TagDesc>.
  4. Register the new TagDesc into the hashtable.

TagDesc* TagDesc::Register(const tChar* Tag, tU32 Size)
{
    tU32 TagCRC = Strhash(Tag);
    tU32 TagHash = TagCRC & (ALLOC_DESC_SIZE - 1);
    TagDesc* CurTag = Hash[TagHash];

    while(CurTag)
    {
        if(CurTag->CRC == TagCRC)
        {
            // This category already exists; update its size and return it
            CurTag->Size += Size;
            return CurTag;
        }

        CurTag = CurTag->Next;
    }

    // Allocate a new TagDesc from the freelist
    CurTag = FreeTags.New();

    // Set the category info
    CurTag->CRC = TagCRC;
    CurTag->Size = Size;

    tU32 TagLen = Min(strlen(Tag), 127);
    memcpy(CurTag->Tag, Tag, TagLen);
    CurTag->Tag[TagLen] = 0;

    // Register the category in the fixed-size hashtable
    CurTag->Next = Hash[TagHash];
    Hash[TagHash] = CurTag;

    return CurTag;
}

UpdateAlloc and UnregisterAlloc are straightforward: simply update the allocation and its category accordingly. A possible implementation of UnregisterAlloc is sketched below.
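
Here’s a possible implementation of UnregisterAlloc (a sketch following the same hashing scheme as RegisterAlloc; Hash and FreeAllocs are the same hashtable and freelist used above):

void UnregisterAlloc(void* Ptr)
{
    tU32 PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);
    AllocDesc** Prev = &Hash[PtrHash];

    for(AllocDesc* Alloc = *Prev; Alloc; Prev = &Alloc->Next, Alloc = Alloc->Next)
    {
        if(Alloc->Ptr == Ptr)
        {
            // Credit the size back to the allocation's category
            Alloc->Tag->Size -= Alloc->Size;

            // Unlink from the hash bin and return the node to the freelist
            *Prev = Alloc->Next;
            FreeAllocs.Delete(Alloc);
            return;
        }
    }
}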

Analyzing the Data
Then comes the fun part: dumping and analyzing the gathered information. I usually dump all the data in a comma-separated values (CSV) format, because I tend to use Excel to analyze it:

void DumpAllocs()
{
    printf("Address,Category,CategorySize,AllocSize\n");
    for(tU32 i=0; i<ALLOC_DESC_SIZE; i++)
    {
        AllocDesc* Alloc = Hash[i];
        while(Alloc)
        {
            printf("0x%08x,%s,%d,%d\n", Alloc->Ptr, Alloc->Tag->Tag, Alloc->Tag->Size, Alloc->Size);
            Alloc = Alloc->Next;
        }
    }
}

Which looks like this when turned into a quick report in Excel:

  Meshes 11336K
  Sounds 41553K
Textures 7867K

Tracking memory leaks
With this memory tracking up and running, it’s easy to set up quick memory leak detection by dumping all allocations at a certain point in the game, then dumping another snapshot later on. By comparing the 2 snapshots, we can quickly show the deltas for each category and even for individual allocations. The tricky part is to analyze the data correctly and identify what is normal and what is a real memory leak.

Conclusion
This essentially covers the basics of memory allocation tracking. A lot of advanced features can be added, like recording call-stacks for each allocation, on-screen or even over-the-network memory visualization, etc. Simply implement what suits your needs best!

Memory Management Part 1: Introduction

October 3, 2011

This is the first of a series of posts on memory management in a game engine.

Series Outline
The topics I’d like to cover are numerous and may change depending on the response I get from previous posts. Among others, I will cover (in no particular order):

  • Allocation strategy
  • Usage tracking
  • Leaks detection
  • Allocators design
  • Multithread issues
  • Fragmentation
  • False sharing
  • Contention
  • Waste/overhead

Motivation
The memory management strategy has important effects on any game engine. The performance of the allocator(s) is important; memory leaks, overhead and fragmentation are hard to report and even harder to fix. The need for good tools is critical.

Allocation Strategy
In order to track memory allocations effectively and provide services like memory leak detection and memory usage reports, it is crucial that all allocations go through a single point in the engine. This ensures that everything is taken care of and reduces unknown memory usage to a minimum.

To do so, we define the only 3 functions that should be used when allocating memory:

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line);
void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line);
void _Free(void* Ptr);

That’s a good start, but the new and delete operators should also be overloaded:

inline void* operator new(size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void* operator new[](size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void operator delete(void* Ptr) { _Free(Ptr); }
inline void operator delete[](void* Ptr) { _Free(Ptr); }

Initially, routing these functions to the CRT‘s malloc, realloc and free should be fine. In future posts we’ll implement them using our own allocators.

Notice that every allocation has to provide a description string, a source file and a line number. These will be used to track allocations, generate reports on memory usage/fragmentation and detect memory leaks.

Also notice the AllocType parameter: it provides a hint on which allocator should be used to perform the allocation. This will be covered in future posts, such as the one on memory fragmentation reduction.

It could be tempting to allow allocations without this information (by using default parameters), but I highly recommend against it: being rigorous here will pay off in the end. Letting people allocate without providing accurate information could rapidly become a mess and would render sub-systems such as memory usage/fragmentation/leak tracking unusable.

To reduce the code overhead, we can define some convenient helper macros that automatically provide the source file and line numbers:

#define Malloc(SIZE, TYPE, DESC) _Malloc(SIZE, TYPE, DESC, __FILE__, __LINE__)
#define Realloc(PTR, SIZE) _Realloc(PTR, SIZE, __FILE__, __LINE__)
#define New(CLASS, TYPE, DESC) new(TYPE, DESC, __FILE__, __LINE__) CLASS
#define Delete delete
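
Hypothetical usage of these wrappers (ALLOC_DEFAULT and MyMesh are made-up names for the example; Free has no macro in this post, so _Free is called directly):

tU8* Buffer = (tU8*)Malloc(1024, ALLOC_DEFAULT, "Streaming buffer");
MyMesh* Mesh = New(MyMesh, ALLOC_DEFAULT, "Player mesh");

Delete Mesh;
_Free(Buffer);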

At this point, all allocations throughout the engine should be converted to the new wrappers we just wrote. This might not be an easy task depending on the size of the codebase, but it’s worth it.

Third-party external libraries
What about third-party external libraries? Any professional library should come with a way to route its internal memory allocations. If you’re using a library that doesn’t allow that, stop using it or contact its support so they provide such functionality.

Usually, third-party libraries provide either function definitions that the user needs to implement (such as Box2D‘s b2Alloc/b2Free) or an interface similar to this:

class MemoryInterface
{
public:
    virtual void* malloc(int Size)=0;
    virtual void* realloc(void* Ptr, int Size)=0;
    virtual void free(void* Ptr)=0;
};

Implementing this interface would look like this:

class EngineMemoryInterface: public MemoryInterface
{
public:
    void* malloc(int Size) { return Malloc(Size, 0, "ExternalLibraryName"); }
    void* realloc(void* Ptr, int Size) { return Realloc(Ptr, Size); }
    void free(void* Ptr) { _Free(Ptr); }
};

If the library also provides internal file and line number of the allocations, simply route them too.

Conclusion
Congratulations, you’ve wrapped all allocations in your own memory allocator stubs. This is only the first step, but a major one. To make sure everything stays under control, I usually put a breakpoint in the CRT‘s malloc and run the game once in a while. Simply open the breakpoints window, type ‘malloc’ in the function name field and play the game. The breakpoint should never be hit.

Now that we’re all set up, we’re ready to add functionality such as memory leaks tracking, etc. See you in a next post!
