Memory Management Part 1: Introduction

October 3, 2011

This is the first of a series of posts on memory management in a game engine.

Series Outline
The topics I’d like to cover are numerous and may change depending on the response I get from previous posts. Among others, I will cover (in no particular order):

  • Allocation strategy
  • Usage tracking
  • Leak detection
  • Allocator design
  • Multithreading issues
  • Fragmentation
  • False sharing
  • Contention
  • Waste/overhead

The memory management strategy has important effects on any game engine. Allocator performance is important; memory leaks, overhead and fragmentation are hard to report and even harder to fix. The need for good tools is critical.

Allocation Strategy
In order to track memory allocations effectively and provide services like memory leak detection and memory usage reports, it is crucial that all allocations go through a single point in the engine. This ensures that everything is taken care of and reduces unknown memory usage to a minimum.

To do so, we define the only 3 functions that should be used when allocating memory:

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line);
void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line);
void _Free(void* Ptr);

That’s a good start, but the new and delete operators should also be overloaded:

inline void* operator new(size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void* operator new[](size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void operator delete(void* Ptr) { _Free(Ptr); }
inline void operator delete[](void* Ptr) { _Free(Ptr); }

Initially, routing these functions to the CRT’s malloc, realloc and free should be fine. In future posts we’ll implement them using our own allocators.

Notice that every allocation has to provide a description string, file and line number. These will be used to track allocations, generate reports on memory usage and fragmentation, and detect memory leaks.

Also notice the AllocType parameter: it provides a hint on which allocator should perform the allocation. This will be covered in future posts on topics such as memory fragmentation reduction.

It could be tempting to allow allocating without this information (by using default parameters), but this is something I highly recommend against. Being rigorous here pays off in the end: if we let people allocate without providing accurate information, things quickly become a mess, and sub-systems such as memory usage/fragmentation/leak tracking are rendered unusable.

To reduce the code overhead, we can define some convenient helper macros that automatically provide source file and line numbers:

#define Malloc(SIZE, TYPE, DESC) _Malloc(SIZE, TYPE, DESC, __FILE__, __LINE__)
#define Realloc(PTR, SIZE) _Realloc(PTR, SIZE, __FILE__, __LINE__)
#define New(CLASS, TYPE, DESC) new(TYPE, DESC, __FILE__, __LINE__) CLASS
#define Delete delete

At this point, all allocations through the engine should be converted to the new wrappers we just wrote. This might not be an easy task depending on the size of the codebase, but it’s worth it.

Third-party external libraries
What about third-party external libraries? Any professional library should provide a way to route its internal memory allocations. If you’re using a library that doesn’t allow that, stop using it, or contact its support so they add such functionality.

Usually, third-party libraries provide either function definitions that the user needs to implement (such as Box2D’s b2Alloc/b2Free) or an interface similar to this:

class MemoryInterface
{
public:
    virtual void* malloc(int Size) = 0;
    virtual void* realloc(void* Ptr, int Size) = 0;
    virtual void free(void* Ptr) = 0;
};

Implementing this interface should look like this:

class EngineMemoryInterface: public MemoryInterface
{
public:
    void* malloc(int Size) { return Malloc(Size, 0, "ExternalLibraryName"); }
    void* realloc(void* Ptr, int Size) { return Realloc(Ptr, Size); }
    void free(void* Ptr) { _Free(Ptr); }
};

If the library also provides internal file and line number of the allocations, simply route them too.

Congratulations, you’ve wrapped all allocations with your own memory allocator stubs. This is only the first step, but a major one. To make sure everything stays under control, I usually put a breakpoint in the CRT’s malloc and run the game once in a while: simply open the breakpoints window, type ‘malloc’ in the function name field and play the game. The breakpoint should never be hit.

Now that we’re all set up, we’re ready to add functionality such as memory leak tracking. See you in the next post!

Generating primes

September 27, 2011

Here’s my implementation of the sieve of Eratosthenes in Python:

cur_num = 2
multiples = {2 : 2}

while True:
    is_prime = True

    for v in multiples:
        nv = multiples[v] - 1
        if nv == 0:
            is_prime = False
            multiples[v] = v
        else:
            multiples[v] = nv

    if is_prime == True:
        print cur_num
        multiples[cur_num] = cur_num

    cur_num = cur_num + 1

The difference from the pseudocode found on the Wikipedia page is that it doesn’t sieve numbers over a fixed range; it computes every occurrence of multiples of known primes at every step. I based my implementation on the fact that every non-prime number is a unique product of n primes, where n > 1. My previous post on prime numbers used that fact to implement a quick RTTI IsChildOf test.

How does it work? For every prime number encountered, a counter equal to that number is created (in the multiples map). Then, at each iteration, we decrement every counter. When a counter reaches zero, it means that the current number is divisible by that prime, thus it is not prime. When all counters are different from zero, the current number must be a prime, since there is no known prime divisor to compose it.

A nice property of this implementation is that when we encounter a non-prime number, the counters that just reached zero represent exactly the primes used to compose it. Here’s the updated code:

cur_num = 2
multiples = {2 : 2}

while True:
    is_prime = True

    for v in multiples:
        nv = multiples[v] - 1
        if nv == 0:
            is_prime = False
            multiples[v] = v
        else:
            multiples[v] = nv

    if is_prime == True:
        print cur_num, "[prime]"
        multiples[cur_num] = cur_num
    else:
        primes = list()
        for v in multiples:
            if multiples[v] == v:
                primes.append(v)
        print cur_num, "[nonprime] ", primes

    cur_num = cur_num + 1

Which outputs this:

83 [prime]
84 [nonprime]  [2, 3, 7]
85 [nonprime]  [5, 17]
86 [nonprime]  [2, 43]
87 [nonprime]  [3, 29]
88 [nonprime]  [2, 11]
89 [prime]
90 [nonprime]  [2, 3, 5]
91 [nonprime]  [7, 13]
92 [nonprime]  [2, 23]
93 [nonprime]  [3, 31]
94 [nonprime]  [2, 47]
95 [nonprime]  [5, 19]
96 [nonprime]  [2, 3]
97 [prime]

This is not the best implementation of this algorithm; it was mainly done to test the idea I had in mind and to learn Python. The main problem is that as we search for higher primes, speed decreases quite fast due to the number of prime counters to look up. There are several things we could do, like skipping even numbers (and updating the counters accordingly) or writing a C++ implementation.

Lessons learnt while spinning

September 24, 2011

In a recent project I worked on, I tried to minimize context switches by strategically replacing a critical section in our memory allocator with a custom spinlock that looked like this:

class SpinLock
{
    volatile tInt LockSem;

public:
    FORCEINLINE SpinLock()
    : LockSem(0)
    {
    }

    FORCEINLINE tBool Lock()
    {
        // Atomically swap the lock variable with 1 if it's currently equal to 0
        if(!InterlockedCompareExchange(&LockSem, 1, 0))
        {
            // We successfully acquired the lock
            return TRUE;
        }
        return FALSE;
    }

    FORCEINLINE void Unlock()
    {
        LockSem = 0;
    }
};

This indeed worked as predicted, but produced an unwanted side-effect on platforms without protection against thread priority inversion (where a high-priority thread spins in a loop while waiting for a low-priority thread to release a lock), such as the Xbox 360 (described in this technical article).

It created rare situations where the game deadlocked, and the time spent figuring out what was going on was quite high. At first, we played with thread priorities and affinities, but that had other unwanted side-effects. Another solution we tried was to yield after a certain number of spin loops, generating a context switch that lets the OS reschedule threads correctly. In the end, we reverted to a critical section.
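For reference, the “yield after a certain number of spins” variant we tried could be sketched like this in portable C++11 (std::atomic standing in for the Win32 Interlocked API above; the spin count of 64 is an arbitrary assumption, to be tuned per platform):

```cpp
#include <atomic>
#include <thread>

class YieldingSpinLock
{
    std::atomic<int> LockSem;

public:
    YieldingSpinLock() : LockSem(0) {}

    void Lock()
    {
        int Spins = 0;
        for (;;)
        {
            int Expected = 0;
            // Atomically swap the lock variable with 1 if it's currently 0
            if (LockSem.compare_exchange_weak(Expected, 1, std::memory_order_acquire))
                return; // lock acquired
            if (++Spins >= 64)
            {
                // Spun too long: force a context switch so the OS can
                // reschedule the (possibly lower-priority) lock holder
                std::this_thread::yield();
                Spins = 0;
            }
        }
    }

    void Unlock()
    {
        LockSem.store(0, std::memory_order_release);
    }
};
```

Note that yielding only mitigates the symptom; the starved lock holder still depends on the scheduler picking it up, which is why we ultimately went back to the OS primitive.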

Lesson learnt? We have to be very careful when busy-waiting, especially on OSs that don’t have priority-inversion protection mechanisms. Using the OS synchronization primitives is the best way of coordinating multiple threads, and when locking becomes a performance bottleneck, we should address the underlying problem instead (lockless programming, etc.).


GUID generation

September 21, 2011

I recently had to generate globally unique identifiers (GUIDs) for a personal project that I have at home. They could be generated concurrently on several threads and on multiple machines over the network, and the only rules I wanted to stick with were to avoid duplicates and be as fast as possible. Here’s what I ended up with:

void GUID::GenerateUnique()
{
    Time Time;

    // Atomically grab a per-call sequence number so concurrent calls differ
    static tU16 StaticCounter = 0;
    const tU16 StaticMask = (tU16)InterlockedIncrement(&StaticCounter);

    A = GUUID;
    B = GMACAddress >> 16;
    C = (StaticMask << 16) | (GMACAddress & 0xffff);
    D = (Time.Second + Time.Minute * 60 + Time.Hour * 60 * 60) | (Time.Month << 17) | (Time.Day << 22) | ((Time.Year - 2011) << 27);

    // Simple encryption
    tU32 Mask = (StaticMask << 16) | ((StaticMask * 0x0bb38435) + 0x3619636b);
    A = ~A ^ Mask;
    B = ~B ^ Mask;
    C = ~C ^ StaticMask; // don't encrypt the mask itself
    D = ~D ^ Mask;
}

A, B, C and D are 32-bit unsigned values. A contains a CRC32 of the PC’s username. The PC’s MAC address is encoded in the B and C parts; the upper 16 bits of C also contain a sequential number which is atomically incremented at each call, to avoid generating the same GUID when called from multiple threads. D contains a timestamp, similar to a Unix time. At the end, a quite simple encryption scheme is used to make sure the MAC address isn’t directly exposed in the GUID.

One cool thing about this is that at any point, a GUID can be decrypted and everything can be extracted from it: the CRC32 of the user who generated it, the MAC address and the timestamp:

String GUID::GetDescStr() const
{
    tU16 StaticMask = ~(C >> 16);
    tU32 Mask = (StaticMask << 16) | ((StaticMask * 0x0bb38435) + 0x3619636b);
    tU32 Mac0 = ~(B ^ Mask);
    tU32 Mac1 = ~(C ^ StaticMask);

    String Result = String::Printf(
        "UUID:%08x MAC:%02x-%02x-%02x-%02x-%02x-%02x ",
        ~(A ^ Mask),
        (tU32)((Mac0 & 0xff000000) >> 24),
        (tU32)((Mac0 & 0xff0000) >> 16),
        (tU32)((Mac0 & 0xff00) >> 8),
        (tU32)((Mac0 & 0xff)),
        (tU32)((Mac1 & 0xff00) >> 8),
        (tU32)((Mac1 & 0xff)));

    tU32 Timestamp = ~(D ^ Mask);
    tU32 Seconds = (Timestamp & 0x1ffff);
    tU32 Month = ((Timestamp >> 17) & 0x1f);
    tU32 Day = ((Timestamp >> 22) & 0x1f);
    tU32 Year = (Timestamp >> 27) + 2011;
    tU32 Hour = (Seconds / (60 * 60));
    Seconds -= Hour * 60 * 60;
    tU32 Minute = (Seconds / 60);
    Seconds -= Minute * 60;

    Result += String::Printf("Date:%d-%02d-%02d %d:%02d:%02d ", Year, Month, Day, Hour, Minute, Seconds);
    Result += String::Printf("Mask:0x%08x", Mask);

    return Result;
}

Size didn’t matter in my case, but this could easily be reduced to 96 bits by removing the CRC32 of the username, without changing the uniqueness of the GUIDs. If size really matters, one could possibly reduce it to 64 bits by keeping only the CRC32 of the username and the timestamp, but the function would probably need to be protected by a critical section to avoid two threads generating the exact same GUID.
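A sketch of that 64-bit variant could look like this (all names and the timestamp-bumping scheme are my own assumptions, not the post’s actual code, and std::mutex stands in for the critical section):

```cpp
#include <cstdint>
#include <mutex>

struct Guid64
{
    uint32_t A; // CRC32 of the PC's username
    uint32_t B; // timestamp, packed as in the 128-bit version's D field
};

static std::mutex GGuidLock;
static uint32_t GLastTimestamp = 0;

Guid64 GenerateUnique64(uint32_t UserNameCrc32, uint32_t Timestamp)
{
    std::lock_guard<std::mutex> Lock(GGuidLock);
    // Two calls within the same second would otherwise collide; bumping the
    // timestamp under the lock is exactly what the critical section buys us.
    if (Timestamp <= GLastTimestamp)
        Timestamp = GLastTimestamp + 1;
    GLastTimestamp = Timestamp;
    Guid64 Result;
    Result.A = UserNameCrc32;
    Result.B = Timestamp;
    return Result;
}
```

Without the per-call sequence number, the lock is what guarantees two threads on the same machine never emit identical values.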


Working hard doesn’t mean working late

September 19, 2011

John Carmack, founder and technical director of id Software, recently posted this on Twitter: “The meme that people are objectively less productive in total when working more than 40 hrs a week is idiotic.” This is true when you work 45-50 hours a week (or even more) because you’re motivated to do so. The problem starts when overtime becomes “highly suggested”, either directly by managers or indirectly (through peer pressure).

I’ve been working in this industry for 14 years now, and when I look back at my first years, Carmack’s statement made a lot more sense. I had no wife or kids, and I was mainly spending my free time at home programming anyway, so why not stay at work 60 hours a week and get free food? But times change: I now have kids, and as soon as I have to stay later than expected every day, motivation drops a lot. This blog post really explains how I feel.

Look around you, and tell me what you see: programmers working 60-70 hours per week, arriving late in the morning, taking long coffee breaks and laughing at funny cat videos on YouTube, or focused programmers with headsets who come in early, work hard on their tasks 40-45 hours per week and get the job done? Which would you prefer your team to be composed of?


Prime numbers

September 18, 2011

Prime numbers are a very interesting topic. I often find myself quite uncomfortable when setting my car’s radio volume to an even number, and I really feel better when it’s a prime number. But that is a completely different story.

So, what are the uses of prime numbers in programming? Cryptography, random number generation, better hashing functions, etc. One thing I recently discovered about prime numbers is that if you multiply n different primes together, it creates a special composite number (a squarefree number, since every prime factor appears exactly once) with a quite interesting property: there’s a way to know whether a certain prime p was used to compose it, simply by dividing by p. If the result is an integer, then p was used to compose the number. Take this example:

    3 * 7 * 13 * 53 * 227 = 3284463

To know if prime number 19 was used to compose it, we do:

    3284463 / 19 = 172866.474

Since the result is not an integer, 19 wasn’t used to compose the number 3284463. But, if we take a look at:

    3284463 / 3 = 1094821
    3284463 / 7 = 469209
    3284463 / 13 = 252651
    3284463 / 53 = 61971
    3284463 / 227 = 14469

All of these divide evenly, since each of those primes was used to compose the number. A practical usage of this property is an RTTI system: by assigning a different prime number to each class in the hierarchy, and multiplying each child class’s value by its parent’s value, we can determine whether class A derives from class B simply by checking whether A’s composite number is divisible by B’s number, instead of walking the whole parent chain:

tBool Class::IsChildOf(const Class* B) const
{
    return ((Prime % B->Prime) == 0);
}