Using the Moving Average Filter for Profiling Stats

October 19, 2011

Introduction
When displaying profiling stats, FPS counters or any other value that varies a lot from frame to frame, you don't want to show the raw values: they fluctuate too much and make the stats unreadable.

Displaying an average over a certain period of time is a good idea. But how do we compute that average?

Simple Moving Average
A common solution is to keep the last n values and compute their average for display. This is called a simple moving average (SMA) and is computed like this:

tFloat MeanValue = 0.0f;
for(tU32 i=0; i<n; i++)
    MeanValue += RecentValues[i];
MeanValue /= n;

It works fine in some scenarios, like displaying an FPS counter. On the other hand, it's not very appropriate when many values need to be computed, as for profiling stats: the amount of data can become too large and the time spent updating the values too high.
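
If the update cost is the main problem, the SMA can also be maintained incrementally. Here's a small sketch (my addition, reusing the engine-style typedefs from the snippets; the window size is an arbitrary power of two):

class IncrementalSMA
{
    static const tU32 WINDOW = 64;  // assumption: fixed power-of-two window

    tFloat Values[WINDOW];  // ring buffer of the last WINDOW samples
    tFloat Sum;             // running sum of the window
    tU32 Index;             // next slot to overwrite

public:
    IncrementalSMA()
    : Sum(0.0f), Index(0)
    {
        for(tU32 i=0; i<WINDOW; i++)
            Values[i] = 0.0f;
    }

    tFloat Update(tFloat NewValue)
    {
        // Replace the oldest sample in the running sum: O(1) per frame
        Sum += NewValue - Values[Index];
        Values[Index] = NewValue;
        Index = (Index + 1) & (WINDOW - 1);
        return Sum / WINDOW;
    }
};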

Exponential Moving Average
Another solution, which I've been using for years now, is the exponential moving average (EMA). Essentially, it moves the current mean value toward the actual value every iteration, by a certain percentage. It's computed like this:

MeanValue = MeanValue * 0.99f + CurrentValue * 0.01f;

The advantage of this solution is that you don't need to keep the last n values, as opposed to the simple moving average.
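
The 0.99/0.01 constants are just one choice of smoothing factor. A generalized version looks like this (a small sketch; UpdateEMA and Alpha are my names, not from the original post):

// EMA step with a configurable smoothing factor.
// Alpha in ]0, 1]: higher values react faster but smooth less.
tFloat UpdateEMA(tFloat MeanValue, tFloat CurrentValue, tFloat Alpha)
{
    return MeanValue * (1.0f - Alpha) + CurrentValue * Alpha;
}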

Comparison
Here’s a graph comparing the two solutions:

[Graph comparing the SMA and the EMA over the same noisy input data]

As you can see at the beginning of the graph, the EMA takes some time to react to big changes. This is easily fixed by initializing the mean value to the first sample; it wasn't done here, to show how the filter reacts to big changes in the source data.

Conclusion
Compared to the SMA, the EMA stays closer to the real averaged value, varies less and takes less memory and CPU, which makes it a good choice for computing and displaying averaged values in a profiling system.

Hashing Strings and Pointers – Avoiding Common Pitfalls

October 12, 2011

Introduction
In a previous post, I suggested using fixed-size hash tables to map strings and pointers. Since this code really needs to be fast, reducing the probability of hash collisions is very important to maximize the hash table's performance. But there are other pitfalls that need to be addressed too.

Hashing Strings
The most common error when hashing strings is to perform a string comparison (using strcmp) on every entry while traversing a hash bin's list. Even if the hash function is quite bad and has a very high probability of generating the same hash value for different strings, comparing the full strings at every node is not required at all.

For my tests, I hashed the complete dictionary found in /usr/dict/words, which contained 212710 unique words. Here are some of the results:

hairbrush hash_value=0x7f3ef2bb hash_bin=0x32bb
encomiums hash_value=0xa758b2bb hash_bin=0x32bb
engrained hash_value=0x32abb2bb hash_bin=0x32bb
   seavir hash_value=0x32abb2bb hash_bin=0x32bb

As you can see, all of these strings ended up in the same hash bin 0x32bb, but have different hash values (except for ‘engrained’ and ‘seavir’).

In the hash table, it looks like this:

[0x32ba] phrenologically(0x888fb2ba)->Alfreda(0xee9432ba)->...
[0x32bb] encomiums(0xa758b2bb)->engrained(0x32abb2bb)->seavir(0x32abb2bb)->...
[0x32bc] centillion(0xb44232bc)->cyclometer(0xc5a172bc)->...

As an optimization, we can compare only the original hash value (i.e. the value before it was reduced to fit the hash table size) until we find a match; only then do we need a string comparison to make sure the strings really match. This greatly improves the hash table's performance, and the trick works with all non-POD data types.
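
In code, the lookup looks roughly like this (a sketch; Node, Find and NUM_BINS are hypothetical names, and Strhash stands for whatever string hash you use):

struct Node
{
    tU32 HashValue;    // full 32-bit hash, stored at insertion time
    const tChar* Str;  // the original string
    Node* Next;
};

Node* Find(Node** Bins, const tChar* Str)
{
    tU32 HashValue = Strhash(Str);                 // full hash value
    Node* Cur = Bins[HashValue & (NUM_BINS - 1)];  // reduced to a bin index

    for(; Cur; Cur = Cur->Next)
    {
        // Cheap 32-bit compare first; strcmp only on a full-hash match
        if(Cur->HashValue == HashValue && strcmp(Cur->Str, Str) == 0)
            return Cur;
    }

    return NULL;
}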

Attentive readers might have noticed that, in my previous post on memory tracking, I totally skipped the string comparison when hashing the allocation tags. I decided to trade possible errors due to hash value collisions for execution speed, simply because the hash function is so efficient that I never hit a single collision.

The hash function I’m using was published in Dr. Dobb’s Algorithm Alley by Bob Jenkins, in September 1997. I’ve been using this function in several games over the years and never hit a single hash collision, sometimes hashing hundreds of thousands of unique strings.

Hashing Pointers
When hashing pointers, special care must be taken because of alignment. It is tempting to simply use the pointer itself to index into the hash table, but this is a very bad idea: pointers are mostly aligned on 4, 8, 16 bytes or even more. Alignment forces the lowest bits to 0 and, depending on the hash table size, this may lead to a non-uniform distribution, reducing the hash table to a simple linked list in the worst case.

Over the years, I’ve been using Thomas Wang’s hash function to hash pointers on several platforms and it worked very well in all cases:

tU32 HashPointer(void* Ptr)
{
    tU32 Value = (tU32)Ptr;
    Value = ~Value + (Value << 15);
    Value = Value ^ (Value >> 12);
    Value = Value + (Value << 2);
    Value = Value ^ (Value >> 4);
    Value = Value * 2057;
    Value = Value ^ (Value >> 16);
    return Value;
}
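
Before trusting any pointer hash, it's worth checking the bin distribution on real data. A quick sketch (my addition; NUM_BINS is a hypothetical power-of-two table size):

// Count how many pointers land in each bin; a uniform hash should give
// roughly Count/NUM_BINS entries per bin.
void CheckDistribution(void** Ptrs, tU32 Count)
{
    tU32 Counts[NUM_BINS] = {0};

    for(tU32 i=0; i<Count; i++)
        Counts[HashPointer(Ptrs[i]) & (NUM_BINS - 1)]++;

    for(tU32 i=0; i<NUM_BINS; i++)
        printf("[0x%04x] %u\n", i, Counts[i]);
}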

Conclusion
Always test your hash functions to make sure the distribution is uniform, especially for complex types (strings, GUIDs, etc). Pay special attention to types that may vary across platforms (pointers, etc). Also, whenever possible, prefer a fixed-size hash table that fits the data you're hashing over a templated, varying-size one that may use more memory and won't allow custom traversal comparison code.

Memory Management Part 2: Allocations Tracking

October 6, 2011

Introduction
In this post, I'll cover the basics of memory allocation tracking. If you haven't read my previous post on allocation strategy, make sure to read it before continuing.

Here’s what we want to achieve:

  • Keep track of every single allocation in the game with its allocation source (file, line, etc).
  • Categorize every allocation into high-level categories, such as Textures, Meshes, AI, etc.
  • Track memory leaks and generate categorized memory usage reports.
  • Minimize the impacts on the game’s performance, memory usage and fragmentation.

Minimizing Impacts
Keeping track of all allocations in a game engine can itself use a large amount of memory; when tracking memory for usage reports or leaks, this may not be a problem. On the other hand, when using the system to gather information on fragmentation, it is crucial that it has minimal impact on memory usage, or it may alter the results.

How to achieve that? The obvious answer is that the tracking system should not allocate its memory from the game engine’s allocators. Here we introduce the FreeList class:

template<class T> class FreeList
{
    T* Free;  // head of the intrusive list of free elements

public:
    FreeList()
    : Free(NULL)
    {}

    inline T* New()
    {
        if(!Free)
        {
            // Out of elements: grab a 64KB page directly from the OS
            // and carve it into as many elements as will fit
            const tU32 AllocSize = 65536;
            const tU32 NumAllocPerBatch = AllocSize / sizeof(T);

            T* AllocBatch = (T*)VirtualAlloc(NULL, AllocSize, MEM_COMMIT, PAGE_READWRITE);

            for(tU32 i=0; i<NumAllocPerBatch; i++)
            {
                Delete(AllocBatch++);
            }
        }

        // Pop the head of the free list
        T* Result = Free;
        Free = *((T**)Free);
        return Result;
    }

    inline void Delete(T* Ptr)
    {
        // The first bytes of a free element store the link to the next one,
        // so sizeof(T) must be at least sizeof(T*)
        *(T**)Ptr = Free;
        Free = Ptr;
    }
};

A free list is a very important concept when designing memory allocators. The particularity of this implementation is that it allocates 64KB pages directly from the OS whenever it is empty. It then uses the first bytes of each free element that fits in the page to link to the other free elements.

Allocating directly from the OS ensures that the free list doesn't interfere with normal memory allocations in the engine, and minimizes fragmentation since it allocates contiguous elements in OS pages.

In heavy usage scenarios, a FreeList will link free elements across multiple pages, but it still shouldn't affect fragmentation, since it doesn't share OS pages with the game engine's allocators.
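
The tracking code below draws its nodes from two such free lists. Something like this is assumed (hypothetical declarations; the names are inferred from the code that follows):

// Backing stores for the tracker, kept outside the engine's own heaps
static FreeList<AllocDesc> FreeAllocs;  // one node per live allocation
static FreeList<TagDesc> FreeTags;      // one node per allocation category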

Tracking Allocations Hook
Now we're ready to track allocations. The first step is to "hijack" the three memory functions we defined in the first part (_Malloc, _Realloc and _Free):

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = malloc(Size);
    RegisterAlloc(Result, Size, AllocType, Desc, File, Line);
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    free(Ptr);
}

Implementation
Next, we need to implement RegisterAlloc, UpdateAlloc and UnregisterAlloc. The first step is to define the allocation structure AllocDesc, which holds information on a single allocation, and the category structure TagDesc, which holds the allocation category's information, such as the category's name (the Desc string provided with each allocation call), the total allocated size, etc.:

struct TagDesc;  // forward declaration; defined below

struct AllocDesc
{
    void* Ptr;         // Allocation address
    tU32 Size;         // Allocation size
    const tChar* File; // Allocation source file
    tU32 Line;         // Allocation source line
    TagDesc* Tag;      // Pointer to the allocation category
    AllocDesc* Next;   // Next allocation in the hashtable
};

struct TagDesc
{
    tU32 CRC;          // Category description string CRC
    tU32 Size;         // Total size of all allocations in this category
    tChar Tag[128];    // Category name
    TagDesc* Next;     // Next category in the hashtable

    static TagDesc* Register(const tChar* Desc, tU32 Size);
};

The steps for registering an allocation look like this:

  1. Allocate a new AllocDesc from a FreeList<AllocDesc>.
  2. Register this new allocation into its category.
  3. Register the new AllocDesc into the fixed-size hashtable.
void RegisterAlloc(void* Ptr, tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    // Allocate a new AllocDesc from the freelist
    // (AllocType comes from _Malloc; storing it is left out here)
    AllocDesc* NewAllocDesc = FreeAllocs.New();

    // Set the allocation info
    NewAllocDesc->Ptr = Ptr;
    NewAllocDesc->Size = Size;
    NewAllocDesc->File = File;
    NewAllocDesc->Line = Line;

    // Register the allocation into its category (this updates the category total size)
    NewAllocDesc->Tag = TagDesc::Register(Desc, Size);

    // Register the allocation in the fixed-size hashtable
    tU32 PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);
    NewAllocDesc->Next = Hash[PtrHash];
    Hash[PtrHash] = NewAllocDesc;
}
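
GetPtrHash isn't shown here; a reasonable assumption is to reuse the Thomas Wang pointer hash from my post on hashing:

// Assumption: wrap the pointer hash from the hashing post
inline tU32 GetPtrHash(void* Ptr) { return HashPointer(Ptr); }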

Registering a new category goes like this:

  1. Look up the category in the hashtable.
  2. If it already exists, increment the category's size by the allocation size and return it.
  3. If it doesn't exist, allocate a new one from a FreeList<TagDesc>.
  4. Register the new TagDesc into the hashtable.
TagDesc* TagDesc::Register(const tChar* Tag, tU32 Size)
{
    tU32 TagCRC = Strhash(Tag);
    tU32 TagHash = TagCRC & (ALLOC_DESC_SIZE - 1);
    TagDesc* CurTag = Hash[TagHash];

    while(CurTag)
    {
        if(CurTag->CRC == TagCRC)
        {
            // This category already exists, update its size and return it
            CurTag->Size += Size;
            return CurTag;
        }

        CurTag = CurTag->Next;
    }

    // Allocate a new TagDesc from the freelist
    CurTag = FreeTags.New();

    // Set the category info
    CurTag->CRC = TagCRC;
    CurTag->Size = Size;

    tU32 TagLen = (tU32)Min(strlen(Tag), (size_t)127);
    memcpy(CurTag->Tag, Tag, TagLen);
    CurTag->Tag[TagLen] = 0;

    // Register the category in the fixed-size hashtable
    CurTag->Next = Hash[TagHash];
    Hash[TagHash] = CurTag;

    return CurTag;
}

UpdateAlloc and UnregisterAlloc are straightforward: simply update the allocation and its category accordingly. A sketch of UnregisterAlloc is shown below.
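
Here's roughly what UnregisterAlloc can look like (my reconstruction under the structures above, not the original code):

void UnregisterAlloc(void* Ptr)
{
    tU32 PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);
    AllocDesc** Prev = &Hash[PtrHash];

    for(AllocDesc* Cur = *Prev; Cur; Prev = &Cur->Next, Cur = Cur->Next)
    {
        if(Cur->Ptr == Ptr)
        {
            Cur->Tag->Size -= Cur->Size;  // keep the category total accurate
            *Prev = Cur->Next;            // unlink from the bin's list
            FreeAllocs.Delete(Cur);       // return the node to the freelist
            return;
        }
    }

    // Freeing an untracked pointer ends up here; worth asserting in debug builds
}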

Analyzing the Data
Then comes the fun part: dumping and analyzing the gathered information. I usually dump all the data in a comma-separated values (CSV) format, because I tend to use Excel to analyze it:

void DumpAllocs()
{
    printf("Address,Category,CategorySize,AllocSize\n");
    for(tU32 i=0; i<ALLOC_DESC_SIZE; i++)
    {
        AllocDesc* Alloc = Hash[i];
        while(Alloc)
        {
            printf("0x%08x,%s,%d,%d\n", Alloc->Ptr, Alloc->Tag->Tag, Alloc->Tag->Size, Alloc->Size);
            Alloc = Alloc->Next;
        }
    }
}

Which looks like this when turned into a quick report in Excel:

  Meshes    11336K
  Sounds    41553K
  Textures   7867K

Tracking memory leaks
With this memory tracking up and running, it's easy to set up a quick memory leak detection system: dump all allocations at a certain point in the game, then dump another snapshot later on. By comparing the two snapshots, we can quickly show the deltas for each category, or even for individual allocations. The tricky part is analyzing the data correctly and identifying what is normal and what is a real memory leak.

Conclusion
This essentially covers the basics of memory allocation tracking. A lot of advanced stuff can be added, like tracking allocations using the call stack, on-screen or even over-the-network memory visualization, etc. Simply implement what suits your needs best!

Memory Management Part 1: Introduction

October 3, 2011

This is the first of a series of posts on memory management in a game engine.

Series Outline
The topics I'd like to cover are numerous and may change depending on the response I get from previous posts. Among others, I will cover (in no particular order):

  • Allocation strategy
  • Usage tracking
  • Leaks detection
  • Allocators design
  • Multithread issues
  • Fragmentation
  • False sharing
  • Contention
  • Waste/overhead

Motivation
The memory management strategy has important effects on any game engine. Performance of the allocator(s) is important; memory leaks, overhead and fragmentation are hard to report and even harder to fix. The need for good tools is critical.

Allocation Strategy
In order to track memory allocations effectively and provide services like memory leak detection and memory usage reports, it is crucial that all allocations go through a single point in the engine. This ensures that everything is taken care of, and reduces unknown memory usage to a minimum.

To do so, we define the only 3 functions that should be used when allocating memory:

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line);
void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line);
void _Free(void* Ptr);

That’s a good start, but the new and delete operators should also be overloaded:

inline void* operator new(size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void* operator new[](size_t Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line) { return _Malloc(Size, AllocType, Desc, File, Line); }
inline void operator delete(void* Ptr) { _Free(Ptr); }
inline void operator delete[](void* Ptr) { _Free(Ptr); }

Initially, routing these functions to the CRT's malloc, realloc and free should be fine. In future posts, we'll implement them using our own allocators.

Notice that every allocation has to provide a description string, file and line number. This will be used to track allocations, generate reports on memory usage/fragmentation and detect memory leaks.

Also notice the AllocType parameter: this is to provide a hint on which allocator to use to perform this allocation. This will be covered in future posts, such as memory fragmentation reduction.

It could be tempting to allow allocating without this information (by using default parameters), but I highly recommend against it: being rigorous here pays off in the end. Letting people allocate without providing accurate information rapidly becomes a mess, and it would render sub-systems such as memory usage/fragmentation/leak tracking unusable.

To reduce the code overhead, we can define some convenient helper macros that automatically provide source file and line numbers:

#define Malloc(SIZE, TYPE, DESC) _Malloc(SIZE, TYPE, DESC, __FILE__, __LINE__)
#define Realloc(PTR, SIZE) _Realloc(PTR, SIZE, __FILE__, __LINE__)
#define Free(PTR) _Free(PTR) // assumed here for symmetry; used by the interface below
#define New(CLASS, TYPE, DESC) new(TYPE, DESC, __FILE__, __LINE__) CLASS
#define Delete delete
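
For illustration, call sites then look like this (Texture and ALLOC_TYPE_GFX are hypothetical names):

// Hypothetical usage of the wrappers defined above
void* Buffer = Malloc(4096, ALLOC_TYPE_GFX, "Textures");
Texture* Tex = New(Texture, ALLOC_TYPE_GFX, "Textures");

Delete Tex;
Free(Buffer);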

At this point, all allocations through the engine should be converted to the new wrappers we just wrote. It might not be an easy task depending on the size of the codebase, but it's worth it.

Third-party external libraries
What about third-party external libraries? Any professional library should come with a way to route its internal memory allocations. If you're using a library that doesn't allow that, stop using it, or contact its support so they provide such functionality.

Usually, third-party libraries provide either function definitions that the user needs to implement (such as Box2D's b2Alloc/b2Free) or an interface similar to this:

class MemoryInterface
{
public:
    virtual void* malloc(int Size)=0;
    virtual void* realloc(void* Ptr, int Size)=0;
    virtual void free(void* Ptr)=0;
};

Implementing this interface should look like this:

class EngineMemoryInterface: public MemoryInterface
{
public:
    void* malloc(int Size) { return Malloc(Size, 0, "ExternalLibraryName"); }
    void* realloc(void* Ptr, int Size) { return Realloc(Ptr, Size); }
    void free(void* Ptr) { Free(Ptr); }
};

If the library also provides internal file and line number of the allocations, simply route them too.

Conclusion
Congratulations, you've wrapped all allocations to your own memory allocator stubs. This is only the first step, but a major one. To make sure everything is still under control, I usually put a breakpoint in the CRT's malloc and run the game once in a while: simply open the breakpoints window, type 'malloc' in the function name and play the game. It should never hit the breakpoint.

Now that we're all set up, we're ready to add functionality such as memory leak tracking, etc. See you in the next post!

Generating primes

September 27, 2011

Here's my implementation of the sieve of Eratosthenes in Python:

cur_num = 2
multiples = {2 : 2}

while True:
    is_prime = True

    for v in multiples:
        nv = multiples[v] - 1
        if nv == 0:
            is_prime = False
            multiples[v] = v
        else:
            multiples[v] = nv

    if is_prime == True:
        print cur_num
        multiples[cur_num] = cur_num

    cur_num = cur_num + 1

The difference from the pseudocode found on the Wikipedia page is that it doesn't reject numbers in a fixed range; it computes every occurrence of multiples of the known primes at each step. I based my implementation on the fact that every non-prime number is a product of n primes, where n > 1, in a unique way (the fundamental theorem of arithmetic). My previous post on prime numbers used that fact to implement a quick RTTI IsChildOf test.

How does it work? For every prime number encountered, a counter equal to that number is created (in the multiples map). Then, at each iteration, we decrement every counter; when a counter reaches zero, the current number is divisible by that prime, thus it is not prime, and the counter is reset to its prime. When all counters are different from zero, the current number must be a prime, since no known prime divides it.

A nice property of this implementation is that when we encounter a non-prime number, the counters currently at zero represent exactly the primes that compose it. Here's the updated code:

cur_num = 2
multiples = {2 : 2}

while True:
    is_prime = True

    for v in multiples:
        nv = multiples[v] - 1
        if nv == 0:
            is_prime = False
            multiples[v] = v
        else:
            multiples[v] = nv
            
    if is_prime == True:
        print cur_num, "[prime]"
        multiples[cur_num] = cur_num
    else:
        primes = list()
        for v in multiples:
            if multiples[v] == v:
                primes.append(v)
        print cur_num, "[nonprime] ", primes

    cur_num = cur_num + 1

Which outputs this:

...
83 [prime]
84 [nonprime]  [2, 3, 7]
85 [nonprime]  [5, 17]
86 [nonprime]  [2, 43]
87 [nonprime]  [3, 29]
88 [nonprime]  [2, 11]
89 [prime]
90 [nonprime]  [2, 3, 5]
91 [nonprime]  [7, 13]
92 [nonprime]  [2, 23]
93 [nonprime]  [3, 31]
94 [nonprime]  [2, 47]
95 [nonprime]  [5, 19]
96 [nonprime]  [2, 3]
97 [prime]
...

This is not the best implementation of this algorithm; it was mainly done to test the idea I had in mind and to learn Python. The main problem is that as we search for higher primes, speed decreases quite fast due to the number of prime counters to look up. There are several things we could do, like skipping even numbers (updating the counters accordingly) or writing a C++ implementation, as sketched below.
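
For reference, here's a small standalone C++ translation of the same counter-based algorithm (my sketch, not part of the original post; it stops after 100 primes instead of looping forever):

#include <cstdio>
#include <map>

int main()
{
    // prime -> countdown until its next multiple, as in the Python version
    std::map<unsigned, unsigned> Multiples;
    Multiples[2] = 2;

    for(unsigned CurNum = 2, Found = 0; Found < 100; CurNum++)
    {
        bool IsPrime = true;

        for(std::map<unsigned, unsigned>::iterator It = Multiples.begin(); It != Multiples.end(); ++It)
        {
            if(--It->second == 0)
            {
                IsPrime = false;        // CurNum is a multiple of this prime
                It->second = It->first; // reset the countdown to the prime
            }
        }

        if(IsPrime)
        {
            printf("%u\n", CurNum);
            Multiples[CurNum] = CurNum;
            Found++;
        }
    }

    return 0;
}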

Lessons learnt while spinning

September 24, 2011

In a recent project I worked on, I tried to minimize context switches by strategically replacing a critical section in our memory allocator with a custom spinlock that looked like this:

class SpinLock
{
    volatile tInt LockSem;

public:
    FORCEINLINE SpinLock()
    : LockSem(0)
    {}

    FORCEINLINE void Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                MemoryBarrier();
                return;
            }
        }
    }

    FORCEINLINE void Unlock()
    {
        MemoryBarrier();
        LockSem = 0;
    }
};

This indeed worked as predicted, but produced an unwanted side effect on platforms that don't protect against thread priority inversion (where a high-priority thread spins in a loop while waiting for a low-priority thread to release a lock), such as the Xbox 360 (described in this technical article).

It created rare situations where the game deadlocked, and the time spent figuring out what was going on was quite high. At first, we played with thread priorities and affinities, but that had other unwanted side effects. Another solution we tried was to yield after a certain number of spin loops, generating a context switch that lets the OS reschedule threads correctly (sketched below). In the end, we reverted to a critical section.
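
For the record, the yield variant looked roughly like this (a sketch from memory; the spin count is arbitrary and SwitchToThread is the Win32 yield primitive):

FORCEINLINE void Lock()
{
    tU32 Spins = 0;
    while(InterlockedCompareExchange(&LockSem, 1, 0))
    {
        // Too many failed attempts: let the scheduler run other threads,
        // including the (possibly lower-priority) lock owner
        if(++Spins == 4000)
        {
            SwitchToThread();
            Spins = 0;
        }
    }
    MemoryBarrier();
}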

Lesson learnt? We have to be very careful when busy-waiting, especially on OSs that don't have priority inversion protection mechanisms. Using the OS synchronization primitives is the best way of coordinating multiple threads; when locking becomes a performance bottleneck, we have to address the underlying problem (using lockless programming, etc).

GUID generation

September 21, 2011

I recently had to generate globally unique identifiers (GUIDs) for a personal project of mine. They could be generated concurrently on several threads and on multiple machines over the network, and the only rules I wanted to stick to were to avoid duplicates and be as fast as possible. Here's what I ended up with:

void GUID::GenerateUnique()
{
    Time CurTime;
    GetTime(CurTime);

    // 16-bit sequential number, atomically incremented on each call
    static volatile tU32 StaticCounter = 0;
    tU16 StaticMask = (tU16)InterlockedIncrement(&StaticCounter);

    A = GUUID;
    B = GMACAddress >> 16;
    C = (StaticMask << 16) | (GMACAddress & 0xffff);
    D = (CurTime.Second + CurTime.Minute * 60 + CurTime.Hour * 60 * 60) | (CurTime.Month << 17) | (CurTime.Day << 22) | ((CurTime.Year - 2011) << 27);

    // Simple encryption
    tU32 Mask = (StaticMask << 16) | ((StaticMask * 0x0bb38435) + 0x3619636b);
    A = ~A ^ Mask;
    B = ~B ^ Mask;
    C = ~C ^ StaticMask; // don't encrypt the mask itself
    D = ~D ^ Mask;
}

A, B, C and D are 32-bit unsigned values. A contains a CRC32 of the PC's username. The PC's MAC address is encoded in the B and C parts. The upper 16 bits of C contain a sequential number, atomically incremented at each call, to avoid generating the same GUID when called from multiple threads. D contains a timestamp, similar to a Unix time. At the end, a quite simple encryption scheme is used to make sure the MAC address isn't directly exposed in the GUID.
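
Summarized, the layout before encryption looks like this (my annotation of the code above):

// A: CRC32 of the username
// B: MAC address, bits 47..16
// C: sequential counter (16 bits) | MAC address, bits 15..0
// D: Year-2011 (5 bits) | Day (5 bits) | Month (5 bits) | second of the day (17 bits)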

One cool thing about this is that at any point, a GUID can be decrypted and everything can be extracted from it; the CRC32 of the user who generated it, the MAC address and the timestamp:

String GUID::GetDescStr() const
{
    tU16 StaticMask = ~(C >> 16);
    tU32 Mask = (StaticMask << 16) | ((StaticMask * 0x0bb38435) + 0x3619636b);
    tU32 Mac0 = ~(B ^ Mask);
    tU32 Mac1 = ~(C ^ StaticMask);

    String Result = String::Printf(
        "UUID:%08x MAC:%02x-%02x-%02x-%02x-%02x-%02x ",
        ~(A ^ Mask),
        (tU32)((Mac0 & 0xff000000) >> 24),
        (tU32)((Mac0 & 0xff0000) >> 16),
        (tU32)((Mac0 & 0xff00) >> 8),
        (tU32)((Mac0 & 0xff)),
        (tU32)((Mac1 & 0xff00) >> 8),
        (tU32)((Mac1 & 0xff))
    );

    tU32 Timestamp = ~(D ^ Mask);
    tU32 Seconds = (Timestamp & 0x1ffff);
    tU32 Month = ((Timestamp >> 17) & 0x1f);
    tU32 Day = ((Timestamp >> 22) & 0x1f);
    tU32 Year = (Timestamp >> 27) + 2011;
    tU32 Hour = (Seconds / (60 * 60));
    Seconds -= Hour * 60 * 60;
    tU32 Minute = (Seconds / 60);
    Seconds -= Minute * 60;

    Result += String::Printf("Date:%d-%02d-%02d %d:%02d:%02d ", Year, Month, Day, Hour, Minute, Seconds);
    Result += String::Printf("Mask:0x%08x", Mask);

    return Result;
}

Size didn't matter in my case, but the GUID could easily be reduced to 96 bits by removing the CRC32 of the username, without changing its uniqueness. If size really matters, one could possibly go down to 64 bits by keeping the CRC32 of the username and the timestamp, but the function would then need to be protected by a critical section to avoid two threads generating the exact same GUID.
