Optimizing the recursive read-write spinlock

January 12, 2014

In my previous post on implementing a recursive read-write spinlock, I briefly talked about trivial optimizations that could be done to improve the time spent in the lock functions.

In this post, we’ll measure these optimizations, propose further optimizations to the read-write spinlock, and compare the results.

Stress tests
The numbers given in this post come from two stress-test applications written to compare the different optimizations. Essentially, the tests consist of spawning a number of threads and executing a huge number of lock operations in parallel: a read test that only exercises the reader locks, without any writers involved, and a read-write test that exercises reader and writer locks simultaneously.
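To give an idea of the setup, here's a minimal sketch of what the read test might look like; std::shared_mutex stands in for the lock under test (in the actual tests, the spinlock implementations from these posts are plugged in instead), and the thread and iteration counts are arbitrary:

#include <shared_mutex>
#include <thread>
#include <vector>

// Minimal read-test sketch: N threads each acquire/release the reader lock M times.
// std::shared_mutex is only a stand-in for the spinlocks compared in this post.
int main()
{
    std::shared_mutex Lock;
    volatile int SharedValue = 0;
    const int NumThreads = 8, NumIterations = 1000000;

    std::vector<std::thread> Threads;
    for(int t = 0; t < NumThreads; ++t)
    {
        Threads.emplace_back([&]
        {
            for(int i = 0; i < NumIterations; ++i)
            {
                Lock.lock_shared();
                (void)SharedValue;      // simulate a tiny read-only critical section
                Lock.unlock_shared();
            }
        });
    }

    for(auto& T : Threads)
        T.join();
}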

Measuring optimizations from the first post

The cost of atomic operations
The first optimization from the first post was to reduce the number of atomic operations by busy-waiting while the writer bit was set, only attempting an atomic operation when required.

The measured speedup in the read-write test was ~34%. We can clearly see that atomic operations aren’t free.

Busy-waiting
The second optimization that was done in the first post was to yield the processor while waiting.

The measured speedup in the read-write test was ~5%; since the test used the same number of threads as logical cores, the gain isn’t spectacular. On the other hand, applications with many more threads than logical processors would see a higher performance gain, so this optimization is still valid in all cases.

Readers contention

Something I noticed after writing the spinlock post is that it suffers from reader contention, due to the CAS operations.

Consider this simple example:

Thread-1               Thread-2               Thread-3
1: Read lock value     1: Read lock value     1: Read lock value
2: Increment readers   2: Increment readers   2: Increment readers
3: CAS (succeeds)      3: CAS (fails)         3: CAS (fails)
4: ...                 4: Read lock value     4: Read lock value
5: ...                 5: Increment readers   5: Increment readers
6: ...                 6: CAS (succeeds)      6: CAS (fails)
7: ...                 7: ...                 7: Read lock value
8: ...                 8: ...                 8: Increment readers
9: ...                 9: ...                 9: CAS (succeeds)

This is technically not contention, as each thread is making progress, but it goes against the original purpose of the read-write spinlock mentioned in the first post: it must allow any number of readers to acquire the lock without contention.

The key to solving this problem is to avoid CAS loops by using other types of atomic instructions.

New implementation

In the new implementation, I’ll use only atomic increments, decrements, adds and subtracts. This completely removes the CAS contention.

The spinlock will still be implemented using a 32-bit value, but more bits will need to be reserved for the writers (we’ll see why shortly):

|WWWWWWWWWWWW|RRRRRRRRRRRRRRRRRRRR|

The writer bits are still carefully placed on the most significant bits, as described in the previous post.

Acquiring a reader lock

To acquire the reader lock, we atomically increment the lock value and inspect the return value (line 8): if any of the writer bits are set, it means a writer held the lock before us and we failed. In that case, we atomically decrement the lock value and try again (line 11). Otherwise, we acquired the reader lock successfully.

FORCEINLINE void LockReader()
{
    for(;;)
    {
        // Wait for active writer to release the lock
        while(Lock & 0xfff00000)
            kYieldProcessor();

        if((kInterlockedIncrement(&Lock) & 0xfff00000) == 0)
            return;

        kInterlockedDecrement(&Lock);
    }
}

Releasing a reader lock

Releasing a reader lock with the new implementation can be achieved by atomically decrementing the lock value (same code as previous implementation):

FORCEINLINE void UnlockReader()
{
    kInterlockedDecrement(&Lock);
}

Acquiring the writer lock

To acquire the writer lock, we atomically increment the number of writers (using an atomic add) and inspect the return value (line 9): if only the first writer bit is set after this operation, we acquired the writer lock successfully and simply wait for any active readers to finish (line 12). Otherwise, another writer held the lock before us and we failed. In that case, we atomically decrement the number of writers (using an atomic subtract) and try again (line 18).

FORCEINLINE void LockWriter()
{
    for(;;)
    {
        // Wait for active writer to release the lock
        while(Lock & 0xfff00000)
            kYieldProcessor();

        if((kInterlockedAdd(&Lock, 0x100000) & 0xfff00000) == 0x100000)
        {
            // Wait until there's no more readers
            while(Lock & 0x000fffff)
                kYieldProcessor();

            return;
        }

        kInterlockedSub(&Lock, 0x100000);
    }
}

This is why the lock value now needs more bits for the writer part: when several writers try to acquire the lock at the same time, the writer count can temporarily climb as high as the number of contending writer threads. When the count isn’t exactly one, the writer failed to acquire the lock and decrements it before trying again.

In this example, 12 bits were reserved for the writers, allowing up to 4096 concurrent writer threads. This value should be tweaked depending on the application.
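If the split needs to be tuned per application, it helps to express it as named constants instead of scattering magic masks around; a small sketch (the constant names are mine, not from the original code):

// Hypothetical constants describing the 12/20 bit split used in this post.
typedef unsigned int tU32;

static const tU32 kWriterShift = 20;                    // number of reader bits
static const tU32 kWriterMask  = 0xfff00000;            // upper 12 bits: writer count
static const tU32 kReaderMask  = 0x000fffff;            // lower 20 bits: reader count
static const tU32 kWriterOne   = 1u << kWriterShift;    // 0x100000, "one writer"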

Releasing the writer lock

Releasing a writer lock is done by atomically decrementing the number of writers, using an atomic subtract:

FORCEINLINE void UnlockWriter()
{
    kInterlockedSub(&Lock, 0x100000);
}

Optimization results

The measured speedup for the new implementation in the read test was ~62%. Remember that this test only acquires the reader lock, with no writers involved. This is quite an improvement!

Overall results comparison

Here are the results of all optimizations described in this post:

  • SpinLock is a standard spinlock.
  • RWLock is the read-write spinlock from the first post, without any optimization.
  • RWLock_1 adds the atomic operations reduction optimization.
  • RWLock_2 adds the yield processor optimization.
  • RWLock_3 is the new implementation, with all optimizations.

Conclusion

Some things to keep in mind:

  • Atomic operations aren’t free; in fact, they cost a lot and should be avoided whenever possible.
  • When writing lock-free algorithms using CAS loops, think twice and ask yourself whether the same thing could be done using other atomic operations.

Implementing a recursive read-write spinlock

January 3, 2014

Very often I see programmers using a critical section to make some code thread-safe. Most of the time, the protected code only reads data, yet it still blocks other reading threads, leading to poor performance.

Most of the time, a read-write lock can be used instead, blocking readers only when a write operation is in progress and greatly reducing the time spent waiting for other threads to release the lock.

Why a spinlock?
In a previous post, I spoke about the dangers of busy-waiting, especially when implementing our own spinlocks on OSs that don’t support priority-inversion protection mechanisms.

However, writing our own spinlock implementation can have several advantages, the main one being pedagogic. Other advantages of spinlocks are that they are lightweight and don’t trigger a context switch when waiting.

Requirements

A read-write spinlock has the following requirements:

  1. Must allow any number of readers to acquire the lock without contention.
  2. Must allow only a single writer to acquire the lock, which involves:
    • Blocking future readers.
    • Waiting until active readers release their locks.

Allowing a writer to acquire the lock while there are still active readers is important, as it prevents write starvation: if the lock waited until there were no active readers before letting a writer in, a writer might never be able to acquire the lock.

Implementation

Our spinlock will be implemented using a 32-bit value. Because only a single writer can acquire the lock, we’ll reserve a single bit to know whether the lock is held by a writer and the 31 other bits to count the number of active readers (allowing 2^31 readers):

|W|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|

The writer bit is carefully placed on the most significant bit; we’ll see why later on.

The implementation will use the compare-and-swap (CAS) atomic operation. If you’re not sure what this operation does, I suggest that you take some time to read my previous post on atomic operations before continuing.

RWSpinLock definition

As described in the previous section, the spinlock contains a single 32-bit value, initialized to zero:

class RWSpinLock
{
    volatile tU32 Lock;

public:
    RWSpinLock()
    : Lock(0)
    {}
};

Acquiring a reader lock

Acquiring the lock for a reader goes like this:

  1. Read the lock value and remove the writer bit (line 5).
  2. Compute the new lock value by incrementing the number of readers (line 6).
  3. Try to atomically exchange the new value with the old one (line 8).
FORCEINLINE void LockReader()
{
    for(;;)
    {
        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Let’s take a look at each operation in detail.

Step 1
The first step reads the current lock value and removes the writer bit. This ensures that the CAS operation will fail if a writer currently holds the lock: we would be comparing a value without the writer bit to the current value with the writer bit set.

Step 2
The second step simply increments the number of readers by one: it takes the expected value (without the writer bit) and adds one.

Step 3
The last step tries to atomically set the new lock value using the CAS operation. Remember that in step 1 we removed the writer bit; that way, if the bit was set before or during our lock attempt, the CAS operation will fail and we’ll have to try again.

Releasing a reader lock

Releasing a reader lock is simple: we just decrement the lock value (line 3):

FORCEINLINE void UnlockReader()
{
    kInterlockedDecrement(&Lock);
}

Remember that the writer bit was carefully placed on the most significant bit? The reason is that even if a writer has acquired the lock during our read operation and is waiting for readers to end, decrementing the lock value will never affect the writer bit. This has the advantage of avoiding a potential CAS loop.

Acquiring the writer lock

Acquiring the lock for a writer goes like this:

  1. Read the lock value and remove the writer bit (line 5).
  2. Compute the new lock value by setting the writer bit (line 6).
  3. Try to atomically exchange the new value with the old one (line 8).
  4. If step 3 worked, wait for active readers to release their locks (line 10).
FORCEINLINE void LockWriter()
{
    for(;;)
    {
        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = (OldLock | 0x80000000);

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
        {
            while(Lock & 0x7fffffff);
            return;
        }
    }
}

Let’s take a look at each operation in detail.

Step 1
The first step reads the current lock value and removes the writer bit. As with the reader lock, this ensures that the CAS operation will fail if a writer currently holds the lock or acquired it while we were computing the new value.

Step 2
The second step simply sets the writer bit, marking the lock as held by a writer.

Step 3
The third step tries to atomically set the new lock value using the CAS operation. Again, this step is similar to acquiring a reader lock.

Step 4
The last step will simply busy-wait for all readers to release their locks.

Releasing the writer lock

Releasing a writer lock is simple: we just set the lock value to zero (line 4):

FORCEINLINE void UnlockWriter()
{
    kAssert(Lock == 0x80000000);
    Lock = 0;
}

Normally, only the writer bit should be set at this point (no readers should have the lock acquired), so setting the lock value to zero will release the writer lock and let potential waiting readers and/or writers to acquire the lock.

Recursion

Expert readers might have noticed that the code doesn’t fully support recursive locks.

Readers recursion
The code inherently supports recursion for readers: if a thread acquires the reader lock several times, this will simply increment the number of readers properly and will work as expected.

That said, reader recursion should be used with great caution: if a thread holds a reader lock and another thread starts acquiring the writer lock, a recursive reader acquisition will lead to a deadlock.
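To make the hazard concrete, here's the interleaving spelled out; this sketch uses the RWSpinLock from this post and deadlocks by design, so it's only meant to be read:

RWSpinLock Lock;

// Thread A: takes a read lock.
Lock.LockReader();          // readers = 1

// Thread B: starts acquiring the writer lock. It sets the writer bit
// (blocking any *new* readers) and spins, waiting for readers to reach zero.

// Thread A: tries to take the read lock recursively.
Lock.LockReader();          // spins forever: the writer bit makes the CAS fail,
                            // while Thread B keeps waiting for Thread A to release
                            // its first reader lock. Both threads are now stuck.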

Writers recursion
On the other hand, acquiring the writer lock is not recursive: when the writer lock is already held, any attempt by the same thread to acquire it again will spin forever on the CAS of the LockWriter function (line 8).

Adding writer recursion is just a matter of deriving from our read-write spinlock class to keep track of the current thread id and recursion depth:

class RWRecursiveSpinLock: public RWSpinLock
{
    tU32 ThreadID;
    tU32 RecursiveDepth;

public:
    FORCEINLINE RWRecursiveSpinLock()
    : ThreadID(-1)
    , RecursiveDepth(0)
    {}
};

Acquiring the writer lock
When the current thread already has the writer lock, we simply increment the recursion depth and return without calling the lock function; other threads trying to acquire the writer lock will wait on line 7:

FORCEINLINE void LockWriter()
{
    tU32 curThreadID = kGetCurrentThreadId();

    if(ThreadID != curThreadID)
    {
        RWSpinLock::LockWriter();
        kAssert(RecursiveDepth == 0);
        ThreadID = curThreadID;
    }

    RecursiveDepth++;
}

Releasing the writer lock
When releasing the writer lock, we first decrement the recursion depth (line 7) and only call the unlock function if the depth is zero (line 12), meaning that this is the last unlock call:

FORCEINLINE void UnlockWriter()
{
    tU32 curThreadID = kGetCurrentThreadId();
    kAssert(ThreadID == curThreadID);

    kAssert(RecursiveDepth > 0);
    RecursiveDepth--;

    if(RecursiveDepth == 0)
    {
        ThreadID = -1;
        RWSpinLock::UnlockWriter();
    }
}

Remember that all this is done under the writer lock, so no atomic operations are required and recursion is safe.

Optimizations

The code presented in this post can be optimized further to reduce its impact on performance.

Reducing atomic operations
Atomic operations aren’t free, so the lock functions should first wait until the writer bit is cleared before attempting the CAS (line 5):

FORCEINLINE void LockReader()
{
    for(;;)
    {
        while(Lock & 0x80000000);

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Waiting until the writer bit is cleared is not a guarantee that we’ll successfully acquire the lock, since other threads might be competing for the writer lock at the same time, but it greatly reduces the number of atomic operations issued while waiting.

Busy-waiting
Currently, the presented code busy-waits while waiting for locks to be released or when the CAS operation fails; this may not be the best thing to do, depending on the situation.

When running on systems with hyper-threading, or when there are more threads than physical processors, it can happen that the spinlock ends up waiting for another thread scheduled on the same logical processor. If this situation happens too often, using a kernel lock should be considered.

In situations where a spinlock is wanted, simply yielding the processor to the other hyper-thread (using the YieldProcessor macro on MS platforms, for example) has a positive impact on performance (line 6):

FORCEINLINE void LockReader()
{
    for(;;)
    {
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Final optimized code
Here’s what the final lock functions look like:

FORCEINLINE void LockReader()
{
    for(;;)
    {
        // Wait until there's no active writer
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

FORCEINLINE void LockWriter()
{
    for(;;)
    {
        // Wait until there's no active writer
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = (OldLock | 0x80000000);

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
        {
            // Wait for active readers to release their locks
            while(Lock & 0x7fffffff)
                kYieldProcessor();

            return;
        }
    }
}

Conclusion

In this post, I presented a complete, recursive read-write spinlock, along with some hints on when it should be used.

Always remember to profile your code to see if such a lock could be useful in your particular case and make sure it has a positive impact.

Next: Optimizing the recursive read-write spinlock.

Understanding Memory Ordering

March 8, 2012

Introduction

In a previous post on atomic operations, I skipped an important topic, which is memory ordering issues.

Under the hood, the processor might reorder memory operations (reads and writes) for many reasons. This is normally not a problem for programs running on single-core machines. On the other hand, multi-threaded programs running on multi-core machines can suffer from it: the order in which read and write operations are performed by the processor can differ from the order in which the programmer wrote them.

Simple Example
Let’s take a look at a simple example:

volatile bool Ready = false;
int Value = 0;

// Thread A
while(!Ready) {}
printf("%d", Value);

// Thread B
Value = 1;
Ready = true;

The expected value to be printed is 1, obviously. But what if Ready ends up being written to memory before Value does? Thread A could then see Ready as true while Value is still 0, printing the wrong value and resulting in unpredictable behavior.

Production Code Example
In a previous game I worked on, on Xbox 360, we had a piece of code that registered strings in a hash-table using the CRC of the string. Only the main thread was allowed to create new strings in the table, but all threads were allowed to look up existing strings.

// Called to register a new string in the hash-table, returning the unique CRC value.
tU32 RegisterString(tChar* String)
{
    // Allocate a new StringDesc from the freelist.
    StringDesc* NewStringDesc = FreeStringDescs.Allocate();

    // Set the string desc info.
    NewStringDesc->String = String;
    NewStringDesc->CRC = StrHash(String);

    // Prepare the entry to be registered into the hash-table.
    tU32 HashBucket = NewStringDesc->CRC & (STRING_DESC_SIZE - 1);
    NewStringDesc->Next = Hash[HashBucket];

    // A memory barrier is required here on some architectures, before registering
    // the new entry into the hash-table. Otherwise, other threads might access the
    // entry before String/CRC/Next members gets written in memory, resulting in
    // random crashes.
    ExportBarrier();

    // Register the allocation in the hashtable. At this point, this entry is
    // visible to all other threads.
    Hash[HashBucket] = NewStringDesc;

    return NewStringDesc->CRC;
}

It took a while to figure out why we had random crashes with a very low reproduction rate before realizing that we needed a way to ensure that the new entry’s members were written to memory before actually registering it into the shared hash-table.

As you can see in line 19, a memory barrier was used.

Memory Barriers

Memory barriers are a set of processor instructions used to force pending memory operations to complete before continuing. There are three types of barrier semantics: acquire, release and fence barriers.

Acquire Semantics
Whenever you use atomic operations to gain access to some data, your code must make sure that other processors see the lock acquisition before any of the changes made after it. This is what we call acquire semantics, because the code is trying to acquire ownership of some data. This is also referred to as a read barrier or import barrier.

  operation 1
  operation 2
<-operation 3-Acquire-> 3 is visible before 4-5
  operation 4
  operation 5

Release Semantics
On the other hand, if you use atomic operations to release recently modified data, your code must make sure that the new data is visible before releasing it. This was exactly the case in the previous string-table example. This is what we call release semantics, because the code is trying to release ownership of some data. This is also referred to as a write barrier or export barrier.

  operation 1
  operation 2
<-operation 3-Release-> 1-2 are visible before 3
  operation 4
  operation 5

Fence Semantics
Fence semantics combine the behavior of both acquire and release semantics.

  operation 1
  operation 2
<-operation 3-Fence-> 1-2 are visible before 3, 3 is visible before 4-5
  operation 4
  operation 5
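As a concrete illustration, here's how the Ready/Value example from the introduction could be fixed with release/acquire semantics; this sketch uses C++11 std::atomic rather than the platform wrappers used elsewhere in this post:

#include <atomic>
#include <cstdio>

std::atomic<bool> Ready(false);
int Value = 0;

// Thread B: write Value, then set Ready with release semantics so the
// write to Value is visible before the write to Ready.
void Producer()
{
    Value = 1;
    Ready.store(true, std::memory_order_release);
}

// Thread A: spin on Ready with acquire semantics so the read of Value
// cannot be reordered before the load of Ready; Value is guaranteed to be 1 here.
void Consumer()
{
    while(!Ready.load(std::memory_order_acquire)) {}
    printf("%d", Value);
}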

Implementations

As the implementation differs depending on the architecture, it’s a good habit to wrap this functionality into platform-agnostic functions, as seen in the previous examples.

Windows
On Windows, the atomic functions act as full memory barriers (fences). Each of these functions also comes in variants with “Acquire” or “Release” in their names, which generate the corresponding barrier semantics on platforms that support them (such as Itanium).

For example, the InterlockedCompareExchange function also comes as InterlockedCompareExchangeAcquire and InterlockedCompareExchangeRelease.

PowerPC
On PowerPC architectures, such as the Xbox 360 and PS3, memory barriers are implemented using the lwsync instruction. There is no distinction between the different semantics, so it acts as a read/write memory barrier.

Synchronization Primitives

Synchronization primitives require deep knowledge of the architecture’s memory model and must use memory barriers to prevent memory-ordering issues. As a general rule, programmers should never write their own locks, except in rare situations with special requirements.

Spin Lock Example
Let’s take a look at the spinlock I showed in my Lessons learnt while spinning post:

class SpinLock
{
    volatile tInt LockSem;

public:
    FORCEINLINE SpinLock()
    : LockSem(0)
    {}

    FORCEINLINE void Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                ImportBarrier();
                return;
            }
        }
    }

    FORCEINLINE void Unlock()
    {
        ExportBarrier();
        LockSem = 0;
    }
};

An import barrier is required on line 18, when the lock is acquired, to make sure other threads don’t see upcoming changes before seeing that the data is locked. Also, an export barrier is required on line 26 to make sure that all pending memory operations complete before releasing the lock.

The ‘volatile’ C++ Keyword

A quick word on the volatile C++ keyword: according to Microsoft, volatile variables act as memory barriers: “A write to a volatile object (volatile write) has Release semantics; a read of a volatile object (volatile read) has Acquire semantics”.

Since this is Microsoft-specific, we cannot rely on this behavior on other platforms.

Conclusion

  • Always use the OS-provided synchronization primitives when locks are required.
  • The volatile keyword should only be used to prevent compiler optimizations, such as keeping a variable in a register while waiting for it to change.
  • Make sure you understand memory barriers whenever writing multi-core programs, especially lockless algorithms.
  • When in doubt, use a full barrier (fence).

Memory Management Part 4: Allocators Overhead, Waste and Fragmentation

January 13, 2012

Introduction

As seen in my previous post on allocation tracking, tracking memory allocations is an essential tool for any medium- to large-scale game engine. Every AAA game I’ve worked on over the last 15 years has used similar techniques to track and report memory allocations.

But tracking memory allocations alone is not enough. Under the hood, most allocators work by allocating memory pages (usually 4KB or 64KB pages, sometimes 1MB pages) from the OS and then splitting them into smaller chunks, based on the allocator’s rules and policies. Managing those pages and chunks requires some memory to be used by the allocator itself, resulting in a certain amount of memory not reported by our tracking system.

Let’s take a look at what types of memory a memory allocator can use.

Overhead

Most allocation schemes require a per-chunk header (usually 8 to 24 bytes) to track the size of the allocation and/or some flags and pointers to other chunks. These headers are usually placed at the beginning of the memory chunks, so the allocator can easily find them when a pointer is passed to free or realloc by going back a couple of bytes.

Some allocators, especially fixed-size memory allocators, will use a per-page header instead (usually at the beginning or the end of the page), reducing overhead. Given a particular pointer, they simply align the address down to the OS page size and get the header for that page, which contains the size of the chunks, the free-list pointer, etc.
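As a sketch of that lookup (the header layout, names and the 64KB page size are assumptions, not the exact code of any particular allocator):

#include <cstdint>

// Hypothetical per-page header stored at the start of every 64KB page.
struct PageHeader
{
    uint32_t ChunkSize;     // size of every chunk in this page
    void*    FirstFree;     // head of the page's free-list
};

static const uintptr_t kPageSize = 64 * 1024;

// Recover the page header (and thus the allocation size) from any chunk pointer
// by aligning the address down to the page boundary.
inline uint32_t GetAllocSize(void* Ptr)
{
    uintptr_t Page = reinterpret_cast<uintptr_t>(Ptr) & ~(kPageSize - 1);
    return reinterpret_cast<PageHeader*>(Page)->ChunkSize;
}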

Problematic Case
The overhead grows with the number of allocations. For example, if a subsystem does a lot of small allocations (especially if the allocation size is near or below the per-allocation overhead), the overhead can end up accounting for a huge percentage of the subsystem’s memory usage. Ideally, such subsystems should allocate bigger chunks of memory and split them instead.

Waste

Some space may be wasted in order to satisfy the requirements of the allocators.

Here are some examples:

  • Depending on the required alignment of the returned chunks, some space may be wasted at the beginning of the chunks to satisfy that alignment.
  • The allocator’s realloc policy may decide to ignore a reallocation entirely if the new size is smaller than the currently allocated chunk, for optimization reasons, based on some heuristics (if the new size is not smaller than half of the current size, for instance).
  • Fixed-size allocators will return blocks larger than requested. For example, an allocation of 18 bytes could return a 32-byte chunk, if that’s the smallest size that satisfies the request, resulting in 14 bytes of waste.

Problematic Case
Like overhead, doing a lot of small allocations can result in a lot of waste, especially when allocating blocks smaller than the allocator’s minimum chunk size.

Fragmentation

Inevitably, splitting the OS pages into smaller chunks will result in memory fragmentation. Let’s take a look at this memory report for a single 64KB page:

Each block represents a 256-byte chunk. Depending on the allocator, chunks can be split and coalesced on demand to provide smaller or bigger chunks of memory. As we can see, there are 133 free blocks (for a total of ~33KB free), but the largest contiguous block of free memory is only ~5KB (5376 bytes).

As a result, any allocation larger than 5376 bytes (minus the allocator’s required overhead) will fail in this page, even if there is approximately 33KB free. This is what we call fragmentation.

Here are some real examples that I came across during the development of the last Xbox 360 game I worked on.

Page Size Allocations
One of our subsystems was temporarily allocating a lot of 64KB blocks for disk streaming. At first, you might ask yourself why this causes fragmentation, since each allocation should use exactly one OS page. The answer depends on the underlying allocator; in our case, the allocator was adding a 24-byte header to every allocation, with the result that every allocation was in fact taking up two OS pages! The first page was completely used, while the second one had only 24 bytes used.

Between the time these pages were allocated and freed, the empty space in each of those nearly empty second pages was used by other allocations. Then, when the temporary buffers were released by the subsystem, those second pages couldn’t be freed since they still contained other allocations. The result was sometimes between 1 and 2MB of fragmented free memory trapped inside these pages.

The solution was to directly allocate OS pages using VirtualAlloc instead, without going through our game engine’s allocators. As a side note, VirtualAlloc was wrapped through a global function to allow tracking.

Dynamic Arrays
Depending on the dynamic array implementation, adding elements can result in a lot of reallocations. For example, if during the loading process you need to populate a list of all active lights, one might simply add a call to Level->ActiveLights.Add(this) to the light’s loading function. If you have thousands of lights, this means that the dynamic array will possibly need to reallocate itself hundreds of times during the process, leaving bigger and bigger holes scattered all over the OS pages.

A simple solution is to defer this step to the end of the loading process: count the number of active lights, resize your dynamic array once, then add all the active lights to it, as shown below. This can greatly reduce fragmentation.
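A sketch of this deferred approach, using std::vector in place of the engine's dynamic array (the Light type here is just a placeholder):

#include <vector>

struct Light { bool Active; bool IsActive() const { return Active; } };

void CollectActiveLights(const std::vector<Light*>& AllLights,
                         std::vector<Light*>& ActiveLights)
{
    // Count first, then size the array exactly once...
    size_t NumActive = 0;
    for(Light* L : AllLights)
        if(L->IsActive())
            ++NumActive;
    ActiveLights.reserve(NumActive);

    // ...so that adding the lights never triggers a reallocation.
    for(Light* L : AllLights)
        if(L->IsActive())
            ActiveLights.push_back(L);
}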

Threaded Allocations
During the loading process, I noticed that fragmentation was randomly growing around some parts of the code. After several failed attempts to identify the source, I decided to randomly break into the debugger around the time it was happening. I was lucky enough to find out what was the problem in a couple of tries.

The problem was that an external library was doing a lot of small allocations/deallocations in another thread. After some investigation, I figured out that this library was never allocating more than 256KB at the same time. Since it was heavily allocating and freeing blocks at the same time as the main thread was doing more “permanent” allocations, it was generating a huge amount of fragmentation (around ~2MB) of approximately 8-to-64-byte blocks scattered all over the OS pages!

The fix to this problem was quite easy. Since this library routed all of its allocations through an interface (as described in the first post), I simply routed them through a custom free-list that allocated 64KB OS pages when empty (using VirtualAlloc). This completely fixed the problem, since that library was now working on its own set of OS pages.

Temporary Allocations
Another big source of fragmentation was temporary allocations. The main problem was allocations performed during the loading of a level that were freed at the end of the loading process. Since these temporary allocations were scattered in the same OS pages as more “permanent” allocations, the fragmentation was quite big (between 10 and 30MB). Sometimes this meant that, even with ~30MB free, we couldn’t allocate 32KB of contiguous memory!

The solution to this problem was to provide several allocators:

  • permanent: allocations that will never be freed.
  • loading: allocations that will be freed after the loading of the level.
  • level: allocations that will be freed when unloading the level.
  • frame: allocations that will be freed at the end of the frame.

Implementing this solution can be a huge task (adding a parameter to all allocation methods), but it’s worth the effort if your engine suffers from fragmentation.
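Here's a sketch of what that extra parameter might look like; the enum mirrors the list above, and all names are hypothetical:

#include <cstddef>

// Hypothetical allocation lifetimes, one allocator per lifetime.
enum EAllocLifetime
{
    ALLOC_PERMANENT,    // never freed
    ALLOC_LOADING,      // freed after the level has loaded
    ALLOC_LEVEL,        // freed when the level is unloaded
    ALLOC_FRAME,        // freed at the end of the frame
    ALLOC_COUNT
};

// Each lifetime routes to its own heap, so destroying a heap releases (and can
// report) everything that was allocated with that lifetime.
void* EngineMalloc(size_t Size, EAllocLifetime Lifetime);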

When I implemented this solution, I discovered a wonderful side effect: it helped track memory leaks. Since each allocator was completely destroyed when its lifetime ended, any allocations still registered inside it could be reported using the tracking system described in my previous post on allocation tracking. For example, between levels, if there were still allocations in the level allocator, they were all reported along with the function name of the allocation. This provided a very efficient way to track memory leaks!

Reporting

Reporting overhead, waste and fragmentation can be tricky, as the best way to get that information is from the allocator itself. If all the allocators from a game engine are custom-made, this is relatively easy to compute, maintain and report.

On the other hand, when using 3rd party allocators like Doug Lea’s dlmalloc, this can be hard to achieve without deep knowledge of the allocator itself. Let’s take a look at the alternatives to keep track of these values without the help of the underlying allocators.

Overhead
Reporting the approximate overhead can be achieved by estimating the overhead per allocation and manually keeping track of it. For example, if you estimate that your 3rd-party allocator adds 24 bytes of overhead per allocation, increase the overhead counter by 24 for every allocation. This may not be 100% accurate, but it gives you an idea of whether you have too many small allocations.

Waste
Reporting accurate waste numbers can be achieved if the underlying allocator provides a function that returns the real size of an allocation. For example, if you ask for 18 bytes and the allocator’s GetAllocSize() function returns 32, your memory tracking system can record that 14 bytes are wasted there. This may or may not account for padding due to alignment, though.
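For example, a small tracking hook along these lines; GTotalOverhead, GTotalWaste and the RealSize parameter are hypothetical names (RealSize would come from a GetAllocSize()-style function), and the 24-byte header is just the estimate used above:

#include <cstddef>

// Hypothetical global counters, updated from the allocation tracking hooks.
static size_t GTotalOverhead = 0;
static size_t GTotalWaste    = 0;

void TrackAllocation(void* Ptr, size_t RequestedSize, size_t RealSize)
{
    GTotalOverhead += 24;                       // estimated per-allocation header
    GTotalWaste    += RealSize - RequestedSize; // e.g. 32 - 18 = 14 bytes wasted
}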

Fragmentation
Reporting the approximate fragmentation without the allocator’s help can be achieved by subtracting all known values from what the OS reports as the process memory usage. To achieve high accuracy though, you also need to take into account all thread stack sizes, the executable size, etc.

This gives us the amount of memory currently available to the allocators from all the allocated OS pages, but it doesn’t tell us the largest contiguous block available. The solution I used when generating memory reports was to start by trying to allocate the amount of memory left, then try smaller and smaller sizes until one worked. This is not the ideal solution, but it worked quite well.
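Here's a sketch of that probing approach; the 64KB step is arbitrary, and in the real reports the probe went through the engine's tracked allocators rather than plain malloc:

#include <cstdlib>

// Probe for the largest contiguous block by trying progressively smaller sizes.
size_t FindLargestAllocatableBlock(size_t FreeBytes)
{
    const size_t Step = 64 * 1024;
    for(size_t Size = FreeBytes; Size >= Step; Size -= Step)
    {
        void* Ptr = malloc(Size);
        if(Ptr)
        {
            free(Ptr);      // we only needed to know the allocation succeeded
            return Size;
        }
    }
    return 0;
}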

Conclusion

Managing memory in large-scale game engines can be a full-time job, especially when dealing with problems like overhead, waste and fragmentation. The key is to always keep an eye on it, by generating accurate reports and reporting/fixing problems as soon as possible.

Hope this post was helpful to someone!

Trigonometric Look-Up Tables Revisited

December 6, 2011

Introduction

In my spare time, I’m working on a 2D game that relies on a huge number of entities being updated and displayed every frame, at 60 FPS. This requires keeping a permanent eye on overall performance.

For profiling, I’m using an in-house profiler quite similar to what’s described here. Today, I realized that I was spending ~2.37ms per frame calling trigonometric functions, essentially sin and cos, to update the transformation matrices of the relevant entities (approximately 50K entities).

According to this document, the x86 instructions fsin and fcos take between 65 and 100 cycles to execute, so we just need to write some code that computes the sine function in less than that.

Solution

The solution that came to my mind was to use a look-up table (LUT) for sin and cos. I thought to myself that this was an old-school trick that isn’t used much anymore, but I decided to give it a try just to see.

Representing Angles
To avoid a float-to-integer conversion whenever I need to look up a sine, I quickly converted all the angles in the game to an unsigned 16-bit integer, where [0..65535] is mapped to [0..2pi[. The precision is quite good, and angles get naturally wrapped to [0..2pi[. Even when doing intermediate computations using larger integer types, we can simply AND with 65535 to remap the values into this range.
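For reference, converting a float angle in radians to this representation is a one-liner; a sketch, assuming 65536 units per full turn:

#include <cstdint>

// Map radians to the 16-bit angle format: 65536 units == 2*pi.
// The final conversion to uint16_t wraps the result into [0..65535], i.e. [0..2pi[.
inline uint16_t RadiansToAngle(float Radians)
{
    return (uint16_t)(int32_t)(Radians * (65536.0f / 6.2831853f));
}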

LUT Size
At first, I started with a 16384-entry LUT because I wanted to minimize errors. The sin function looked like this:

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);  // Map the angle to [0..16383]
    return SinTable[Index & 16383];   // Lookup angle
}

The results were amazing: instead of ~2.37ms per frame, I was down to ~0.09ms per frame. At this point, there was no turning back.

LUT Steps
I immediately saw a problem with this approach when running the game, though: as the LUT is 4 times smaller than the angle range, the resulting value only changes every 4 angle steps, creating a “staircase” effect (exaggerated here to better illustrate the problem):


(This graph was generated with the excellent Inigo Quilez’s Graph Toy).

The yellow line is the LUT sin(x), and the blue one is LUT sin(x+1).

Looking at this graph, I decided to add linear interpolation between Sin(x) and Sin(x+1):

FORCEINLINE tFloat Sin(tU16 Value)
{
    const tU16 Index = (Value >> 2);                  // Map the angle to [0..16383]
    const tFloat A = SinTable[Index & 16383];         // Lookup angle
    const tFloat B = SinTable[(Index + 1) & 16383];   // Lookup angle+1
    const tFloat Weight = (Value & 3) / 4.0f;         // Compute weight
    return Weight * (B - A) + A;                      // Linear interpolation
}

With linear interpolation, the game ran smoothly with no noticeable difference, but it approximately doubled the function execution time, which was now ~0.16ms per frame. Still good compared to the initial timings!

LUT Size Optimization
With interpolation enabled, I decided to run some tests to see how small the LUT could be without introducing too much error. Here are the results:

As we can see, the error is quite low even for a 64-entry LUT.

Further Optimizations

Here are some other optimizations that I didn’t investigate. They could be interesting on target platforms that have little memory, or if we want to minimize cache misses.

Using Half-Precision Floats
Using half-precision floats can reduce the table size by half without sacrificing too much performance, depending on the platform. In fact, this is true for any other type of memory-hungry data, like animations, etc.

Minimal LUT Encoding
It is possible to encode only the first quadrant in the LUT and adjust the angles accordingly, giving us a 4x saving, as described here.

Using Smooth Interpolation
Another solution that I didn’t investigate is to use other types of interpolation, like polynomial or spline interpolation. This could greatly reduce the table size, but would require a lot more cycles to execute.

Conclusion

Here’s the source code I used for this post, if anyone’s interested in using it:

template<tU32 SinTableSize> struct TrigLookup
{
    tFloat SinTable[SinTableSize];

    TrigLookup()
    {
        CHECK(IsPowerOfTwo(SinTableSize));

        for(tU32 i=0; i<SinTableSize; i++)
        {
            SinTable[i] = sin(((tFloat)i / SinTableSize) * Pi2);
        }
    }

    FORCEINLINE tFloat Lookup(tU16 Value)
    {
        const tU32 Divisor = (65536 / SinTableSize);
        const tU32 Index = Value / Divisor;
        const tFloat LUTSinA = SinTable[Index & (SinTableSize - 1)];
        const tFloat LUTSinB = SinTable[(Index + 1) & (SinTableSize - 1)];
        const tFloat LUTSinW = (Value & (Divisor - 1)) / (tFloat)Divisor;
        return LUTSinW * (LUTSinB - LUTSinA) + LUTSinA;
    }

    FORCEINLINE tFloat Sin(tU16 Value)
    {
        return Lookup(Value);
    }

    FORCEINLINE tFloat Cos(tU16 Value)
    {
        return Lookup(Value + 16384);
    }

    FORCEINLINE tFloat Tan(tU16 Value)
    {
        return Lookup(Value) / Lookup(Value + 16384);
    }
};

And here’s the assembly dump of the Lookup function, which takes roughly ~30 cycles on x86 and might be hand-optimized further, if someone is not as lazy as me:

0126FA53  movzx       ecx,si  
0126FA56  mov         eax,ecx  
0126FA58  shr         eax,8  
0126FA5B  mov         edx,eax  
0126FA5D  inc         eax  
0126FA5E  and         edx,0FFh  
0126FA64  and         eax,0FFh  
0126FA69  fld         dword ptr [esp+edx*4+40h]  
0126FA6D  fld         dword ptr [esp+eax*4+40h]  
0126FA71  fsub        st,st(1)  
0126FA73  movzx       eax,cl  
0126FA76  mov         dword ptr [esp+3Ch],eax  
0126FA7A  fild        dword ptr [esp+3Ch]  
0126FA7E  fmul        dword ptr [__real@3b800000 (12D4FC8h)]  
0126FA84  fmulp       st(1),st  
0126FA86  faddp       st(1),st  

Understanding Atomic Operations

November 30, 2011

Introduction

Atomic operations are the building blocks of synchronization primitives and non-blocking algorithms. They guarantee that modifying a memory location happens without interference from other threads. They are required whenever writing applications for multi-core architectures.

Consider this function:

void NonAtomicAND(tS32* Value, tS32 Op)
{
    *Value &= Op;
}

Which translates to this on x86:

mov eax, dword ptr [Value]  
mov ecx, dword ptr [eax]  
and ecx, dword ptr [Op]  
mov dword ptr [eax], ecx  

As you can see, we read the value into a register on line 2, perform the operation on line 3 and finally update the memory location with the new value on line 4. What would happen if another thread updated the same memory location while we’re holding the new value in a register? That update would be silently overwritten on line 4. We need a way to know if the memory location was modified between the time we performed the read and the write.

Hardware Implementations

Depending on the platform, there are two hardware implementations of atomic operations: Compare-And-Swap (CAS) on x86, and Load-Link/Store-Conditional (LL/SC) on Alpha, PowerPC, MIPS and ARM.

CAS

CAS compares a memory location with a given value, and if they are the same the new value is set. The return value is the value before the swap was attempted. That way, we can know if the memory location was written to between our read and write, and repeat the operation if it was.

Atomicity is guaranteed when used like this:

  1. Read the original value from a memory location.
  2. Compute the new value to be set.
  3. Set the new value only if the memory location is still the original value.

On x86, the lock instruction prefix makes some instructions (ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG) atomic. The CMPXCHG instruction is used to implement CAS.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using CAS:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 OldValue = *Value;
        const tS32 NewValue = OldValue & Op;

        // If the result is the original value, the new value was stored.
        if(CAS(Value, NewValue, OldValue) == OldValue)
        {
            return;
        }
    }
}

As you can see, we first read the original value and compute the new one. Then, we try to store the new value only if the current memory location is still the original one. If it changed, we need to repeat the operation until we succeed.

The ABA Problem
Be aware of the ABA problem though: between the time that you read the original value and try to swap it with the new one, it could have been changed to something else and back to the original value. In that case, that change will not be detected.

For simple operations, like the AtomicAND function, this isn’t a problem since the resulting value is still valid in the end. But when implementing lock-free algorithms such as queues or linked lists, this will cause unwanted behavior.

The usual solution to this problem is to append a counter to the value, incremented on each operation. That way, A-B-A becomes A1-B2-A3 and the change is properly detected. It may not be as easy as it looks though, and may require 64/128-bit CAS instructions, especially when working with pointers.
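Here's a sketch of the counter idea, packing a 32-bit value and a 32-bit version into a single 64-bit word so the CAS can tell A1-B2-A3 apart from an untouched value; it uses C++11 std::atomic rather than platform intrinsics:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> TaggedValue(0);   // low 32 bits: value, high 32 bits: version

void SetValue(uint32_t NewValue)
{
    uint64_t Old = TaggedValue.load();
    for(;;)
    {
        const uint32_t Version = (uint32_t)(Old >> 32) + 1;     // bump the version on every write
        const uint64_t New = ((uint64_t)Version << 32) | NewValue;

        // On failure, Old is reloaded with the current value and we retry.
        if(TaggedValue.compare_exchange_weak(Old, New))
            return;
    }
}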

LL/SC

LL/SC works differently. It is implemented using two instructions (lwarx (LL) / stwcx (SC) on PowerPC): LL loads and reserves the memory location, while SC stores the new value only if the reservation is still held. The reservation is lost whenever the memory location gets written to. For this reason, LL/SC does not suffer from the ABA problem.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using LL/SC:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 NewValue = __lwarx(Value) & Op;

        // If the reservation was still valid, new value was stored.
        if(__stwcx(Value, NewValue))
        {
            return;
        }
    }
}

Reservation Granularity
Depending on the architecture, the reservation is performed on aligned words or, in the worst case, on whole cache lines. On current-generation PowerPC consoles, the reservation granularity is the cache line, so special care must be taken to avoid false sharing, or the performance hit can be dramatic.

CAS Emulation
Implementing CAS using LL/SC instructions may be tempting in order to maintain platform-agnostic code. Doing so adds a compare and a branch though, which may or may not be optimized away depending on the compiler and optimization level.

Here’s the code for AtomicAND using the native LL/SC implementation:

loop:   lwarx   r6,0,r3          # Load and create reservation
        and     r4,r6,r5         # Compute the new value
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

And here’s the one using the CAS emulation:

loop:   lwz     r8,0(r3)         # Load the original value
        and     r4,r8,r5         # Compute the new value
        lwarx   r6,0,r3          # Load and create reservation
        cmpw    r8,r6            # CAS comparison
        bne     loop             # Retry if not equal
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

As we can see, CAS emulation is slower. Depending on the usage, using native LL/SC might help in some edge cases.

Performance

If thread contention is low, the compare loops should almost never loop. However, when used very concurrently by a high number of threads, the loop count can be quite high and can even lead to a livelock, where a thread is never able to set the new value (or takes a very long time to do so) because other threads keep modifying it at the same time. Fixing this particular issue might require algorithm-level refactoring.

Conclusion

In an upcoming post, I’ll talk about memory ordering issues, which is especially important when using atomic operations on some platforms.

Memory Management Part 3: Memory Allocators Design

October 22, 2011

Introduction
Writing a simple memory allocator is quite easy. Writing an efficient allocator that tries to minimize fragmentation, overhead and locking while trying to maximize locality of reference and performance is an extremely hard task, even for experienced programmers.

In this post, I’ll cover the basics of designing a simple memory allocator. The technique that will be used is called heap layers, and is described in the excellent paper “Composing High-Performance Memory Allocators”.

Essentially, what the authors propose is to create a set of elementary layers that perform simple operations on the memory they allocate, and to compose those layers together to create complex allocators. Instead of using virtual calls to forward calls to parent layers, they use C++ mixins, described in the “Mixin-Based Programming in C++” paper. Mixins are templated classes that can have any parent class, making it possible to combine layers and get inlining when calling parent functions.

For example, Doug Lea’s dlmalloc is a very popular allocator, widely used in various software. Understanding its 2000+ lines of code well enough to modify it is hard and error-prone. On the other hand, the authors of the heap layers paper reproduced dlmalloc’s behavior using a couple of layers, in around 500 lines of code, with similar performance and memory usage. The main advantage of this technique is that it makes it really easy to test changes to the allocator.

The allocator that will be described in this post will look like this:

typedef StrictSegHeap<
    10, 
    StrictSegHeapTraits,
    ThreadSafeHeap<PagedFreeList>,
    ThreadSafeHeap<CRTMallocHeap>
> GlobalHeap;

A strict segregated heap of 10 fixed-size paged free-lists for small allocations, using the CRT malloc for large allocations, with a lock per size class instead of a single lock to minimize locking. The layers provided here are for the purpose of the post, so readability and ease of understanding were traded against performance in most cases.

Let’s get started!

Basic Allocator
We start by writing a simple memory allocator that gets its memory from the CRT malloc:

class CRTMallocHeap
{
public:
    inline void* Malloc(tU32 Size)
    {
        return malloc(Size);
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        return realloc(Ptr, Size);
    }

    inline void Free(void* Ptr)
    {
        return free(Ptr);
    }
};

Then, we plug this into the 3 memory functions described in the previous posts (the GHeap calls below):

typedef CRTMallocHeap GlobalHeap;

GlobalHeap GHeap;

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Malloc(Size);
    RegisterAlloc(Result, Size, AllocType, Desc, File, Line);
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = GHeap.Realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    return GHeap.Free(Ptr);
}

We are now ready to implement our first memory allocator layer.

Free List Allocator
Remember the FreeList class described in the second post? It essentially allocates memory pages from the OS using VirtualAlloc and splits them into fixed-size chunks. Then, all allocations that fit into that chunk size can be served from the free-list extremely quickly.

Let’s write a heap layer for that:

class PagedFreeList
{
    tU8* FirstFree;

public:
    PagedFreeList()
    : FirstFree(NULL)
    {}

    inline void* Malloc(tU32 Size)
    {
        if(!FirstFree)
        {
            const tU32 PageSize = 65536;
            const tU32 NumAllocPerBatch = PageSize / Size;

            // Allocate a 64K page from the OS
            tU8* AllocBatch = (tU8*)VirtualAlloc(NULL, PageSize, MEM_COMMIT, PAGE_READWRITE);

            for(tU32 i=0; i<NumAllocPerBatch; i++)
            {
                Free(AllocBatch);
                AllocBatch += Size;
            }
        }

        tU8* Result = FirstFree;
        FirstFree = *((tU8**)FirstFree);
        return Result;
    }

    inline void Free(void* Ptr)
    {
        *(tU8**)Ptr = FirstFree;
        FirstFree = (tU8*)Ptr;
    }
};

Size Heap
Some heaps need to know the size of an allocation in order to route it to the right super-heap. A quick way to achieve that is to write a simple layer that allocates a small header along with each allocation and stores the size of the allocation in that header:

template<class SuperHeap> class SizeHeap: public SuperHeap
{
    struct FreeObject
    {
        tU32 Size;
    };

public:
    inline void* Malloc(tU32 Size)
    {
        FreeObject* Ptr = (FreeObject*)SuperHeap::Malloc(Size + sizeof(FreeObject));
        if(Ptr)
        {
            Ptr->Size = Size;
        }
        return (void*)(Ptr + 1);
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        FreeObject* NewPtr = (FreeObject*)SuperHeap::Realloc(Ptr ? (FreeObject*)Ptr - 1 : NULL, Size + sizeof(FreeObject));
        if(NewPtr)
        {
            NewPtr->Size = Size;
        }
        return (void*)(NewPtr + 1);
    }

    inline void Free(void* Ptr)
    {
        SuperHeap::Free((FreeObject*)Ptr - 1);
    }

    inline tU32 GetSize(void* Ptr)
    {
        return ((FreeObject*)Ptr - 1)->Size;
    }
};

This heap requires 4 bytes per allocation. Other schemes could be used to minimize this overhead; for example, the PagedFreeList could implement GetSize() by keeping track of the allocation size per page and returning this value, as a page always contains same-size allocations.

ThreadSafe Heap
If we want to add thread-safety to our layers, we simply need to implement a ThreadSafeHeap layer that provides Lock() and Unlock() functions, like this:

template<class SuperHeap> class ThreadSafeHeap: public SuperHeap
{
    volatile tS32 LockSem;

    inline void Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                MemoryBarrier();
                return;
            }

            YieldProcessor();
        }
    }

    inline void Unlock()
    {
        MemoryBarrier();
        LockSem = 0;
    }

public:
    ThreadSafeHeap()
    : LockSem(0)
    {}

    inline void* Malloc(tU32 Size)
    {
        Lock();
        void* Result = SuperHeap::Malloc(Size);
        Unlock();
        return Result;
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        Lock();
        void* Result = SuperHeap::Realloc(Ptr, Size);
        Unlock();
        return Result;
    }

    inline void Free(void* Ptr)
    {
        Lock();
        SuperHeap::Free(Ptr);
        Unlock();
    }
};

Then, adding this layer on top of another one makes it thread-safe. Simply rearranging where we place this layer can change the design from a slow, single global lock:

typedef ThreadSafeHeap<
    StrictSegHeap<
        10,
        StrictSegHeapTraits,
        PagedFreeList,
        CRTMallocHeap
    >
> GlobalHeap;

to a lock per-layer:

typedef StrictSegHeap<
    10, 
    StrictSegHeapTraits,
    ThreadSafeHeap<PagedFreeList>,
    ThreadSafeHeap<CRTMallocHeap>
> GlobalHeap;

Using a single lock will block whenever two or more threads use the allocator at the same time. On the other hand, using a lock per layer as shown in the last code sample means threads only contend when two or more of them allocate in the same size range at the same time.

Strict Segregated Heap
The last layer that will be shown is a strict segregated heap: a layer that separates allocations by size using sub-heaps. This layer is strict because it doesn’t perform block splitting or coalescing (as dlmalloc does, for instance).

First, we need to define how the sizes will be separated. For this example, I chose simple powers of two between 8 and 4096:

1..8
9..16
17..32
33..64
65..128
129..256
257..512
513..1024
1025..2048
2049..4096

Which translates into code like this:

struct StrictSegHeapTraits
{
    // Return the bin that this size falls into
    static inline tU32 GetSizeClass(tU32 Size)
    {
        tU32 c = 0;
        Size--;
        while(Size > 7)
        {
            Size >>= 1;
            c++;
        }
        return c;
    }

    // Return the size of allocations in this bin
    static inline tU32 GetClassMaxSize(tU32 Bin)
    {
        return 1 << (Bin + 3);
    }
};

And the strict segregated heap looks like this:

template<tU32 NumBins, class Traits, class LittleHeap, class BigHeap> class StrictSegHeap: public BigHeap
{
    // The array of little heaps (bins), one for each allocation size
    LittleHeap LittleHeaps[NumBins];

    // Return the bin that this allocation falls into
    inline tU32 InnerGetSizeClass(tU32 Size)
    {
        if(Size > Traits::GetClassMaxSize(NumBins - 1))
        {
            return NumBins;
        }
        return Traits::GetSizeClass(Size);
    }

public:
    StrictSegHeap()
    {}

    inline void* Malloc(tU32 Size)
    {
        tU32 SizeClass = InnerGetSizeClass(Size);

        if(SizeClass >= NumBins)
        {
            // Allocation is too big, route to big heap
            return BigHeap::Malloc(Size);
        }

        tU32 MaxSizeClass = Traits::GetClassMaxSize(SizeClass);
        return LittleHeaps[SizeClass].Malloc(MaxSizeClass);
    }

    inline void* Realloc(void* Ptr, tU32 Size)
    {
        if(Ptr)
        {
            if(Size)
            {
                tU32 ObjectSize = GetSize(Ptr);
                tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);

                if(ObjectSizeClass >= NumBins)
                {
                    return BigHeap::Realloc(Ptr, Size);
                }

                // Loose reallocation: only realloc if bigger or at least twice smaller
                if((Size > ObjectSize) || (Size < ObjectSize / 2))
                {
                    void* NewPtr = Malloc(Size);

                    if(NewPtr)
                    {
                        tU32 CopySize = Min(ObjectSize, Size);
                        memcpy(NewPtr, Ptr, CopySize);
                    }

                    LittleHeaps[ObjectSizeClass].Free(Ptr);

                    return NewPtr;
                }

                return Ptr;
            }

            Free(Ptr);
            return NULL;
        }

        return Malloc(Size);
    }

    inline void Free(void* Ptr)
    {
        if(Ptr)
        {
            tU32 ObjectSize = GetSize(Ptr);
            tU32 ObjectSizeClass = InnerGetSizeClass(ObjectSize);

            if(ObjectSizeClass >= NumBins)
            {
                BigHeap::Free(Ptr);
            }
            else
            {
                LittleHeaps[ObjectSizeClass].Free(Ptr);
            }
        }
    }
};

This heap layer routes small allocations into their corresponding sub-heaps and big allocations into the big heap. As we can see, the code relies on the underlying heaps providing a GetSize() function so that Realloc and Free can determine the size of incoming allocations; if it isn’t provided, the code won’t compile.
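
To make the routing concrete, here is a hypothetical usage of the composed allocator (assuming the lock-per-layer GlobalHeap typedef above):

GlobalHeap Heap;

// 100 bytes falls into the 65..128 bin, so 128 bytes are actually allocated
// from the corresponding ThreadSafeHeap<PagedFreeList> sub-heap
void* Small = Heap.Malloc(100);

// 10000 bytes is larger than the biggest bin (4096), so the allocation is
// routed to the big heap (ThreadSafeHeap<CRTMallocHeap>)
void* Big = Heap.Malloc(10000);

Heap.Free(Small);
Heap.Free(Big);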

Conclusion
Using the heap layers technique, we can now implement and test new layers quite easily and compose them into complex allocators. There are a lot of utility layers that can be written; for example, a debug layer that keeps a header and a footer for each allocation to detect memory overwrites. Even the memory tracking system described in the second post could be written as a layer. Make sure to read the provided papers if you’ve read this far!

Using the Moving Average Filter for Profiling Stats

October 19, 2011 Leave a comment

Introduction
When displaying profiling stats, FPS counters or any value that varies a lot from frame to frame, you don’t want to show the raw values, since they can fluctuate too much and make the stats unreadable.

Displaying an average over a certain period of time is a good idea. But how to compute that average?

Simple Moving Average
A common solution is to keep the last n values and compute the average of these n values to display. This is called a simple moving average and is computed like this:

// RecentValues holds the last n recorded values
tFloat MeanValue = 0.0f;
for(tU32 i=0; i<n; i++)
    MeanValue += RecentValues[i];
MeanValue /= n;

It works fine in some scenarios, like displaying an FPS counter. On the other hand, it’s not very appropriate when a lot of values need to be computed, as for profiling stats, since the amount of stored data can become large and the time spent updating the averages can become significant.
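
For reference, a minimal circular-buffer version could look like this (a sketch; the struct name and the fixed window size n are assumptions):

// Minimal simple moving average over the last n samples (sketch)
template<tU32 n> struct SimpleMovingAverage
{
    tFloat RecentValues[n];   // Circular buffer holding the last n samples
    tU32 NextIndex;
    tU32 NumValues;

    SimpleMovingAverage()
    : NextIndex(0)
    , NumValues(0)
    {}

    inline tFloat Update(tFloat CurrentValue)
    {
        // Overwrite the oldest sample
        RecentValues[NextIndex] = CurrentValue;
        NextIndex = (NextIndex + 1) % n;
        if(NumValues < n)
            NumValues++;

        // Recompute the mean over the stored samples
        tFloat MeanValue = 0.0f;
        for(tU32 i = 0; i < NumValues; i++)
            MeanValue += RecentValues[i];
        return MeanValue / NumValues;
    }
};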

Exponential Moving Average
Another solution that I’ve been using for years now is the exponential moving average. Essentially, it moves the current mean value toward the actual value by a certain percentage every iteration. It’s computed like this:

MeanValue = MeanValue * 0.99f + CurrentValue * 0.01f;

The advantage of this solution is that you don’t need to keep the last n values, as opposed to the simple moving average.
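
As a sketch, the same idea can be wrapped with a configurable smoothing factor (the 0.01f weight above); the struct and member names are assumptions:

// Exponential moving average with a configurable smoothing factor (sketch)
struct ExponentialMovingAverage
{
    tFloat MeanValue;
    tFloat Alpha;         // Weight of the newest sample, e.g. 0.01f
    bool Initialized;

    ExponentialMovingAverage(tFloat InAlpha = 0.01f)
    : MeanValue(0.0f)
    , Alpha(InAlpha)
    , Initialized(false)
    {}

    inline tFloat Update(tFloat CurrentValue)
    {
        // Starting from the first sample avoids the slow ramp-up visible
        // at the beginning of the graph below
        if(!Initialized)
        {
            MeanValue = CurrentValue;
            Initialized = true;
            return MeanValue;
        }

        MeanValue = MeanValue * (1.0f - Alpha) + CurrentValue * Alpha;
        return MeanValue;
    }
};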

Comparison
Here’s a graph comparing the two solutions:

[Graph comparing the simple and exponential moving averages on the same input data]

As you can see at the beginning of the graph, the EMA takes some time to react to big changes. This is easily fixed by initializing the mean value to the first sample, but it wasn’t done here in order to show how the filter reacts to big changes in the source data.

Conclusion
Compared to the SMA, the EMA stays closer to the real averaged value, shows fewer variations and uses less memory and CPU, which makes it a good choice for computing and displaying averaged values in a profiling system.


Hashing Strings and Pointers – Avoiding Common Pitfalls

October 12, 2011 7 comments

Introduction
In a previous post, I suggested using fixed-size hash tables to map strings and pointers. Since this code really needs to be fast, reducing the probability of hash collisions is very important to maximize the hash table’s performance. But there are other pitfalls that need to be addressed too.

Hashing Strings
The most common error when hashing strings is to perform a full string comparison (using strcmp) on every node while traversing the hash bin list. Even if the hash function is quite bad and has a high probability of generating the same hash value for different strings, this is not required at all.

For my tests, I hashed the complete dictionary found in /usr/dict/words, which contained 212710 unique words. Here are some of the results:

hairbrush hash_value=0x7f3ef2bb hash_bin=0x32bb
encomiums hash_value=0xa758b2bb hash_bin=0x32bb
engrained hash_value=0x32abb2bb hash_bin=0x32bb
   seavir hash_value=0x32abb2bb hash_bin=0x32bb

As you can see, all of these strings ended up in the same hash bin 0x32bb, but have different hash values (except for ‘engrained’ and ‘seavir’).

In the hash table, it looks like this:

[0x32ba] phrenologically(0x888fb2ba)->Alfreda(0xee9432ba)->...
[0x32bb] encomiums(0xa758b2bb)->engrained(0x32abb2bb)->seavir(0x32abb2bb)->...
[0x32bc] centillion(0xb44232bc)->cyclometer(0xc5a172bc)->...

As an optimization, we can compare only the original hash value (i.e. the value before it was reduced to fit the hash table size) until we find a match. Only then do we need to do a string comparison to make sure the strings really match. This greatly improves the hash table’s performance and can be used with all non-POD data types.
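
In code, the bin traversal could look like the following sketch (StringNode, Find and the table layout are hypothetical; Strhash is the string hash function used throughout this post):

#include <string.h>   // strcmp

// Hypothetical node layout for a chained, fixed-size string hash table
struct StringNode
{
    tU32 HashValue;        // Full 32-bit hash value, before masking
    const tChar* String;
    StringNode* Next;
};

StringNode* Find(StringNode** Table, tU32 TableSize, const tChar* Str)
{
    tU32 HashValue = Strhash(Str);             // Full hash value
    tU32 Bin = HashValue & (TableSize - 1);    // TableSize is a power of two

    for(StringNode* Node = Table[Bin]; Node; Node = Node->Next)
    {
        // Cheap comparison first: rejects almost every non-matching node
        if(Node->HashValue == HashValue)
        {
            // Only now pay for the full string comparison
            if(!strcmp(Node->String, Str))
                return Node;
        }
    }

    return NULL;
}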

Attentive readers might have noticed that, in my previous post on memory tracking, I totally skipped the string comparison when hashing the allocation tags. I decided to trade possible errors due to hash value collisions for execution speed, simply because the hash function is so good that I never hit a single collision.

The hash function I’m using was published in Dr. Dobb’s Algorithm Alley by Bob Jenkins, in September 1997. I’ve been using this function in several games over the years and never hit a single hash collision, sometimes hashing hundreds of thousands of unique strings.

Hashing Pointers
When hashing pointers, special care must be taken due to alignment. It is tempting to simply use the pointer itself to index into the hash table, but this is a very bad idea, as pointers are usually aligned on 4, 8, 16 bytes or even more. Alignment forces the lowest bits to 0 and, depending on the hash table size, may lead to a non-uniform distribution, reducing the hash table to a simple linked list in the worst case.

Over the years, I’ve been using Thomas Wang’s hash function to hash pointers on several platforms and it worked very well in all cases:

tU32 HashPointer(void* Ptr)
{
    // Note: on 64-bit platforms this cast discards the upper pointer bits;
    // fold the high and low halves together first if that matters
    tU32 Value = (tU32)Ptr;
    Value = ~Value + (Value << 15);
    Value = Value ^ (Value >> 12);
    Value = Value + (Value << 2);
    Value = Value ^ (Value >> 4);
    Value = Value * 2057;
    Value = Value ^ (Value >> 16);
    return Value;
}
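
A typical usage then masks the hash with a power-of-two table size (GetPointerBin and TABLE_SIZE are hypothetical names):

// Hypothetical usage: map a pointer to a bin in a power-of-two sized table
const tU32 TABLE_SIZE = 4096;   // Must be a power of two

inline tU32 GetPointerBin(void* Ptr)
{
    // Masking (tU32)Ptr directly would only ever hit a fraction of the bins
    // for aligned pointers; hashing first spreads the alignment bits around
    return HashPointer(Ptr) & (TABLE_SIZE - 1);
}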

Conclusion
Always test your hash functions to make sure distribution is uniform, especially for complex types (strings, GUIDs, etc). Pay special attention to types that may vary on different platforms (pointers, etc). Also, whenever possible, prefer a fixed-size hash table that fits the data you’re hashing, as opposed to using a templated, varying size one that may use more memory and won’t allow custom traversal comparison code.


Memory Management Part 2: Allocations Tracking

October 6, 2011 7 comments

Introduction
In this post, I’ll cover the basics of memory allocation tracking. If you haven’t read my previous post on allocation strategy, make sure to read it before continuing.

Here’s what we want to achieve:

  • Keep track of every single allocation in the game with its allocation source (file, line, etc).
  • Categorize every allocation into high-level categories, such as Textures, Meshes, AI, etc.
  • Track memory leaks and generate categorized memory usage reports.
  • Minimize the impacts on the game’s performance, memory usage and fragmentation.

Minimizing Impacts
Keeping track of all allocations in a game engine might use a large amount of memory by itself; when tracking memory for usage reports or leaks, this may not be a problem. On the other hand, when using the system to gather information on fragmentation, it is crucial that it has a minimum impact on memory usage, or it may alter the results.

How to achieve that? The obvious answer is that the tracking system should not allocate its memory from the game engine’s allocators. Here we introduce the FreeList class:

template<class T> class FreeList
{
    T* Free;

public:
    FreeList()
    : Free(NULL)
    {}

    inline T* New()
    {
        if(!Free)
        {
            const tU32 AllocSize = 65536;
            const tU32 NumAllocPerBatch = AllocSize / sizeof(T);

            T* AllocBatch = (T*)VirtualAlloc(NULL, AllocSize, MEM_COMMIT, PAGE_READWRITE);

            for(tU32 i=0; i<NumAllocPerBatch; i++)
            {
                Delete(AllocBatch++);
            }
        }

        T* Result = Free;
        Free = *((T**)Free);
        return Result;
    }

    inline void Delete(T* Ptr)
    {
        *(T**)Ptr = Free;
        Free = Ptr;
    }
};

A free list is a very important concept when designing memory allocators. The particularity of this implementation is that it allocates 64KB pages directly from the OS whenever it is empty. It then uses the first bytes of each free element in the page to link to the other free elements.

The fact that it allocates directly from the OS ensures that it doesn’t interfere with normal memory allocations in the engine, and it minimizes fragmentation since the elements are contiguous within OS pages.

In heavy usage scenarios, a FreeList will link free elements across multiple pages, but this still shouldn’t affect fragmentation, since it doesn’t share OS pages with the game engine’s allocators.
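
Usage is straightforward; for example (TrackingNode and FreeNodes are hypothetical names):

// Hypothetical usage: tracking structures come from a dedicated free list
// instead of the engine's allocators
struct TrackingNode
{
    TrackingNode* Next;
    tU32 Payload;
};

FreeList<TrackingNode> FreeNodes;

void Example()
{
    TrackingNode* Node = FreeNodes.New();   // Grabs a fresh 64KB page from the OS when empty
    Node->Payload = 42;
    FreeNodes.Delete(Node);                 // Links the element back into the free list
}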

Tracking Allocations Hook
Now we’re ready to track allocations. The first step is to “hijack” the 3 memory functions we defined in the first part, so that every allocation is registered, updated or unregistered as it happens:

void* _Malloc(tU32 Size, tU32 AllocType, const tChar* Desc, const tChar* File, tU32 Line)
{
    void* Result = malloc(Size);
    RegisterAlloc(Result, Size, Desc, File, Line);  // AllocType isn't used by this simplified tracker
    return Result;
}

void* _Realloc(void* Ptr, tU32 Size, const tChar* File, tU32 Line)
{
    void* Result = realloc(Ptr, Size);
    UpdateAlloc(Ptr, Result, Size, File, Line);
    return Result;
}

void _Free(void* Ptr)
{
    UnregisterAlloc(Ptr);
    return free(Ptr);
}
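
These entry points are presumably wrapped by macros that capture the call site, along these lines (the macro names here are hypothetical; the real ones were defined in the first part):

// Hypothetical wrapper macros; the actual macros from the first part may differ
#define ENGINE_MALLOC(Size, Type, Desc)   _Malloc(Size, Type, Desc, __FILE__, __LINE__)
#define ENGINE_REALLOC(Ptr, Size)         _Realloc(Ptr, Size, __FILE__, __LINE__)
#define ENGINE_FREE(Ptr)                  _Free(Ptr)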

Implementation
Next, we need to implement RegisterAlloc, UpdateAlloc and UnregisterAlloc. The first step is to define two structures: AllocDesc, which holds the information on a single allocation, and TagDesc, which holds a category’s information, such as its name (the Desc string provided with each allocation call) and the total allocated size:

struct TagDesc;       // Forward declaration, defined below

struct AllocDesc
{
    void* Ptr;          // Allocation address
    tU32 Size;          // Allocation size
    const tChar* File;  // Source file of the allocation
    tU32 Line;          // Source line of the allocation
    TagDesc* Tag;       // Pointer to the allocation category
    AllocDesc* Next;    // Next allocation in the hashtable
};

struct TagDesc
{
    tU32 CRC;           // Category description string CRC
    tU32 Size;          // Total size of all allocations in this category
    tChar Tag[128];     // Category name
    TagDesc* Next;      // Next category in the hashtable

    static TagDesc* Register(const tChar* Desc, tU32 Size);
};
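
The code below also assumes a few pieces of bookkeeping that aren’t shown explicitly: a fixed-size hash table of allocations, a separate table for the categories, and the free lists used to allocate the descriptors. A minimal sketch of these assumptions:

// Assumed bookkeeping (not shown in the post). Both tables are fixed-size
// and power-of-two sized so hashes can be masked with (ALLOC_DESC_SIZE - 1);
// 65536 bins is an arbitrary choice here.
static const tU32 ALLOC_DESC_SIZE = 65536;

static AllocDesc* Hash[ALLOC_DESC_SIZE];    // Per-allocation table used by RegisterAlloc
static FreeList<AllocDesc> FreeAllocs;      // Descriptors never touch the engine's heaps
static FreeList<TagDesc> FreeTags;

// The unqualified Hash used inside TagDesc::Register is assumed to be a
// separate table owned by TagDesc, e.g. a static member:
//     static TagDesc* Hash[ALLOC_DESC_SIZE];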

The steps for registering an allocation look like this:

  1. Allocate a new AllocDesc from a FreeList<AllocDesc> and fill in the allocation info.
  2. Register this new allocation into its category.
  3. Link the new AllocDesc into the fixed-size hashtable.
void RegisterAlloc(void* Ptr, tU32 Size, const tChar* Desc, const tChar* File, tU32 Line)
{
    // Allocate a new AllocDesc from the freelist
    AllocDesc* NewAllocDesc = FreeAllocs.New();

    // Set the allocation info
    NewAllocDesc->Ptr = Ptr;
    NewAllocDesc->Size = Size;
    NewAllocDesc->File = File;
    NewAllocDesc->Line = Line;

    // Register the allocation into its category (this updates the category total size)
    NewAllocDesc->Tag = TagDesc::Register(Desc, Size);

    // Register the allocation in the fixed-size hashtable
    tU32 PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);
    NewAllocDesc->Next = Hash[PtrHash];
    Hash[PtrHash] = NewAllocDesc;
}

Registering a new category goes like this:

  1. Look up the category in the hashtable.
  2. If it already exists, increment the category’s size by the allocation size and return it.
  3. If it doesn’t exist, allocate a new one from a FreeList<TagDesc>.
  4. Link the new TagDesc into the hashtable.
TagDesc* TagDesc::Register(const tChar* Tag, tU32 Size)
{
    tU32 TagCRC = Strhash(Tag);
    tU32 TagHash = TagCRC & (ALLOC_DESC_SIZE - 1);
    TagDesc* CurTag = Hash[TagHash];

    while(CurTag)
    {
        if(CurTag->CRC == TagCRC)
        {
            // This category already exists, update its size and return it
            CurTag->Size += Size;
            return CurTag;
        }

        CurTag = CurTag->Next;
    }

    // Allocate a new TagDesc from the freelist
    CurTag = FreeTags.New();

    // Set the category info
    CurTag->CRC = TagCRC;
    CurTag->Size = Size;

    tU32 TagLen = (tU32)Min(strlen(Tag), (size_t)127);
    memcpy(CurTag->Tag, Tag, TagLen);
    CurTag->Tag[TagLen] = 0;

    // Register the category in the fixed-size hashtable
    CurTag->Next = Hash[TagHash];
    Hash[TagHash] = CurTag;

    return CurTag;
}

UpdateAlloc and UnregisterAlloc are straightforward. Simply update the allocation and category accordingly.
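
For completeness, here is what UnregisterAlloc could look like, as a sketch under the assumptions above:

// Sketch of UnregisterAlloc: find the allocation in the hash table,
// give its size back to the category, unlink it and recycle the descriptor
void UnregisterAlloc(void* Ptr)
{
    tU32 PtrHash = GetPtrHash(Ptr) & (ALLOC_DESC_SIZE - 1);

    AllocDesc** Prev = &Hash[PtrHash];
    AllocDesc* Cur = Hash[PtrHash];

    while(Cur)
    {
        if(Cur->Ptr == Ptr)
        {
            // Remove the allocation's size from its category
            Cur->Tag->Size -= Cur->Size;

            // Unlink from the bin list and return the descriptor to the freelist
            *Prev = Cur->Next;
            FreeAllocs.Delete(Cur);
            return;
        }

        Prev = &Cur->Next;
        Cur = Cur->Next;
    }
}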

Analyzing the Data
Then comes the fun part: dumping and analyzing the gathered information. I usually dump all the data in a comma-separated values (CSV) format, because I tend to use Excel to analyze it:

void DumpAllocs()
{
    printf("Address,Category,CategorySize,AllocSize\n");
    for(tU32 i=0; i<ALLOC_DESC_SIZE; i++)
    {
        AllocDesc* Alloc = Hash[i];
        while(Alloc)
        {
            printf("0x%08x,%s,%d,%d\n", Alloc->Ptr, Alloc->Tag->Tag, Alloc->Tag->Size, Alloc->Size);
            Alloc = Alloc->Next;
        }
    }
}

Which looks like this when turned into a quick report in Excel:

  Meshes    11336K
  Sounds    41553K
  Textures   7867K

Tracking memory leaks
With this memory tracking up and running, it’s easy to set up a quick memory leak detection system by dumping all allocations at a certain point in the game, then dumping another snapshot later on. By comparing the two snapshots, we can quickly show the deltas for each category and even for individual allocations. The tricky part is to analyze the data correctly and identify what is normal growth and what are real memory leaks.
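
As a sketch, a category-level snapshot is often enough for a first pass (this assumes the category table is reachable, e.g. through the static TagDesc::Hash member sketched earlier):

// Sketch of a category-level snapshot; diff two of these taken at the same
// point in the game (e.g. right after a level load) to spot growing categories
struct TagSnapshot
{
    tU32 CRC;    // Category identifier
    tU32 Size;   // Total size at snapshot time
};

tU32 SnapshotTags(TagSnapshot* Out, tU32 MaxTags)
{
    tU32 Count = 0;
    for(tU32 i = 0; i < ALLOC_DESC_SIZE && Count < MaxTags; i++)
    {
        for(TagDesc* Tag = TagDesc::Hash[i]; Tag && Count < MaxTags; Tag = Tag->Next)
        {
            Out[Count].CRC = Tag->CRC;
            Out[Count].Size = Tag->Size;
            Count++;
        }
    }
    return Count;
}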

Conclusion
This essentially covers the basics of memory allocation tracking. A lot of advanced features can be added, like capturing the call stack for each allocation, on-screen or even over-the-network memory visualization, etc. Simply implement what suits your needs best!