Archive

Posts Tagged ‘Threading’

Optimizing the recursive read-write spinlock

January 12, 2014 7 comments

In my previous post on implementing a recursive read-write spinlock, I briefly talked about trivial optimizations that could be done to improve the time spent in the lock functions.

In this post, we’ll measure these optimizations, propose more optimizations to the read-write spinlock and compare all optimizations results.

Stress tests
The numbers that will be given in this post comes from two stress-test applications that were written to compare the different optimizations. Essentially, the tests consist of spawning some threads and executing huge amount of locks in parallel: a read test that only test the reader locks without any writers involved, and a read-write test that tests reader and writer locks simultaneously.

Measuring optimizations from the first post

The cost of atomic operations
The first optimization that was done in the first post was to reduce the number of atomic operations by busy-waiting while the writer bit was set and only attempt an atomic operation when required.

The measured speedup was ~34% faster in the read-write test. We can clearly see that atomic operations aren’t free.

Busy-waiting
The second optimization that was done in the first post was to yield the processor while waiting.

The measured speedup was ~5% faster in the read-write test; since the test was using the same number of threads than logical cores, the speedup isn’t spectacular. On the other hand, applications with a lot more threads than logical processors would observe a higher performance gain, so this optimization is still valid for all cases.

Readers contention

A thing that I noticed after writing the spinlock post was that it is suffering from readers contention, due to the CAS operations.

Consider this simple example:

Thread-1               Thread-2               Thread-3
1: Read lock value     1: Read lock value     1: Read lock value
2: Increment readers   2: Increment readers   2: Increment readers
3: CAS (succeeds)      3: CAS (fails)         3: CAS (fails)
4: ...                 4: Read lock value     4: Read lock value
5: ...                 5: Increment readers   5: Increment readers
6: ...                 6: CAS (succeeds)      6: CAS (fails)
7: ...                 7: ...                 7: Read lock value
8: ...                 8: ...                 8: Increment readers
9: ...                 9: ...                 9: CAS (succeeds)

This is technically not contention, as each thread is making progress, but it goes against the original purpose of the read-write spinlock mentionned in the first post: must allow any number of readers to acquire the lock without contention.

The key here to solve this problem is to avoid CAS loops by using other types of atomic instructions.

New implementation

In the new implementation, I’ll solely use atomic increments, decrements, adds and subtracts. This will completly remove the CAS contention.

The spinlock will still be implemented using a 32-bits value, but more bits will need to be reserved for the writer (we’ll see why shortly):

|WWWWWWWWWWWW|RRRRRRRRRRRRRRRRRRRR|

The writer bits are still carefully placed on the most significant bits, as described in the previous post.

Acquiring a reader lock

To acquire the reader lock, we atomically increment the lock value and inspect the return value (line 8): if any of writer bits are set, it means that a writer held the lock before us and we failed. In that case, we atomically decrement the lock value and try again (line 11). Otherwise, we acquired the reader lock successfuly.

FORCEINLINE void LockReader()
{
    for(;;)
    {
        // Wait for active writer to release the lock
        while(Lock & 0xfff00000)
            kYieldProcessor();

        if((kInterlockedIncrement(&Lock) & 0xfff00000) == 0)
            return;

        kInterlockedDecrement(&Lock);
    }
}

Releasing a reader lock

Releasing a reader lock with the new implementation can be achieved by atomically decrementing the lock value (same code as previous implementation):

FORCEINLINE void UnlockReader()
{
    kInterlockedDecrement(&Lock);
}

Acquiring the writer lock

To acquire the writer lock, we atomically increment the numbers of writers (using an atomic add) and inspect the return value (line 9): if only the first writer bit was set during this operation, we acquired the writer lock successfuly and wait for any readers to end (line 12). Otherwise, another writer held the lock before us and we failed. In that case, we atomically decrement the number of writers (using an atomic subtract) and try again (line 18).

FORCEINLINE void LockWriter()
{
    for(;;)
    {
        // Wait for active writer to release the lock
        while(Lock & 0xfff00000)
            kYieldProcessor();

        if((kInterlockedAdd(&Lock, 0x100000) & 0xfff00000) == 0x100000)
        {
            // Wait until there's no more readers
            while(Lock & 0x000fffff)
                kYieldProcessor();

            return;
        }

        kInterlockedSub(&Lock, 0x100000);
    }
}

This is why the lock value now needs more bits for the writers part: when several writers try to acquire the lock, the value might get incremented up to at least the number of active threads. When the value isn’t exactly one, the writer failed to acquire the lock and decrement the value before trying again.

In this example, 12 bits were reserved for to writers, allowing up to 4096 concurrent writer threads. This value should be tweaked depending on the application.

Releasing the writer lock

Releasing a writer lock is done by atomically decrementing the number of writers, using an atomic subtract:

FORCEINLINE void UnlockWriter()
{
    kInterlockedSub(&Lock, 0x100000);
}

Optimization results

The measured speedup for the new implementation in the read-write test was ~62%. Remember that this test was only acquiring the reader lock, with no writers involved. This is quite an improvement!

Overall results comparison

Here’s the results of all optimizations described in this post:

  • SpinLock is a standard spinlock.
  • RWLock is the read-write spinlock from the first post, without any optimization.
  • RWLock_1 adds the atomic operations reduction optimization.
  • RWLock_2 adds the yield processor optimization.
  • RWLock_3 is the new implementation, with all optimizations.

Conclusion

Some things to keep in mind:

  • Atomic operations aren’t free; in fact, they cost a lot and should be avoided whenever possible.
  • When writing lockfree algorithms using CAS loops, think twice and ask yourself if this could be done using other atomic operations.

Implementing a recursive read-write spinlock

January 3, 2014 8 comments

Very often I see programmers using a critical section to make some code thread-safe. What happens is that most of the time, the protected code will only read data, while blocking other reading threads and lead to poor performances.

Most of the time, a read-write lock can be used to only block readers when there’s a write operation in progress, greatly reducing the time spent waiting for other threads to release the lock.

Why a spinlock?
In a previous post, I spoke about the dangers of busy-waiting, especially when implementing our own spinlocks on OSs that doesn’t support priority inversion protection mechanisms.

However, writing our own spinlocks implementation can have several advantages, the main one being pedagogic. Other advantages of spinlocks is that they are lightweight and doesn’t trigger a context switch when waiting.

Requirements

A read-write spinlock have the following requirements:

  1. Must allow any number of readers to acquire the lock without contention.
  2. Must allow only a single writer to acquire the lock while:
    • Block future readers.
    • Wait until active readers release their locks.

Allowing a writer to acquire the lock while there’s still active readers is important, as it prevents from write-starvation; that is, if the lock was designed to wait until there would be no active readers before allowing a writer to acquire the lock, it could happen that a writer would never be able to acquire the lock, leading to write-starvation.

Implementation

Our spinlock will be implemented using a 32-bits value. Because only a single writer can acquire the lock, we’ll reserve a single bit to know if the lock is acquired by a writer and the 31 other bits to count the number of active readers (allowing 2^31 readers):

|W|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|

The writer bit is carefully placed on the most significant bit, we’ll see why later on.

The implementation will use the compare-and-swap (CAS) atomic operation. If you’re not sure what this operation does, I suggest that you take some time to read my previous post on atomic operations before continuing.

RWSpinLock definition

As described in the previous section, the spinlock contains a single 32-bits value, initialized to zero:

class RWSpinLock
{
    volatile tU32 Lock;

public:
    RWSpinLock()
    : Lock(0)
    {}
};

Acquiring a reader lock

Acquiring the lock for a reader goes like this:

  1. Read the lock value and remove the writer bit (line 5).
  2. Compute the new lock value by incrementing the number of readers (line 6).
  3. Try to atomically exchange the new value with the old one (line 8).
FORCEINLINE void LockReader()
{
    for(;;)
    {
        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Let’s take a look at each operation in details.

Step 1
The first step is to read the current lock value and remove the writer bit. Doing that will ensure that the CAS operation will fail if a writer currently has the lock acquired; if a writer has the lock, the CAS comparison will fail since we’ll be comparing a value without the writer bit to the current value with the writer bit.

Step 2
The second step simply increment the number of readers by one: it takes the expected value (without the writer bit), and adds one.

Step 3
The last step will try to atomically set the new lock value using the CAS operation. Remember that in step 1 we removed the writer bit; that way, if the bit was set before or during our lock tentative, the CAS operation will fail and we’ll have to try again.

Releasing a reader lock

Releasing a reader lock is simple: we just decrement the lock value (line 3):

FORCEINLINE void UnlockReader()
{
    kInterlockedDecrement(&Lock);
}

Remember that the writer bit was carefully placed on the most significant bit? The reason is that even if a writer has acquired the lock during our read operation and is waiting for readers to end, decrementing the lock value will never affect the writer bit. This has the advantage of avoiding a potential CAS loop.

Acquiring the writer lock

Acquiring the lock for a writer goes like this:

  1. Read the lock value and remove the writer bit (line 5).
  2. Compute the new lock value by setting the writer bit (line 6).
  3. Try to atomically exchange the new value with the old one (line 8).
  4. If step 3 worked, wait for active readers to release their locks (line 10).
FORCEINLINE void LockWriter()
{
    for(;;)
    {
        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = (OldLock | 0x80000000);

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
        {
            while(Lock & 0x7fffffff);
            return;
        }
    }
}

Let’s take a look at each operation in details.

Step 1
The first step is to read the current lock value and remove the writer bit. As with acquiring the reader lock, doing that will ensure that the CAS operation will fail if a writer currently has the lock acquired or did so while we were computing new values.

Step 2
The second step simply set the writer bit, to mark the lock as locked by a writer.

Step 3
The third step will try to atomically set the new lock value using the CAS operation. Again, this step if similar to acquiring a reader lock.

Step 4
The last step will simply busy-wait for all readers to release their locks.

Releasing the writer lock

Releasing a writer lock is simple: we just set the lock value to zero (line 4):

FORCEINLINE void UnlockWriter()
{
    kAssert(Lock == 0x80000000);
    Lock = 0;
}

Normally, only the writer bit should be set at this point (no readers should have the lock acquired), so setting the lock value to zero will release the writer lock and let potential waiting readers and/or writers to acquire the lock.

Recursion

Expert readers might have noticed that the code doesn’t fully support recursive locks.

Readers recursion
The code inherently supports recursion for readers: if a thread acquires the reader lock several times, this will simply increment the number of readers properly and will work as expected.

That said, using recursion for the readers lock should be done with great caution: if the lock is acquired for a thread and another thread acquire the writer lock, reader recursion will lead to a deadlock.

Writers recursion
On the other hand, acquiring the writer locks is not recursive: when already locked, any attempt to lock again the writer lock will fail on line 8 of the LockWriter function.

Adding writer recursion is just a matter of deriving our read-write spinlock class to keep track of the current thread-id and recursion depth:

class RWRecursiveSpinLock: public RWSpinLock
{
    tU32 ThreadID;
    tU32 RecursiveDepth;

public:
    FORCEINLINE RWRecursiveSpinLock()
    : ThreadID(-1)
    , RecursiveDepth(0)
    {}
};

Acquiring the writer lock
When the current thread already has the writer lock, we simply increment the recursion depth and return without calling the lock function; other threads trying to acquire the writer lock will wait on line 7:

FORCEINLINE void LockWriter()
{
    tU32 curThreadID = kGetCurrentThreadId();

    if(ThreadID != curThreadID)
    {
        RWSpinLock::LockWriter();
        kAssert(RecursiveDepth == 0);
        ThreadID = curThreadID;
    }

    RecursiveDepth++;
}

Releasing the writer lock
When releasing the writer lock, we first decrement the recursion depth (line 7) and only call the unlock function if the depth is zero (line 12), meaning that this is the last unlock call:

FORCEINLINE void UnlockWriter()
{
    tU32 curThreadID = kGetCurrentThreadId();
    kAssert(ThreadID == curThreadID);

    kAssert(RecursiveDepth > 0);
    RecursiveDepth--;

    if(RecursiveDepth == 0)
    {
        ThreadID = -1;
        RWSpinLock::UnlockWriter();
    }
}

Remember that all this is done under the writer lock, so no atomic operations are required and recursion is safe.

Optimizations

The code presented in this post can be optimized further to reduce its impact on performances.

Reducing atomic operations
Atomic operations aren’t free; so the lock functions should wait until the writer bit is set to zero (line 5):

FORCEINLINE void LockReader()
{
    for(;;)
    {
        while(Lock & 0x80000000);

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Waiting until the writer bit is set to zero is not a formal promise that we’ll successfully acquire the lock, since other threads might be waiting for the writer lock at the same time, but will greatly reduce the use of atomic operations while doing so.

Busy-waiting
Currently, the presented code is busy-waiting while waiting for locks to be released or when the CAS operation fails; this may not be the best thing to do depending on the situation.

When running on systems with hyper-threading or when there’s more threads than the number of physical processors, it could happen than the spinlock will be waiting for another thread running on the same logical processor. If this situation happens too much, using a kernel lock should be considered.

In situations when a spinlock is wanted, simply yielding the processor to other hyper-threads (using the YieldProcessor macro on MS platforms, for example), has a positive impact on performance (line 6):

FORCEINLINE void LockReader()
{
    for(;;)
    {
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Final optimized code
Here’s what the final lock functions looks like:

FORCEINLINE void LockReader()
{
    for(;;)
    {
        // Wait until there's no active writer
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

FORCEINLINE void LockWriter()
{
    for(;;)
    {
        // Wait until there's no active writer
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = (OldLock | 0x80000000);

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
        {
            // Wait for active readers to release their locks
            while(Lock & 0x7fffffff)
                kYieldProcessor();

            return;
        }
    }
}

Conclusion

In this post, I presented a complete, recursive read-write spinlock, along with some hints on when it should be used.

Always remember to profile your code to see if such a lock could be useful in your particular case and make sure it has a positive impact.

Next: Optimizing the recursive read-write spinlock.

Understanding Memory Ordering

March 8, 2012 13 comments

Introduction

In a previous post on atomic operations, I skipped an important topic, which is memory ordering issues.

Behind the hood, the processor might reorder memory operations (reads and writes) for many reasons. This is normally not a problem on programs running on single-core machines. On the other hand, multi-threaded programs running on multi-cores machines can suffer from that: the sequence in which reads and writes operations are performed by the processor can be different from what the order of execution of those operations are from the programmer’s point of view.

Simple Example
Let’s take a look at a simple example:

volatile bool Ready = false;
int Value = 0;

// Thread A
while(!Ready) {}
printf("%d", Value);

// Thread B
Value = 1;
Ready = true;

The expected value to be printed is 1, obviously. But what if Ready end-up written in memory before Value does? The value of Value will not be 1, resulting in random behavior.

Production Code Example
In a previous game I’ve worked on, which was on Xbox 360, we had this piece of code that registered strings in a hash-table using the CRC of the string. Only the main thread was allowed to create new strings in the table, but all threads were allowed to look-up for existing strings.

// Called to register a new string in the hash-table, returning the unique CRC value.
tU32 RegisterString(tChar* String)
{
    // Allocate a new StringDesc from the freelist.
    StringDesc* NewStringDesc = FreeStringDescs.Allocate();

    // Set the string desc info.
    NewStringDesc->String = String;
    NewStringDesc->CRC = StrHash(String);

    // Prepare the entry to be registered into the hash-table.
    tU32 HashBucket = NewStringDesc->CRC & (STRING_DESC_SIZE - 1);
    NewStringDesc->Next = Hash[HashBucket];

    // A memory barrier is required here on some architectures, before registering
    // the new entry into the hash-table. Otherwise, other threads might access the
    // entry before String/CRC/Next members gets written in memory, resulting in
    // random crashes.
    ExportBarrier();

    // Register the allocation in the hashtable. At this point, this entry is
    // visible to all other threads.
    Hash[HashBucket] = NewStringDesc;

    return NewStringDesc->CRC;
}

It took a while to figure-out why we had random crashes with very-low reproduction rate before realizing that we needed a way to ensure that the new entry values were written into memory before actually registering it into the shared hash-table.

As you can see in line 19, a memory barrier was used.

Memory Barriers

Memory barriers are a set of processor instructions used to force pending memory operations to complete before continuing. There are three types of barrier semantics: acquire, release and fence barriers.

Acquire Semantics
Whenever you use atomic operations to gain access to some data, your code must make sure that other processors sees the lock before any other changes that will be made. This is what we call acquire semantics, because the code is trying to acquire ownership of some data. This is also referred as read barriers or import barriers.

  operation 1
  operation 2
<-operation 3-Acquire-> 3 is visible before 4-5
  operation 4
  operation 5

Release Semantics
On the other hand, if you use atomic operations to release recently modified data, your code must make sure that the new data is visible before releasing it. This was exactly the case with the previous string-table example. This is what we call release semantics, because the code is trying to release ownership of some data. This is also referred as write barriers or export barriers.

  operation 1
  operation 2
<-operation 3-Release-> 1-2 are visible before 3
  operation 4
  operation 5

Fence Semantics
A fence semantics combines both acquire and release semantics behavior.

  operation 1
  operation 2
<-operation 3-Fence-> 1-2 are visible before 3, 3 is visible before 4-5
  operation 4
  operation 5

Implementations

As the implementation differs depending on the architecture, it’s a good habit to wrap those functionality into platform-agnostic functions, as seen in the previous examples.

Windows
On Windows, the atomic functions acts as full memory barriers (fences). Each of these functions comes with additional functions with the word “Acquire” and “Release” in their names, that can generate different types of barriers semantics depending if the target platform supports it (such as Itanium).

For example, the InterlockedCompareExchange function also comes as InterlockedCompareExchangeAcquire and InterlockedCompareExchangeRelease.

PowerPC
On PowerPC architectures, such as the Xbox 360 and PS3, memory barriers are implemented using the lwsync instruction. There is no distinction between the different semantics, so it acts as a read/write memory barrier.

Synchronization Primitives

Synchronization primitives requires deep knowledge of the architecture’s memory model and must use memory barriers to prevent memory ordering issues. As a general rule, a programmer should never write its own locks, except in rare situations where special requirements are made.

Spin Lock Example
Let’s take a look at the spinlock I shown in my Lessons learnt while spinning post:

class SpinLock
{
    volatile tInt LockSem;

public:
    FORCEINLINE SpinLock()
    : LockSem(0)
    {}

    FORCEINLINE tBool Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                ImportBarrier();
                return;
            }
        }
    }

    FORCEINLINE void Unlock()
    {
        ExportBarrier();
        LockSem = 0;
    }
};

An import barrier is required on line 18 when the lock is acquired to make sure other threads don’t see upcoming changes before seeing that the data is locked. Also, an export barrier is required on line 26 to make sure that all the pending memory operations completes before releasing the lock.

The ‘volatile’ C++ Keyword

A quick word on the volatile C++ keyword: according to Microsoft, volatile variables acts as memory barriers: “A write to a volatile object (volatile write) has Release semantics; A read of a volatile object (volatile read) has Acquire semantics”.

Since this is Microsoft-specific, we cannot rely on this behavior on other platforms.

Conclusion

  • Always use the OS-provided synchronizations primitives when locks are required.
  • The volatile keyword should only be used to prevent compiler optimizations, such as keeping a variable in a register while waiting it to changes.
  • Make sure you understand memory barriers whenever implementing multi-cores programs, especially lockless algorithms.
  • When in doubt, use a full barrier (fence).

Understanding Atomic Operations

November 30, 2011 11 comments

Introduction

Atomic operations are the building blocks of synchronization primitives and non-blocking algorithms. They guarantee that, when modifying a memory location, it will happen without any interference from other threads. They are required whenever writing applications for multiple-cores architectures.

Consider this function:

void NonAtomicAND(tS32* Value, tS32 Op)
{
    *Value &= Op;
}

Which translates to this on x86:

mov eax, dword ptr [Value]  
mov ecx, dword ptr [eax]  
and ecx, dword ptr [Op]  
mov dword ptr [eax], ecx  

As you can see, we read the value in a register on line 2. Then we perform the operation on line 3 and finally update the memory location with the new value on line 4. What would happen if another thread updates the same memory location while we’re holding the new value in a register? That’s it: undefined behavior. We need a way to know if the memory location was modified between the time we performed the read and the write.

Hardware Implementations

Depending on the platform, there’s two atomic operations hardware implementations: Compare-And-Swap (CAS) on x86 and Load-Link/Store-Conditional (LL/SC) on Alpha, PowerPC, MIPS and ARM.

CAS

CAS compares a memory location with a given value, and if they are the same the new value is set. The return value is the value before the swap was attempted. That way, we can know if the memory location was written to between our read and write, and repeat the operation if it was.

Atomicity is guaranteed when used like this:

Read the original value from a memory location.
Compute the new value to be set.
Set the new value only if the memory location is still the original value.

On x86, the lock instruction prefix makes some instructions (ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG) atomics. The CMPXCHG instruction is used to implement CAS.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using CAS:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 OldValue = *Value;
        const tS32 NewValue = OldValue & Op;

        // If the result is the original value, the new value was stored.
        if(CAS(Value, NewValue, OldValue) == OldValue)
        {
            return;
        }
    }
}

As you can see, we first read the original value and compute the new one. Then, we try to store the new value only if the current memory location is still the original one. If it changed, we need to repeat the operation until we succeed.

The ABA Problem
Be aware of the ABA problem though: between the time that you read the original value and try to swap it with the new one, it could have been changed to something else and back to the original value. In that case, that change will not be detected.

For simple operations, like the AtomicAND function, this isn’t a problem since the resulting value is still valid in the end. But when implementing lock-free algorithms such as queues or linked-lists, this will cause unwanted behavior.

The usual solution to this problem is to append a counter to the values which is incremented at each operation. That way, A-B-A becomes A1-B2-A3 and the changes is properly detected. It may not be as easy as it looks though and may requires 64/128 bits CAS instructions, especially when working with pointers.

LL/SC

LL/SC works differently. It is implemented using two instructions (lwarx (LL)/stwcx (SC) on PowerPC): LL load and reserve the memory location, while SC store the new value only if the memory location is still reserved. The memory reservation is lost whenever it gets written to. For this reason, LL/SC does not suffer from the ABA problem.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using LL/SC:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 NewValue = __lwarx(Value) & Op;

        // If the reservation was still valid, new value was stored.
        if(__stwcx(Value, NewValue))
        {
            return;
        }
    }
}

Reservation Granularity
Depending on the architecture, the reservation is performed on aligned words or, in the worst case, on the whole cache-lines. On current-generation PowerPC consoles, the reservation granularity is cache-lines, so special care must be taken to avoid false sharing or the performance hit can be dramatic.

CAS Emulation
Implementing CAS using LL/SC instructions may be tempting to maintain platform-agnostic code. Doing so adds a compare and a branch though, which could be optimized depending on the compiler and optimization level.

Here’s the code for AtomicAND from the native LL/SC implementation:

loop:   lwarx   r6,0,r3          # Load and create reservation
        and     r4,r6,r5         # Compute the new value
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

And here’s the one using the CAS emulation:

loop:   lwz     r8,0(r3)         # Load the original value
        and     r4,r8,r5         # Compute the new value
        lwarx   r6,0,r3          # Load and create reservation
        cmpw    r8,r6            # CAS comparison
        bne     loop             # Retry if not equal
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

As we can see, CAS emulation is slower. Depending on the usage, using native LL/SC might help in some edge cases.

Performance

If threads competition is low, the compare loops should almost never loop. However, when used very concurrently by a high number of threads, the loop count can be quite high and can even lead to a livelock, where a thread is never able to set the new value (or takes a large amount of time to do so) due to other threads always modifying it at the same time. Fixing this particular issue might require algorithm-level refactoring.

Conclusion

In an upcoming post, I’ll talk about memory ordering issues, which is especially important when using atomic operations on some platforms.

Lessons learnt while spinning

September 24, 2011 3 comments

In a recent project I worked on, I tried to minimize context switches by strategically replacing a critical section in our memory allocator by a custom spinlock that looked like this:

class SpinLock
{
    volatile tInt LockSem;

public:
    FORCEINLINE SpinLock()
    : LockSem(0)
    {}

    FORCEINLINE tBool Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                MemoryBarrier();
                return;
            }
        }
    }

    FORCEINLINE void Unlock()
    {
        MemoryBarrier();
        LockSem = 0;
    }
};

This indeed worked as predicted, but produced an unwanted side-effect on platforms without threads priority inversion (where a high-priority thread spins in a loop while waiting for a low-priority thread to release a lock) as the Xbox 360 (described in this technical article).

It created rare situations where the game was deadlocked, and the time spent to figure-out what was going on was quite high. At first, we played with threads priorities and affinities, but it had other unwanted side-effects. Another solution tried was to yield after a certain amount of spin loops, generating a context switch that let the OS reschedule threads correctly. In the end, we reverted to a critical section.

Lesson learnt? We have to be very careful when busy-waiting, especially on OSs that don’t have priority inversion protection mechanisms. Using the OS synchronization primitives is the best way of coordinating multiple threads. When locking becomes a performance bottleneck, we have to address the underlying problem (using lockless programming, etc).

Categories: Programming Tags:
%d bloggers like this: