Home > Programming > Implementing a recursive read-write spinlock

Implementing a recursive read-write spinlock

Very often I see programmers using a critical section to make some code thread-safe. What happens is that most of the time, the protected code will only read data, while blocking other reading threads and lead to poor performances.

Most of the time, a read-write lock can be used to only block readers when there’s a write operation in progress, greatly reducing the time spent waiting for other threads to release the lock.

Why a spinlock?
In a previous post, I spoke about the dangers of busy-waiting, especially when implementing our own spinlocks on OSs that doesn’t support priority inversion protection mechanisms.

However, writing our own spinlocks implementation can have several advantages, the main one being pedagogic. Other advantages of spinlocks is that they are lightweight and doesn’t trigger a context switch when waiting.

Requirements

A read-write spinlock have the following requirements:

  1. Must allow any number of readers to acquire the lock without contention.
  2. Must allow only a single writer to acquire the lock while:
    • Block future readers.
    • Wait until active readers release their locks.

Allowing a writer to acquire the lock while there’s still active readers is important, as it prevents from write-starvation; that is, if the lock was designed to wait until there would be no active readers before allowing a writer to acquire the lock, it could happen that a writer would never be able to acquire the lock, leading to write-starvation.

Implementation

Our spinlock will be implemented using a 32-bits value. Because only a single writer can acquire the lock, we’ll reserve a single bit to know if the lock is acquired by a writer and the 31 other bits to count the number of active readers (allowing 2^31 readers):

|W|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|

The writer bit is carefully placed on the most significant bit, we’ll see why later on.

The implementation will use the compare-and-swap (CAS) atomic operation. If you’re not sure what this operation does, I suggest that you take some time to read my previous post on atomic operations before continuing.

RWSpinLock definition

As described in the previous section, the spinlock contains a single 32-bits value, initialized to zero:

class RWSpinLock
{
    volatile tU32 Lock;

public:
    RWSpinLock()
    : Lock(0)
    {}
};

Acquiring a reader lock

Acquiring the lock for a reader goes like this:

  1. Read the lock value and remove the writer bit (line 5).
  2. Compute the new lock value by incrementing the number of readers (line 6).
  3. Try to atomically exchange the new value with the old one (line 8).
FORCEINLINE void LockReader()
{
    for(;;)
    {
        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Let’s take a look at each operation in details.

Step 1
The first step is to read the current lock value and remove the writer bit. Doing that will ensure that the CAS operation will fail if a writer currently has the lock acquired; if a writer has the lock, the CAS comparison will fail since we’ll be comparing a value without the writer bit to the current value with the writer bit.

Step 2
The second step simply increment the number of readers by one: it takes the expected value (without the writer bit), and adds one.

Step 3
The last step will try to atomically set the new lock value using the CAS operation. Remember that in step 1 we removed the writer bit; that way, if the bit was set before or during our lock tentative, the CAS operation will fail and we’ll have to try again.

Releasing a reader lock

Releasing a reader lock is simple: we just decrement the lock value (line 3):

FORCEINLINE void UnlockReader()
{
    kInterlockedDecrement(&Lock);
}

Remember that the writer bit was carefully placed on the most significant bit? The reason is that even if a writer has acquired the lock during our read operation and is waiting for readers to end, decrementing the lock value will never affect the writer bit. This has the advantage of avoiding a potential CAS loop.

Acquiring the writer lock

Acquiring the lock for a writer goes like this:

  1. Read the lock value and remove the writer bit (line 5).
  2. Compute the new lock value by setting the writer bit (line 6).
  3. Try to atomically exchange the new value with the old one (line 8).
  4. If step 3 worked, wait for active readers to release their locks (line 10).
FORCEINLINE void LockWriter()
{
    for(;;)
    {
        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = (OldLock | 0x80000000);

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
        {
            while(Lock & 0x7fffffff);
            return;
        }
    }
}

Let’s take a look at each operation in details.

Step 1
The first step is to read the current lock value and remove the writer bit. As with acquiring the reader lock, doing that will ensure that the CAS operation will fail if a writer currently has the lock acquired or did so while we were computing new values.

Step 2
The second step simply set the writer bit, to mark the lock as locked by a writer.

Step 3
The third step will try to atomically set the new lock value using the CAS operation. Again, this step if similar to acquiring a reader lock.

Step 4
The last step will simply busy-wait for all readers to release their locks.

Releasing the writer lock

Releasing a writer lock is simple: we just set the lock value to zero (line 4):

FORCEINLINE void UnlockWriter()
{
    kAssert(Lock == 0x80000000);
    Lock = 0;
}

Normally, only the writer bit should be set at this point (no readers should have the lock acquired), so setting the lock value to zero will release the writer lock and let potential waiting readers and/or writers to acquire the lock.

Recursion

Expert readers might have noticed that the code doesn’t fully support recursive locks.

Readers recursion
The code inherently supports recursion for readers: if a thread acquires the reader lock several times, this will simply increment the number of readers properly and will work as expected.

That said, using recursion for the readers lock should be done with great caution: if the lock is acquired for a thread and another thread acquire the writer lock, reader recursion will lead to a deadlock.

Writers recursion
On the other hand, acquiring the writer locks is not recursive: when already locked, any attempt to lock again the writer lock will fail on line 8 of the LockWriter function.

Adding writer recursion is just a matter of deriving our read-write spinlock class to keep track of the current thread-id and recursion depth:

class RWRecursiveSpinLock: public RWSpinLock
{
    tU32 ThreadID;
    tU32 RecursiveDepth;

public:
    FORCEINLINE RWRecursiveSpinLock()
    : ThreadID(-1)
    , RecursiveDepth(0)
    {}
};

Acquiring the writer lock
When the current thread already has the writer lock, we simply increment the recursion depth and return without calling the lock function; other threads trying to acquire the writer lock will wait on line 7:

FORCEINLINE void LockWriter()
{
    tU32 curThreadID = kGetCurrentThreadId();

    if(ThreadID != curThreadID)
    {
        RWSpinLock::LockWriter();
        kAssert(RecursiveDepth == 0);
        ThreadID = curThreadID;
    }

    RecursiveDepth++;
}

Releasing the writer lock
When releasing the writer lock, we first decrement the recursion depth (line 7) and only call the unlock function if the depth is zero (line 12), meaning that this is the last unlock call:

FORCEINLINE void UnlockWriter()
{
    tU32 curThreadID = kGetCurrentThreadId();
    kAssert(ThreadID == curThreadID);

    kAssert(RecursiveDepth > 0);
    RecursiveDepth--;

    if(RecursiveDepth == 0)
    {
        ThreadID = -1;
        RWSpinLock::UnlockWriter();
    }
}

Remember that all this is done under the writer lock, so no atomic operations are required and recursion is safe.

Optimizations

The code presented in this post can be optimized further to reduce its impact on performances.

Reducing atomic operations
Atomic operations aren’t free; so the lock functions should wait until the writer bit is set to zero (line 5):

FORCEINLINE void LockReader()
{
    for(;;)
    {
        while(Lock & 0x80000000);

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Waiting until the writer bit is set to zero is not a formal promise that we’ll successfully acquire the lock, since other threads might be waiting for the writer lock at the same time, but will greatly reduce the use of atomic operations while doing so.

Busy-waiting
Currently, the presented code is busy-waiting while waiting for locks to be released or when the CAS operation fails; this may not be the best thing to do depending on the situation.

When running on systems with hyper-threading or when there’s more threads than the number of physical processors, it could happen than the spinlock will be waiting for another thread running on the same logical processor. If this situation happens too much, using a kernel lock should be considered.

In situations when a spinlock is wanted, simply yielding the processor to other hyper-threads (using the YieldProcessor macro on MS platforms, for example), has a positive impact on performance (line 6):

FORCEINLINE void LockReader()
{
    for(;;)
    {
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

Final optimized code
Here’s what the final lock functions looks like:

FORCEINLINE void LockReader()
{
    for(;;)
    {
        // Wait until there's no active writer
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = OldLock + 1;

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
            return;
    }
}

FORCEINLINE void LockWriter()
{
    for(;;)
    {
        // Wait until there's no active writer
        while(Lock & 0x80000000)
            kYieldProcessor();

        tU32 OldLock = (Lock & 0x7fffffff);
        tU32 NewLock = (OldLock | 0x80000000);

        if(kInterlockedCompareExchange(&Lock, NewLock, OldLock) == OldLock)
        {
            // Wait for active readers to release their locks
            while(Lock & 0x7fffffff)
                kYieldProcessor();

            return;
        }
    }
}

Conclusion

In this post, I presented a complete, recursive read-write spinlock, along with some hints on when it should be used.

Always remember to profile your code to see if such a lock could be useful in your particular case and make sure it has a positive impact.

Next: Optimizing the recursive read-write spinlock.

  1. Alon
    January 3, 2014 at 12:47 pm

    Nice 🙂
    A couple of comments…
    1. There is a subtle race in UnlockReader(): if two threads attempt to unlock while there is only one reader (due to a lock-imbalance bug), they may both succeed to pass the kAssert() and end up setting the lock variable to -1.
    2. It is important to stress that step 3 of LockWriter() must compare the old value to OldLock (with the write-lock bit clear) – without this, we could end up with a double writer-lock…

    • jfdube
      January 3, 2014 at 1:01 pm

      Thanks Alon for the comments!

      You’re right about the UnlockReader bug. In fact, I changed the asserts for the post and added this bug while doing so.:S

      Fixed now!

  2. John Regehr
    January 23, 2014 at 1:32 pm

    The first UnlockWriter function is incorrect, a C compiler is permitted to reorder stores to non-volatile variables around stores to volatile variable. Thus, it can move stores to shared variables outside of the critical section. This isn’t theoretical, it’s easy to get GCC to do it.

  3. February 8, 2014 at 10:42 pm

    On top of that, while this code will probably run fine on platforms with strong memory ordering (x86), on PowerPC it’s perfectly possible CPU will reorder memory operations (stores can move ahead of other stores, so store to Lock can be moved ahead of whatever we’re guarding with lock). Also, as interlocked operations do not act as barriers on PowerPC, it’s also possible that accessing guarded variable will be moved “before” the lock.

    • jfdube
      February 9, 2014 at 2:49 pm

      You’re right, this code is intended to be used on x64 architectures. On PowerPC and other platforms, memory barriers might be required. I should have mentionned it, but I forgot. See Preshing’s posts for in-depth explanations.

  4. February 21, 2014 at 3:43 am

    Nice article! I have featured it in the last embedded systems weekly issue http://embedsysweekly.com/issue17-arduino-gcc-c-language/
    I hope you’ll like it, cheers 🙂

    • jfdube
      February 21, 2014 at 6:51 am

      Thanks!

  1. January 12, 2014 at 9:25 pm

Leave a comment