Home > Programming > Optimizing the recursive read-write spinlock

Optimizing the recursive read-write spinlock

In my previous post on implementing a recursive read-write spinlock, I briefly talked about trivial optimizations that could be done to improve the time spent in the lock functions.

In this post, we’ll measure these optimizations, propose more optimizations to the read-write spinlock and compare all optimizations results.

Stress tests
The numbers that will be given in this post comes from two stress-test applications that were written to compare the different optimizations. Essentially, the tests consist of spawning some threads and executing huge amount of locks in parallel: a read test that only test the reader locks without any writers involved, and a read-write test that tests reader and writer locks simultaneously.

Measuring optimizations from the first post

The cost of atomic operations
The first optimization that was done in the first post was to reduce the number of atomic operations by busy-waiting while the writer bit was set and only attempt an atomic operation when required.

The measured speedup was ~34% faster in the read-write test. We can clearly see that atomic operations aren’t free.

Busy-waiting
The second optimization that was done in the first post was to yield the processor while waiting.

The measured speedup was ~5% faster in the read-write test; since the test was using the same number of threads than logical cores, the speedup isn’t spectacular. On the other hand, applications with a lot more threads than logical processors would observe a higher performance gain, so this optimization is still valid for all cases.

Readers contention

A thing that I noticed after writing the spinlock post was that it is suffering from readers contention, due to the CAS operations.

Consider this simple example:

Thread-1               Thread-2               Thread-3
1: Read lock value     1: Read lock value     1: Read lock value
2: Increment readers   2: Increment readers   2: Increment readers
3: CAS (succeeds)      3: CAS (fails)         3: CAS (fails)
4: ...                 4: Read lock value     4: Read lock value
5: ...                 5: Increment readers   5: Increment readers
6: ...                 6: CAS (succeeds)      6: CAS (fails)
7: ...                 7: ...                 7: Read lock value
8: ...                 8: ...                 8: Increment readers
9: ...                 9: ...                 9: CAS (succeeds)

This is technically not contention, as each thread is making progress, but it goes against the original purpose of the read-write spinlock mentionned in the first post: must allow any number of readers to acquire the lock without contention.

The key here to solve this problem is to avoid CAS loops by using other types of atomic instructions.

New implementation

In the new implementation, I’ll solely use atomic increments, decrements, adds and subtracts. This will completly remove the CAS contention.

The spinlock will still be implemented using a 32-bits value, but more bits will need to be reserved for the writer (we’ll see why shortly):

|WWWWWWWWWWWW|RRRRRRRRRRRRRRRRRRRR|

The writer bits are still carefully placed on the most significant bits, as described in the previous post.

Acquiring a reader lock

To acquire the reader lock, we atomically increment the lock value and inspect the return value (line 8): if any of writer bits are set, it means that a writer held the lock before us and we failed. In that case, we atomically decrement the lock value and try again (line 11). Otherwise, we acquired the reader lock successfuly.

FORCEINLINE void LockReader()
{
    for(;;)
    {
        // Wait for active writer to release the lock
        while(Lock & 0xfff00000)
            kYieldProcessor();

        if((kInterlockedIncrement(&Lock) & 0xfff00000) == 0)
            return;

        kInterlockedDecrement(&Lock);
    }
}

Releasing a reader lock

Releasing a reader lock with the new implementation can be achieved by atomically decrementing the lock value (same code as previous implementation):

FORCEINLINE void UnlockReader()
{
    kInterlockedDecrement(&Lock);
}

Acquiring the writer lock

To acquire the writer lock, we atomically increment the numbers of writers (using an atomic add) and inspect the return value (line 9): if only the first writer bit was set during this operation, we acquired the writer lock successfuly and wait for any readers to end (line 12). Otherwise, another writer held the lock before us and we failed. In that case, we atomically decrement the number of writers (using an atomic subtract) and try again (line 18).

FORCEINLINE void LockWriter()
{
    for(;;)
    {
        // Wait for active writer to release the lock
        while(Lock & 0xfff00000)
            kYieldProcessor();

        if((kInterlockedAdd(&Lock, 0x100000) & 0xfff00000) == 0x100000)
        {
            // Wait until there's no more readers
            while(Lock & 0x000fffff)
                kYieldProcessor();

            return;
        }

        kInterlockedSub(&Lock, 0x100000);
    }
}

This is why the lock value now needs more bits for the writers part: when several writers try to acquire the lock, the value might get incremented up to at least the number of active threads. When the value isn’t exactly one, the writer failed to acquire the lock and decrement the value before trying again.

In this example, 12 bits were reserved for to writers, allowing up to 4096 concurrent writer threads. This value should be tweaked depending on the application.

Releasing the writer lock

Releasing a writer lock is done by atomically decrementing the number of writers, using an atomic subtract:

FORCEINLINE void UnlockWriter()
{
    kInterlockedSub(&Lock, 0x100000);
}

Optimization results

The measured speedup for the new implementation in the read-write test was ~62%. Remember that this test was only acquiring the reader lock, with no writers involved. This is quite an improvement!

Overall results comparison

Here’s the results of all optimizations described in this post:

  • SpinLock is a standard spinlock.
  • RWLock is the read-write spinlock from the first post, without any optimization.
  • RWLock_1 adds the atomic operations reduction optimization.
  • RWLock_2 adds the yield processor optimization.
  • RWLock_3 is the new implementation, with all optimizations.

Conclusion

Some things to keep in mind:

  • Atomic operations aren’t free; in fact, they cost a lot and should be avoided whenever possible.
  • When writing lockfree algorithms using CAS loops, think twice and ask yourself if this could be done using other atomic operations.
Advertisements
  1. July 28, 2014 at 9:17 pm

    I’m not sure if your LockReader() is entirely safe. I can imagine a scenario where threads could stay locked forever.

    – Three threads enter the lock.
    – Each increment the counter. ( counter is now at 3 )
    – Only one gets though. The other two are left waiting
    – The main thread exits. ( decrementing the counter to 2 )
    – One of the waiting threads is allowed to execute. He decrements to 1 then increments to 2 again. And then gets put to sleep.
    – The other waiting thread is allowed to execute. He decrements to 1 then increments to 2 again. And then gets put to sleep.

    The last two steps continue forever….

    • jfdube
      March 18, 2015 at 6:18 am

      As readers won’t wait when there’s no active writers, I don’t think this scenario can happen.

  2. Patrickvl
    September 17, 2015 at 12:16 pm

    This looks a bit like SRWLocks – https://msdn.microsoft.com/en-us/library/aa904937(v=VS.85).aspx

    I wonder how the two compare…

  3. January 28, 2016 at 7:15 am

    After extensive testing, we found out this lock performs better then SRWLocks.

    One question about LockWriter;
    The while “Wait for active writer to release the lock” loop could be removed, if kYieldProcessor() is placed below the correcting InterlockedSub (just before the main loop repeats).
    This might be faster (since less code is executed, and the first thing that happens, is the attempt to write-lock).
    But, would this change anything about behaviour of this lock?

    • jfdube
      February 29, 2016 at 2:01 pm

      I would be *very* interested on having details about your “extensive testing”…

  4. January 28, 2016 at 7:37 am

    Nevermind my last remark, it would increase the number of atomic operations. My bad, I should’ve remembered that!

  1. January 12, 2014 at 9:28 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: