Home > Programming > Understanding Atomic Operations

Understanding Atomic Operations

Introduction

Atomic operations are the building blocks of synchronization primitives and non-blocking algorithms. They guarantee that, when modifying a memory location, it will happen without any interference from other threads. They are required whenever writing applications for multiple-cores architectures.

Consider this function:

void NonAtomicAND(tS32* Value, tS32 Op)
{
    *Value &= Op;
}

Which translates to this on x86:

mov eax, dword ptr [Value]  
mov ecx, dword ptr [eax]  
and ecx, dword ptr [Op]  
mov dword ptr [eax], ecx  

As you can see, we read the value in a register on line 2. Then we perform the operation on line 3 and finally update the memory location with the new value on line 4. What would happen if another thread updates the same memory location while we’re holding the new value in a register? That’s it: undefined behavior. We need a way to know if the memory location was modified between the time we performed the read and the write.

Hardware Implementations

Depending on the platform, there’s two atomic operations hardware implementations: Compare-And-Swap (CAS) on x86 and Load-Link/Store-Conditional (LL/SC) on Alpha, PowerPC, MIPS and ARM.

CAS

CAS compares a memory location with a given value, and if they are the same the new value is set. The return value is the value before the swap was attempted. That way, we can know if the memory location was written to between our read and write, and repeat the operation if it was.

Atomicity is guaranteed when used like this:

Read the original value from a memory location.
Compute the new value to be set.
Set the new value only if the memory location is still the original value.

On x86, the lock instruction prefix makes some instructions (ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG) atomics. The CMPXCHG instruction is used to implement CAS.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using CAS:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 OldValue = *Value;
        const tS32 NewValue = OldValue & Op;

        // If the result is the original value, the new value was stored.
        if(CAS(Value, NewValue, OldValue) == OldValue)
        {
            return;
        }
    }
}

As you can see, we first read the original value and compute the new one. Then, we try to store the new value only if the current memory location is still the original one. If it changed, we need to repeat the operation until we succeed.

The ABA Problem
Be aware of the ABA problem though: between the time that you read the original value and try to swap it with the new one, it could have been changed to something else and back to the original value. In that case, that change will not be detected.

For simple operations, like the AtomicAND function, this isn’t a problem since the resulting value is still valid in the end. But when implementing lock-free algorithms such as queues or linked-lists, this will cause unwanted behavior.

The usual solution to this problem is to append a counter to the values which is incremented at each operation. That way, A-B-A becomes A1-B2-A3 and the changes is properly detected. It may not be as easy as it looks though and may requires 64/128 bits CAS instructions, especially when working with pointers.

LL/SC

LL/SC works differently. It is implemented using two instructions (lwarx (LL)/stwcx (SC) on PowerPC): LL load and reserve the memory location, while SC store the new value only if the memory location is still reserved. The memory reservation is lost whenever it gets written to. For this reason, LL/SC does not suffer from the ABA problem.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using LL/SC:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 NewValue = __lwarx(Value) & Op;

        // If the reservation was still valid, new value was stored.
        if(__stwcx(Value, NewValue))
        {
            return;
        }
    }
}

Reservation Granularity
Depending on the architecture, the reservation is performed on aligned words or, in the worst case, on the whole cache-lines. On current-generation PowerPC consoles, the reservation granularity is cache-lines, so special care must be taken to avoid false sharing or the performance hit can be dramatic.

CAS Emulation
Implementing CAS using LL/SC instructions may be tempting to maintain platform-agnostic code. Doing so adds a compare and a branch though, which could be optimized depending on the compiler and optimization level.

Here’s the code for AtomicAND from the native LL/SC implementation:

loop:   lwarx   r6,0,r3          # Load and create reservation
        and     r4,r6,r5         # Compute the new value
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

And here’s the one using the CAS emulation:

loop:   lwz     r8,0(r3)         # Load the original value
        and     r4,r8,r5         # Compute the new value
        lwarx   r6,0,r3          # Load and create reservation
        cmpw    r8,r6            # CAS comparison
        bne     loop             # Retry if not equal
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

As we can see, CAS emulation is slower. Depending on the usage, using native LL/SC might help in some edge cases.

Performance

If threads competition is low, the compare loops should almost never loop. However, when used very concurrently by a high number of threads, the loop count can be quite high and can even lead to a livelock, where a thread is never able to set the new value (or takes a large amount of time to do so) due to other threads always modifying it at the same time. Fixing this particular issue might require algorithm-level refactoring.

Conclusion

In an upcoming post, I’ll talk about memory ordering issues, which is especially important when using atomic operations on some platforms.

About these ads
  1. Flavio
    March 5, 2012 at 8:40 am

    I found this post very interesting, and I’m eagerly awaiting for the post on memory ordering issues. Thanks!

    • jfdube
      March 5, 2012 at 10:18 am

      In fact, the wikipedia page is so clear and complete, I don’t know if a blog post on the subject will really be required.

      • Flavio
        March 8, 2012 at 3:23 am

        For me it’s far from complete, particularly with respect to available operations on x86 platforms, meaning of acquire/release operations, and actual code!

        • jfdube
          March 8, 2012 at 3:07 pm

          Done, sir!

  2. June 14, 2012 at 12:03 am

    Hey JF, if I could offer one possible correction, it’s that the performance bottleneck you mention near the end is technically not a livelock. Of course individual threads can starve, but each time a thread has to repeat the CAS or LL/SC loop, it’s because another thread made progress.

  1. February 26, 2012 at 3:20 pm
  2. March 8, 2012 at 8:46 am
  3. June 12, 2012 at 7:07 am
  4. January 3, 2014 at 11:34 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: