
Understanding Atomic Operations

Introduction

Atomic operations are the building blocks of synchronization primitives and non-blocking algorithms. They guarantee that a modification to a memory location happens without interference from other threads, and they are essential when writing applications for multi-core architectures.

Consider this function:

void NonAtomicAND(tS32* Value, tS32 Op)  // tS32: a signed 32-bit integer type
{
    *Value &= Op;
}

Which translates to this on x86:

mov eax, dword ptr [Value]  
mov ecx, dword ptr [eax]  
and ecx, dword ptr [Op]  
mov dword ptr [eax], ecx  

As you can see, we read the value into a register on line 2, perform the operation on line 3, and finally write the new value back to the memory location on line 4. What would happen if another thread updated the same memory location while we were holding the new value in a register? Exactly: a race, and the other thread's update would be silently lost when we store our stale result. We need a way to know whether the memory location was modified between the time we performed the read and the write.

Hardware Implementations

Depending on the platform, there are two main hardware implementations of atomic operations: Compare-And-Swap (CAS) on x86, and Load-Link/Store-Conditional (LL/SC) on Alpha, PowerPC, MIPS and ARM.

CAS

CAS compares a memory location with a given value, and if they are the same, the new value is stored. The return value is the value before the swap was attempted, so we can tell whether the memory location was written to between our read and write, and repeat the operation if it was.

Atomicity is guaranteed when used like this:

  1. Read the original value from the memory location.
  2. Compute the new value to be set.
  3. Store the new value only if the memory location still holds the original value.

On x86, the LOCK instruction prefix makes a number of instructions (ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG) atomic. The CMPXCHG instruction is used to implement CAS.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using CAS:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 OldValue = *Value;
        const tS32 NewValue = OldValue & Op;

        // If the result is the original value, the new value was stored.
        if(CAS(Value, NewValue, OldValue) == OldValue)
        {
            return;
        }
    }
}

As you can see, we first read the original value and compute the new one. Then, we try to store the new value only if the current memory location is still the original one. If it changed, we need to repeat the operation until we succeed.

The ABA Problem
Be aware of the ABA problem though: between the time that you read the original value and try to swap it with the new one, it could have been changed to something else and back to the original value. In that case, that change will not be detected.

For simple operations, like the AtomicAND function, this isn't a problem, since the resulting value is still valid in the end. But when implementing lock-free algorithms such as queues or linked lists, it can cause serious misbehavior.

The usual solution to this problem is to append a counter to the value, incremented on each operation. That way, A-B-A becomes A1-B2-A3 and the change is properly detected. It may not be as easy as it looks though, and may require 64- or 128-bit CAS instructions, especially when working with pointers.

LL/SC

LL/SC works differently. It is implemented using two instructions (lwarx (LL) and stwcx (SC) on PowerPC): LL loads the value and reserves the memory location, while SC stores the new value only if the reservation is still valid. The reservation is lost whenever the location is written to, so LL/SC does not suffer from the ABA problem.

Here’s the atomic version of the NonAtomicAND function we’ve seen in the introduction, using LL/SC:

void AtomicAND(volatile tS32* Value, tS32 Op)
{
    while(1)
    {
        const tS32 NewValue = __lwarx(Value) & Op;

        // If the reservation was still valid, new value was stored.
        if(__stwcx(Value, NewValue))
        {
            return;
        }
    }
}

Reservation Granularity
Depending on the architecture, the reservation is made on an aligned word or, in the worst case, on a whole cache line. On current-generation PowerPC consoles, the reservation granularity is the cache line, so special care must be taken to avoid false sharing, or the performance hit can be dramatic.

CAS Emulation
Implementing CAS using LL/SC instructions may be tempting in order to keep platform-agnostic code. Doing so adds a compare and a branch though, which may or may not be optimized away depending on the compiler and optimization level.

Here’s the code for AtomicAND from the native LL/SC implementation:

loop:   lwarx   r6,0,r3          # Load and create reservation
        and     r4,r6,r5         # Compute the new value
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

And here’s the one using the CAS emulation:

loop:   lwz     r8,0(r3)         # Load the original value
        and     r4,r8,r5         # Compute the new value
        lwarx   r6,0,r3          # Load and create reservation
        cmpw    r8,r6            # CAS comparison
        bne     loop             # Retry if not equal
        stwcx   r4,0,r3          # Store the new value if the reservation is still valid
        bne     loop             # Loop if the reservation was invalidated

As we can see, the CAS emulation is slower: each attempt executes an extra load, compare, and branch. Depending on the usage, using native LL/SC directly might help in some edge cases.

Performance

If contention between threads is low, the compare loops should almost never iterate. Under heavy contention from many threads, however, the loop count can grow quite high and can even lead to a livelock, where a thread is never able to store its new value (or takes a very long time to do so) because other threads keep modifying the location first. Fixing this particular issue might require algorithm-level refactoring.

Conclusion

In an upcoming post, I'll talk about memory ordering issues, which are especially important when using atomic operations on some platforms.

  1. Flavio
    March 5, 2012 at 8:40 am

    I found this post very interesting, and I’m eagerly awaiting for the post on memory ordering issues. Thanks!

    • jfdube
      March 5, 2012 at 10:18 am

      In fact, the wikipedia page is so clear and complete, I don’t know if a blog post on the subject will really be required.

      • Flavio
        March 8, 2012 at 3:23 am

        For me it’s far from complete, particularly with respect to available operations on x86 platforms, meaning of acquire/release operations, and actual code!

        • jfdube
          March 8, 2012 at 3:07 pm

          Done, sir!

  2. June 14, 2012 at 12:03 am

    Hey JF, if I could offer one possible correction, it’s that the performance bottleneck you mention near the end is technically not a livelock. Of course individual threads can starve, but each time a thread has to repeat the CAS or LL/SC loop, it’s because another thread made progress.

