Understanding Memory Ordering

March 8, 2012


In a previous post on atomic operations, I skipped an important topic, which is memory ordering issues.

Under the hood, the processor may reorder memory operations (reads and writes) for many reasons. This is normally not a problem for programs running on single-core machines. Multi-threaded programs running on multi-core machines, however, can suffer from it: the order in which read and write operations are performed by the processor can differ from the order in which the programmer wrote them.

Simple Example
Let’s take a look at a simple example:

volatile bool Ready = false;
int Value = 0;

// Thread A
while(!Ready) {}
printf("%d", Value);

// Thread B
Value = 1;
Ready = true;

The expected value to be printed is 1, obviously. But what if Ready ends up being written to memory before Value does? Thread A could then print 0 instead of 1, resulting in seemingly random behavior.
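For comparison, here is a sketch of the same producer/consumer pair written with C++11 std::atomic (my own illustration, not the original code): the release store on the flag pairs with the acquire load, guaranteeing that Thread A observes the write to the value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ReadyFlag(false);
int SharedValue = 0;

// Thread B: write SharedValue first, then release-store the flag so the
// write cannot become visible after the flag does.
void Publish()
{
    SharedValue = 1;
    ReadyFlag.store(true, std::memory_order_release);
}

// Thread A: spin on an acquire load; once it observes true, the
// release/acquire pairing guarantees SharedValue == 1 is visible.
int Consume()
{
    while(!ReadyFlag.load(std::memory_order_acquire)) {}
    return SharedValue;
}
```

With plain (non-atomic) variables, nothing forbids the reorderings described above; the explicit memory orders make the required ordering part of the program.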

Production Code Example
In a previous game I worked on, on Xbox 360, we had a piece of code that registered strings in a hash-table using the CRC of the string as key. Only the main thread was allowed to create new strings in the table, but all threads were allowed to look up existing strings.

// Called to register a new string in the hash-table, returning the unique CRC value.
tU32 RegisterString(tChar* String)
{
    // Allocate a new StringDesc from the freelist.
    StringDesc* NewStringDesc = FreeStringDescs.Allocate();

    // Set the string desc info.
    NewStringDesc->String = String;
    NewStringDesc->CRC = StrHash(String);

    // Prepare the entry to be registered into the hash-table.
    tU32 HashBucket = NewStringDesc->CRC & (STRING_DESC_SIZE - 1);
    NewStringDesc->Next = Hash[HashBucket];

    // A memory barrier is required here on some architectures, before registering
    // the new entry into the hash-table. Otherwise, other threads might access the
    // entry before the String/CRC/Next members get written to memory, resulting in
    // random crashes.

    // Register the allocation in the hash-table. At this point, this entry is
    // visible to all other threads.
    Hash[HashBucket] = NewStringDesc;

    return NewStringDesc->CRC;
}

It took a while to figure out why we had random crashes with a very low reproduction rate before realizing that we needed a way to ensure that the new entry's values were written to memory before actually registering it in the shared hash-table.

As the comment in the code above indicates, a memory barrier (with release semantics) was used at that point.

Memory Barriers

Memory barriers are a set of processor instructions used to force pending memory operations to complete before continuing. There are three types of barrier semantics: acquire, release and fence barriers.

Acquire Semantics
Whenever you use atomic operations to gain access to some data, your code must make sure that other processors see the lock before any subsequent changes. This is what we call acquire semantics, because the code is trying to acquire ownership of some data. This is also referred to as a read barrier or import barrier.

  operation 1
  operation 2
<-operation 3-Acquire-> 3 is visible before 4-5
  operation 4
  operation 5

Release Semantics
On the other hand, if you use atomic operations to release recently modified data, your code must make sure that the new data is visible before releasing it. This was exactly the case in the previous string-table example. This is what we call release semantics, because the code is trying to release ownership of some data. This is also referred to as a write barrier or export barrier.

  operation 1
  operation 2
<-operation 3-Release-> 1-2 are visible before 3
  operation 4
  operation 5

Fence Semantics
Fence semantics combine both acquire and release behavior.

  operation 1
  operation 2
<-operation 3-Fence-> 1-2 are visible before 3, 3 is visible before 4-5
  operation 4
  operation 5
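The diagrams above map directly onto standalone fences in C++11. As an illustration (my own sketch, not from the original post), the publication pattern can also be written with std::atomic_thread_fence instead of attaching the ordering to the atomic operations themselves:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> FenceReady(false);
int FenceValue = 0;

void FencePublish()
{
    FenceValue = 2;
    // Export (release) barrier: everything above is visible before the
    // relaxed store below.
    std::atomic_thread_fence(std::memory_order_release);
    FenceReady.store(true, std::memory_order_relaxed);
}

int FenceConsume()
{
    while(!FenceReady.load(std::memory_order_relaxed)) {}
    // Import (acquire) barrier: nothing below can be observed before the
    // load above.
    std::atomic_thread_fence(std::memory_order_acquire);
    return FenceValue;
}
```

A `std::memory_order_seq_cst` fence would give the combined (fence) behavior from the last diagram.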


As the implementation differs depending on the architecture, it's a good habit to wrap this functionality into platform-agnostic functions, as seen in the previous examples.
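Such a wrapper might look like the following sketch (the macro name is my own; each branch maps to that platform's full barrier primitive):

```cpp
#include <cassert>

// Illustrative platform-agnostic full-barrier wrapper.
#if defined(_MSC_VER)
    #include <windows.h>
    #define FullMemoryBarrier() MemoryBarrier()
#elif defined(__GNUC__)
    #define FullMemoryBarrier() __sync_synchronize()
#endif

int BarrierDemo()
{
    int Data = 42;
    FullMemoryBarrier(); // Data is written to memory before anything that follows.
    return Data;
}
```

Call sites then stay portable, and only the wrapper needs per-platform attention.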

On Windows, the interlocked functions act as full memory barriers (fences). Each of these functions also comes in variants with the words “Acquire” and “Release” in their names, which generate the corresponding barrier semantics on platforms that support them (such as Itanium).

For example, the InterlockedCompareExchange function also comes as InterlockedCompareExchangeAcquire and InterlockedCompareExchangeRelease.

On PowerPC architectures, such as the Xbox 360 and PS3, memory barriers are implemented using the lwsync instruction. There is no distinction between the different semantics, so it acts as a read/write memory barrier.

Synchronization Primitives

Synchronization primitives require deep knowledge of the architecture’s memory model and must use memory barriers to prevent memory ordering issues. As a general rule, programmers should never write their own locks, except in rare situations with special requirements.

Spin Lock Example
Let’s take a look at the spinlock I showed in my Lessons learnt while spinning post:

class SpinLock
{
    volatile tInt LockSem;

public:
    FORCEINLINE SpinLock()
    : LockSem(0)
    {
    }

    FORCEINLINE tBool Lock()
    {
        // Atomically swap the lock variable with 1 if it's currently equal to 0.
        if(!InterlockedCompareExchange(&LockSem, 1, 0))
        {
            // We successfully acquired the lock.
            return true;
        }
        return false;
    }

    FORCEINLINE void Unlock()
    {
        LockSem = 0;
    }
};

An import barrier is required in Lock() once the lock is acquired, to make sure other threads don’t see upcoming changes before seeing that the data is locked. Also, an export barrier is required in Unlock(), before clearing LockSem, to make sure that all pending memory operations complete before the lock is released.
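In C++11 those two barriers can be expressed directly in the atomic operations. A minimal equivalent sketch (my own, not the original class): test_and_set with acquire ordering supplies the import barrier on lock, and clear with release ordering supplies the export barrier on unlock.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SpinLock11
{
    std::atomic_flag Flag = ATOMIC_FLAG_INIT;

public:
    // Spin until the flag was previously clear; the acquire ordering acts
    // as the import barrier described above.
    void Lock()   { while(Flag.test_and_set(std::memory_order_acquire)) {} }

    // The release ordering acts as the export barrier: all writes made
    // inside the critical section are visible before the flag clears.
    void Unlock() { Flag.clear(std::memory_order_release); }
};
```

Note that a plain `LockSem = 0;` store, as in the class above, provides no such guarantee by itself on weakly-ordered hardware.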

The ‘volatile’ C++ Keyword

A quick word on the volatile C++ keyword: according to Microsoft, volatile variables act as memory barriers: “A write to a volatile object (volatile write) has Release semantics; a read of a volatile object (volatile read) has Acquire semantics”.

Since this is Microsoft-specific, we cannot rely on this behavior on other platforms.


  • Always use the OS-provided synchronization primitives when locks are required.
  • The volatile keyword should only be used to prevent compiler optimizations, such as keeping a variable in a register while waiting for it to change.
  • Make sure you understand memory barriers whenever writing multi-core programs, especially lockless algorithms.
  • When in doubt, use a full barrier (fence).
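To illustrate the second bullet, a small sketch of my own: volatile only stops the compiler from caching the flag in a register; it does not order memory operations across cores.

```cpp
// Without volatile, the compiler could legally hoist the read of
// StopRequested out of the loop and spin forever. volatile forces a fresh
// read on each iteration, but adds no CPU memory barrier.
volatile bool StopRequested = false;

int SpinUntilStopped()
{
    int Iterations = 0;
    while(!StopRequested)
        ++Iterations;
    return Iterations;
}
```

If the loop body also read shared data published by the stopping thread, an import barrier (or an atomic with acquire semantics) would still be needed.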