Home > Programming > Understanding Memory Ordering

Understanding Memory Ordering

Introduction

In a previous post on atomic operations, I skipped an important topic, which is memory ordering issues.

Behind the hood, the processor might reorder memory operations (reads and writes) for many reasons. This is normally not a problem on programs running on single-core machines. On the other hand, multi-threaded programs running on multi-cores machines can suffer from that: the sequence in which reads and writes operations are performed by the processor can be different from what the order of execution of those operations are from the programmer’s point of view.

Simple Example
Let’s take a look at a simple example:

volatile bool Ready = false;
int Value = 0;

// Thread A
while(!Ready) {}
printf("%d", Value);

// Thread B
Value = 1;
Ready = true;

The expected value to be printed is 1, obviously. But what if Ready end-up written in memory before Value does? The value of Value will not be 1, resulting in random behavior.

Production Code Example
In a previous game I’ve worked on, which was on Xbox 360, we had this piece of code that registered strings in a hash-table using the CRC of the string. Only the main thread was allowed to create new strings in the table, but all threads were allowed to look-up for existing strings.

// Called to register a new string in the hash-table, returning the unique CRC value.
tU32 RegisterString(tChar* String)
{
    // Allocate a new StringDesc from the freelist.
    StringDesc* NewStringDesc = FreeStringDescs.Allocate();

    // Set the string desc info.
    NewStringDesc->String = String;
    NewStringDesc->CRC = StrHash(String);

    // Prepare the entry to be registered into the hash-table.
    tU32 HashBucket = NewStringDesc->CRC & (STRING_DESC_SIZE - 1);
    NewStringDesc->Next = Hash[HashBucket];

    // A memory barrier is required here on some architectures, before registering
    // the new entry into the hash-table. Otherwise, other threads might access the
    // entry before String/CRC/Next members gets written in memory, resulting in
    // random crashes.
    ExportBarrier();

    // Register the allocation in the hashtable. At this point, this entry is
    // visible to all other threads.
    Hash[HashBucket] = NewStringDesc;

    return NewStringDesc->CRC;
}

It took a while to figure-out why we had random crashes with very-low reproduction rate before realizing that we needed a way to ensure that the new entry values were written into memory before actually registering it into the shared hash-table.

As you can see in line 19, a memory barrier was used.

Memory Barriers

Memory barriers are a set of processor instructions used to force pending memory operations to complete before continuing. There are three types of barrier semantics: acquire, release and fence barriers.

Acquire Semantics
Whenever you use atomic operations to gain access to some data, your code must make sure that other processors sees the lock before any other changes that will be made. This is what we call acquire semantics, because the code is trying to acquire ownership of some data. This is also referred as read barriers or import barriers.

  operation 1
  operation 2
<-operation 3-Acquire-> 3 is visible before 4-5
  operation 4
  operation 5

Release Semantics
On the other hand, if you use atomic operations to release recently modified data, your code must make sure that the new data is visible before releasing it. This was exactly the case with the previous string-table example. This is what we call release semantics, because the code is trying to release ownership of some data. This is also referred as write barriers or export barriers.

  operation 1
  operation 2
<-operation 3-Release-> 1-2 are visible before 3
  operation 4
  operation 5

Fence Semantics
A fence semantics combines both acquire and release semantics behavior.

  operation 1
  operation 2
<-operation 3-Fence-> 1-2 are visible before 3, 3 is visible before 4-5
  operation 4
  operation 5

Implementations

As the implementation differs depending on the architecture, it’s a good habit to wrap those functionality into platform-agnostic functions, as seen in the previous examples.

Windows
On Windows, the atomic functions acts as full memory barriers (fences). Each of these functions comes with additional functions with the word “Acquire” and “Release” in their names, that can generate different types of barriers semantics depending if the target platform supports it (such as Itanium).

For example, the InterlockedCompareExchange function also comes as InterlockedCompareExchangeAcquire and InterlockedCompareExchangeRelease.

PowerPC
On PowerPC architectures, such as the Xbox 360 and PS3, memory barriers are implemented using the lwsync instruction. There is no distinction between the different semantics, so it acts as a read/write memory barrier.

Synchronization Primitives

Synchronization primitives requires deep knowledge of the architecture’s memory model and must use memory barriers to prevent memory ordering issues. As a general rule, a programmer should never write its own locks, except in rare situations where special requirements are made.

Spin Lock Example
Let’s take a look at the spinlock I shown in my Lessons learnt while spinning post:

class SpinLock
{
    volatile tInt LockSem;

public:
    FORCEINLINE SpinLock()
    : LockSem(0)
    {}

    FORCEINLINE tBool Lock()
    {
        while(1)
        {
            // Atomically swap the lock variable with 1 if it's currently equal to 0
            if(!InterlockedCompareExchange(&LockSem, 1, 0))
            {
                // We successfully acquired the lock
                ImportBarrier();
                return;
            }
        }
    }

    FORCEINLINE void Unlock()
    {
        ExportBarrier();
        LockSem = 0;
    }
};

An import barrier is required on line 18 when the lock is acquired to make sure other threads don’t see upcoming changes before seeing that the data is locked. Also, an export barrier is required on line 26 to make sure that all the pending memory operations completes before releasing the lock.

The ‘volatile’ C++ Keyword

A quick word on the volatile C++ keyword: according to Microsoft, volatile variables acts as memory barriers: “A write to a volatile object (volatile write) has Release semantics; A read of a volatile object (volatile read) has Acquire semantics”.

Since this is Microsoft-specific, we cannot rely on this behavior on other platforms.

Conclusion

  • Always use the OS-provided synchronizations primitives when locks are required.
  • The volatile keyword should only be used to prevent compiler optimizations, such as keeping a variable in a register while waiting it to changes.
  • Make sure you understand memory barriers whenever implementing multi-cores programs, especially lockless algorithms.
  • When in doubt, use a full barrier (fence).
About these ads
  1. March 8, 2012 at 9:44 am

    In your production code example, how did you finally manage to track down the cause of the random crashes to the missing barrier?

    • jfdube
      March 8, 2012 at 10:10 am

      After analyzing some crashdumps generated by the testers, I finally came to the conclusion that the entry was registered in the hash-table with invalid data. The only way it could happen was that the Next pointer was seen before the rest of the data, so I added a memory barrier and waited a couple of days to see if the testers would hit the crash again, which never happened.

  2. March 8, 2012 at 4:28 pm

    Part of the issue comes from he compilers themselves, it isn’t just the CPU. That is, the compiler is also free to reorder instructions and it usually will for optimization. Thus the synchronization instructions don’t just tell the CPU not to reorder, but also tell the compiler not to reorder.

    • May 4, 2012 at 2:31 pm

      And often times the memory barriers are different for each. Ie, one can insert a compiler fence but not a processor fence and vice-versa.

  3. caleb
    March 8, 2012 at 9:12 pm

    As said in the article, InterlockedCompareExchange is already a full barrier, why an acquire barrier is still needed at 18?

    • jfdube
      March 9, 2012 at 6:41 am

      In my implementation, ImportBarrier() and ExportBarrier() are simply wrappers around platform-specific barriers. On Win32, they are only empty-stubs (ie does nothing), since atomic operations acts as full barriers. On PowerPC, they are both implemented using the __lwsync() intrinsic.

      • February 10, 2013 at 2:40 pm

        Be careful with that. Sometimes barriers are used without atomic operations. The x86 memory model still prevents the processor from doing most reordering, but the compiler can cause problems. Therefore the x86 implementation should be a compiler intrinsic to prevent compiler reordering — it should not be a NOP.

        • jfdube
          February 11, 2013 at 12:20 pm

          I was talking in the context of the SpinLock example, but you’re completely right about that in contexts without atomic operations. Thanks for you comment!

  4. March 11, 2012 at 11:21 pm

    Great read, thanks! I wish you continue sharing your experience on such interesting subjects.

  5. February 10, 2013 at 2:39 pm

    You say, in regards to lwsync “There is no distinction between the different semantics, so it acts as a full memory barrier (fence).”

    That is incorrect. lwsync works as a read barrier (it stops reads from passing reads) and as a write barrier (it stops writes from passing writes) but it is not a #StoreLoad barrier. That is, a later read may be processed in front of an earlier write. Therefore it is not a full memory barrier.

    In general, #StoreLoad barriers are extremely expensive. That’s why the x86 memory model allows that type of reordering (while normally prohibiting all other types) and that’s why lwsync doesn’t prevent it. You need full sync to get a full barrier. See this article for more details:

    http://preshing.com/20120913/acquire-and-release-semantics

    • jfdube
      February 11, 2013 at 12:29 pm

      Thanks for the comment Bruce, I updated the text.

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: