Maybe it's just me, but the example in the man 2
page for membarrier
seems pointless.
Basically, membarrier()
is an asynchronous memory barrier that, given two coordinating pieces of code (let's call then fast path and slow path) allows you to move all the hardware cost of the the barrier to the slow path, and leave the fast path only with a compiler barrier1. There are a few different ways to accomplish the membarrier
behavior, such as sending an IPI to each involved processor or wait for the code running on each processor to be de-scheduled - but the exact implementation details aren't important here.
Now, here's the example transformation given in the man page:
Original Code
static volatile int a, b;
static void
fast_path(void)
{
int read_a, read_b;
read_b = b;
asm volatile ("mfence" : : : "memory");
read_a = a;
/* read_b == 1 implies read_a == 1. */
if (read_b == 1 && read_a == 0)
abort();
}
static void
slow_path(void)
{
a = 1;
asm volatile ("mfence" : : : "memory");
b = 1;
}
Transformed Code
(some syscall and init boilerplate omitted)
static volatile int a, b;
static void
fast_path(void)
{
int read_a, read_b;
read_b = b;
asm volatile ("" : : : "memory");
read_a = a;
/* read_b == 1 implies read_a == 1. */
if (read_b == 1 && read_a == 0)
abort();
}
static void
slow_path(void)
{
a = 1;
membarrier(MEMBARRIER_CMD_SHARED, 0);
b = 1;
}
Here the slow_path
is doing two writes (a
, then b
) separated by a barrier, and the fast_path
is doing two reads (b
, then a
) also separated by a barrier.
The x86 memory model, however, doesn't permit load-load or store-store reordering! So as far as I can tell, membarrier()
isn't needed at all in this scenarios and also the mfence
wasn't needed in the original code. It seems that simple compiler barriers would have sufficed in both places2.
An example that actually makes sense, IMO, should have a store followed by a load, separated by a barrier in the fast path.
Am I missing something?
1 A compiler barrier prevents movement of loads or stores across it by the compiler (and depending on the implementation may force some register values to memory), but doesn't emit any type of atomic operation or memory fence, so avoid the often order-of-magnitude slowdown inherent in those instructions.
2 Of course on weaker platforms, where load-load reordering can occur, the example might make sense, but the example is explicitly x86 and membarrier()
is only implemented on x86.
You are correct. On x86, this particular use of
membarrier()
is completely useless. In fact, this exact example is the very first one given in the Intel SDM to illustrate x86 memory ordering rules: