When is a compiler-only memory barrier (such as std::atomic_signal_fence) useful?


The notion of a compiler fence often comes up when I'm reading about memory models, barriers, ordering, atomics, etc., but normally it's in the context of also being paired with a CPU fence, as one would expect.

Occasionally, however, I read about fence constructs which only apply to the compiler. An example of this is the C++11 std::atomic_signal_fence function, which states at cppreference.com:

std::atomic_signal_fence is equivalent to std::atomic_thread_fence, except no CPU instructions for memory ordering are issued. Only reordering of the instructions by the compiler is suppressed as order instructs.

I have five questions related to this topic:

  1. As implied by the name std::atomic_signal_fence, is an asynchronous interrupt (such as a thread being preempted by the kernel to execute a signal handler) the only case in which a compiler-only fence is useful?

  2. Does its usefulness apply to all architectures, including strongly-ordered ones such as x86?

  3. Can a specific example be provided to demonstrate the usefulness of a compiler-only fence?

  4. When using std::atomic_signal_fence, is there any difference between using acq_rel and seq_cst ordering? (I would expect it to make no difference.)

  5. This question might be covered by the first question, but I'm curious enough to ask specifically about it anyway: Is it ever necessary to use fences with thread_local accesses? (If it ever would be, I would expect compiler-only fences such as atomic_signal_fence to be the tool of choice.)

Thank you.


There are 2 answers

2
MikeTusar (best answer)

To answer all 5 questions:


1) A compiler fence (by itself, without a CPU fence) is only useful in two situations:

  • To enforce memory order constraints between a single thread and an asynchronous interrupt handler bound to that same thread (such as a signal handler).

  • To enforce memory order constraints between multiple threads when it is guaranteed that every thread will execute on the same CPU core. In other words, the application will only run on single core systems, or the application takes special measures (through processor affinity) to ensure that every thread which shares the data is bound to the same core.


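2) The memory model of the underlying architecture, whether strongly or weakly ordered, has no bearing on whether a compiler fence is needed. The compiler can reorder memory accesses before any instruction ever reaches the CPU, so even on x86 a compiler fence is still required in the situations listed under 1).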


3) Here is pseudo-code which demonstrates the use of a compiler fence, by itself, to sufficiently synchronize memory access between a thread and an async signal handler bound to the same thread:

void async_signal_handler()
{
    if ( is_shared_data_initialized )
    {
        compiler_only_memory_barrier(memory_order::acquire);
        ... use shared_data ...
    }
}

int main()
{
    // initialize shared_data ...
    shared_data->foo = ...
    shared_data->bar = ...
    shared_data->baz = ...
    // shared_data is now fully initialized and ready to use
    compiler_only_memory_barrier(memory_order::release);
    is_shared_data_initialized = true;
}

Important Note: This example assumes that async_signal_handler is bound to the same thread that initializes shared_data and sets the is_shared_data_initialized flag, which means the application is either single-threaded or sets its thread signal masks accordingly. Otherwise, the compiler fence would be insufficient, and a CPU fence would also be needed.
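For readers who want something compilable, here is a minimal C++11 sketch of the same pattern using std::atomic_signal_fence. Several things in it are my assumptions rather than part of the pseudo-code: the SharedData struct and its values, the choice of SIGTERM, the run_signal_fence_demo helper, and the use of a relaxed std::atomic<int> for the flag (a plain bool would be a data race under the C++11 rules). std::raise delivers the signal synchronously, so it only stands in for a truly asynchronous signal, which is the case where the fences actually matter.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical payload type; the field names mirror the pseudo-code above.
struct SharedData { int foo; int bar; int baz; };

// Assumption: std::atomic<int> is lock-free on this target, which is what
// makes it usable from inside a signal handler.
static_assert(ATOMIC_INT_LOCK_FREE == 2, "lock-free std::atomic<int> required");

static SharedData shared_data{};                        // plain, non-atomic data
static std::atomic<int> is_shared_data_initialized{0};  // flag, relaxed accesses only
static std::atomic<int> handler_result{-1};             // what the handler observed

extern "C" void async_signal_handler(int /*signal_number*/)
{
    if (is_shared_data_initialized.load(std::memory_order_relaxed))
    {
        // Compiler-only acquire fence: the reads of shared_data below
        // may not be moved above the flag load by the compiler.
        std::atomic_signal_fence(std::memory_order_acquire);
        handler_result.store(shared_data.foo + shared_data.bar + shared_data.baz,
                             std::memory_order_relaxed);
    }
}

// Hypothetical driver: initializes the data, publishes the flag, then
// raises the signal on the same thread and returns what the handler saw.
int run_signal_fence_demo()
{
    auto previous_handler = std::signal(SIGTERM, async_signal_handler);
    assert(previous_handler != SIG_ERR);
    (void)previous_handler;

    shared_data.foo = 1;
    shared_data.bar = 2;
    shared_data.baz = 3;
    // Compiler-only release fence: the writes above may not be moved
    // below the flag store by the compiler. No CPU fence is emitted.
    std::atomic_signal_fence(std::memory_order_release);
    is_shared_data_initialized.store(1, std::memory_order_relaxed);

    std::raise(SIGTERM);            // handler runs on this same thread
    std::signal(SIGTERM, SIG_DFL);  // restore the default disposition
    return handler_result.load(std::memory_order_relaxed);
}
```

Calling run_signal_fence_demo() from main returns the sum the handler computed, or -1 if the handler found the flag unset. The same caveat as above applies: if the handler could run on a different thread, std::atomic_thread_fence would be needed instead.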


4) They should be the same. acq_rel and seq_cst should both result in a full (bidirectional) compiler fence, with no fence-related CPU instructions emitted. The concept of "sequential consistency" only comes into play when multiple cores and threads are involved, and atomic_signal_fence only pertains to one thread of execution.


5) No, unless the thread-local data is accessed from an asynchronous signal handler, in which case a compiler fence might be necessary. Otherwise, fences should never be needed with thread-local data, since the compiler (and CPU) are only allowed to reorder memory accesses in ways that do not change the observable behavior of the program with respect to its sequence points from a single-threaded perspective. One can logically think of thread-local statics in a multi-threaded program as being the same as global statics in a single-threaded program. In both cases, the data is only accessible from a single thread, which prevents a data race from occurring.

0
user2949652

There are actually some nonportable but useful C programming idioms where compiler fences are useful, even in multicore code (particularly in pre-C11 code). The typical situation is where the program is doing some accesses that would normally be made volatile (because they are to shared variables), but you want the compiler to be able to move the accesses around. If you know that the accesses are atomic on the target platform (and you take some other precautions), you can leave the accesses nonvolatile, but contain code movement using compiler barriers.
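To make the idiom concrete, here is a rough sketch of what such legacy code tends to look like. Everything in it is illustrative: COMPILER_BARRIER, publish, try_consume, and legacy_barrier_demo are hypothetical names, and the empty asm statement with a "memory" clobber is a GCC/Clang extension (the fallback uses std::atomic_signal_fence). Note that under the C++11 memory model, sharing these plain variables between threads is a data race, and on weakly-ordered multicore hardware the compiler barrier alone does not order the accesses; the demo function therefore only exercises it from a single thread.

```cpp
#include <atomic>
#include <cassert>

// Compiler-only barrier: no CPU instruction is emitted, but the compiler
// may not move memory accesses across it.
#if defined(__GNUC__) || defined(__clang__)
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
#define COMPILER_BARRIER() std::atomic_signal_fence(std::memory_order_seq_cst)
#endif

// Plain, non-volatile variables. The legacy idiom assumes that aligned
// int loads and stores are atomic on the target platform.
static int payload = 0;
static int ready = 0;

void publish(int value)
{
    payload = value;
    COMPILER_BARRIER();  // keep the payload store ahead of the flag store
    ready = 1;
}

bool try_consume(int& out)
{
    if (!ready)
        return false;
    COMPILER_BARRIER();  // keep the payload load behind the flag load
    out = payload;
    return true;
}

// Single-threaded walk-through of the two functions.
void legacy_barrier_demo()
{
    int out = 0;
    assert(!try_consume(out));  // nothing published yet
    publish(42);
    assert(try_consume(out) && out == 42);
}
```

Between barriers the compiler remains free to reorder, combine, or cache the plain accesses, which is exactly the freedom that volatile would have taken away.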

Thankfully, most programming like this is made obsolete with C11/C++11 relaxed atomics.
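As a hedged sketch of what the C++11 replacement might look like: relaxed atomic loads and stores give the atomicity guarantee portably, while still letting the compiler move the accesses around, and explicit fences contain that movement. The names publish_relaxed, try_consume_relaxed, and relaxed_atomics_demo are hypothetical. I use std::atomic_thread_fence here so the code is also correct across cores; std::atomic_signal_fence would be enough only in the single-core or signal-handler cases discussed in the other answer.

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> payload_value{0};
static std::atomic<int> ready_flag{0};

void publish_relaxed(int value)
{
    // Relaxed store: atomic, but imposes no ordering by itself.
    payload_value.store(value, std::memory_order_relaxed);
    // Release fence: the store above is ordered before the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    ready_flag.store(1, std::memory_order_relaxed);
}

bool try_consume_relaxed(int& out)
{
    if (ready_flag.load(std::memory_order_relaxed) == 0)
        return false;
    // Acquire fence: the payload load below is ordered after the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    out = payload_value.load(std::memory_order_relaxed);
    return true;
}

// Single-threaded walk-through; the functions are also safe to call
// from two different threads.
void relaxed_atomics_demo()
{
    int out = 0;
    assert(!try_consume_relaxed(out));  // nothing published yet
    publish_relaxed(7);
    assert(try_consume_relaxed(out) && out == 7);
}
```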