The notion of a compiler fence often comes up when I'm reading about memory models, barriers, ordering, atomics, etc., but normally it's in the context of also being paired with a CPU fence, as one would expect.
Occasionally, however, I read about fence constructs which apply only to the compiler. An example of this is the C++11 std::atomic_signal_fence function, which cppreference.com describes as follows:
std::atomic_signal_fence is equivalent to std::atomic_thread_fence, except no CPU instructions for memory ordering are issued. Only reordering of the instructions by the compiler is suppressed as order instructs.
I have five questions related to this topic:
1) As implied by the name std::atomic_signal_fence, is an asynchronous interrupt (such as a thread being preempted by the kernel to execute a signal handler) the only case in which a compiler-only fence is useful?
2) Does its usefulness apply to all architectures, including strongly-ordered ones such as x86?
3) Can a specific example be provided to demonstrate the usefulness of a compiler-only fence?
4) When using std::atomic_signal_fence, is there any difference between using acq_rel and seq_cst ordering? (I would expect it to make no difference.)
5) This question might be covered by the first, but I'm curious enough to ask about it specifically: is it ever necessary to use fences with thread_local accesses? (If it ever were, I would expect compiler-only fences such as atomic_signal_fence to be the tool of choice.)
Thank you.
To answer all 5 questions:
1) A compiler fence (by itself, without a CPU fence) is only useful in two situations:
To enforce memory-ordering constraints between a single thread and an asynchronous interrupt handler bound to that same thread (such as a signal handler).
To enforce memory-ordering constraints between multiple threads when it is guaranteed that every thread will execute on the same CPU core. In other words, the application will only run on single-core systems, or it takes special measures (through processor affinity) to ensure that every thread which shares the data is bound to the same core.
2) The memory model of the underlying architecture, whether strongly- or weakly-ordered, has no bearing on whether a compiler fence is needed in a given situation.
3) Here is code which demonstrates the use of a compiler fence, by itself, to sufficiently synchronize memory accesses between a thread and an async signal handler bound to the same thread:
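(A minimal sketch, assuming a POSIX-style SIGUSR1 signal; shared_data, is_initialized, and async_signal_handler follow the note below, while the raise() call and the observed variable are illustrative additions.)

```cpp
#include <atomic>
#include <csignal>
#include <cstdio>

static int shared_data = 0;                      // ordinary, non-atomic data
static std::atomic<bool> is_initialized{false};  // flag checked by the handler
static volatile std::sig_atomic_t observed = 0;

void async_signal_handler(int)
{
    // A relaxed load suffices: the handler runs on the same thread as the
    // writer, so only compiler reordering needs to be suppressed.
    if (is_initialized.load(std::memory_order_relaxed)) {
        // Acquire signal fence: stops the compiler from hoisting the read
        // of shared_data above the flag check. Emits no CPU instruction.
        std::atomic_signal_fence(std::memory_order_acquire);
        observed = shared_data;  // safe: initialization is complete
    }
}

int main()
{
    std::signal(SIGUSR1, async_signal_handler);

    shared_data = 42;  // initialize the data first

    // Release signal fence: stops the compiler from sinking the write to
    // shared_data below the store to the flag. Emits no CPU instruction.
    std::atomic_signal_fence(std::memory_order_release);
    is_initialized.store(true, std::memory_order_relaxed);

    std::raise(SIGUSR1);  // the handler executes on this same thread
    std::printf("observed = %d\n", static_cast<int>(observed));
}
```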
Important note: This example assumes that async_signal_handler is bound to the same thread that initializes shared_data and sets the is_initialized flag, which means either the application is single-threaded or it sets thread signal masks accordingly. Otherwise, the compiler fence would be insufficient, and a CPU fence would also be needed.

4) They should be the same. acq_rel and seq_cst should both result in a full (bidirectional) compiler fence, with no fence-related CPU instructions emitted. The concept of "sequential consistency" only comes into play when multiple cores and threads are involved, and atomic_signal_fence only pertains to one thread of execution.
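To illustrate, here is a sketch one might feed to a compiler explorer (the exact codegen is an implementation detail, but mainstream compilers treat both orderings identically here):

```cpp
#include <atomic>

void fence_acq_rel()
{
    // Typically compiles to zero instructions: purely a compiler barrier.
    std::atomic_signal_fence(std::memory_order_acq_rel);
}

void fence_seq_cst()
{
    // Same result: no fence instruction is emitted. Contrast with
    // std::atomic_thread_fence(std::memory_order_seq_cst), which emits
    // a real CPU fence (e.g. mfence on x86).
    std::atomic_signal_fence(std::memory_order_seq_cst);
}
```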
5) No, unless the thread-local data is accessed from an asynchronous signal handler, in which case a compiler fence might be necessary. Otherwise, fences should never be needed with thread-local data, since the compiler (and CPU) may only reorder memory accesses in ways that do not change the observable behavior of the program, with respect to its sequence points, from a single-threaded perspective. One can think of thread-local statics in a multi-threaded program as equivalent to global statics in a single-threaded program: in both cases, the data is accessible from only a single thread, which prevents a data race from occurring.
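As a sketch of that equivalence, assuming a hypothetical per-thread counter:

```cpp
#include <cassert>
#include <thread>

thread_local int counter = 0;  // one independent instance per thread

void work()
{
    // No fences needed: only this thread can observe its own counter, so
    // the ordinary single-threaded "as-if" reordering rules already apply.
    for (int i = 0; i < 1000; ++i)
        ++counter;
    assert(counter == 1000);  // always holds; no data race is possible
}

int main()
{
    std::thread t1(work), t2(work);  // behaves like two single-threaded programs
    t1.join();
    t2.join();
}
```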