How does x86 handle store conditional instructions?


I am trying to find out what an x86 processor does when it encounters a store-conditional instruction. For instance, does it stall the front end of the pipeline and wait for the reorder buffer (ROB) to drain before it stops stalling the front end and executes the SC? Basically, does it force the processor to become non-speculative?


There are 3 answers

Answer from ebo

A (generic) x86 processor does none of the things you mentioned. It just fetches one instruction after another and executes them.

Everything else is handled transparently and heavily depends on which processor you are looking at, so there is no generic answer to your question.

If you are interested in techniques for working around stalls, start at the Wikipedia page on x86 (register renaming, to mention one: results from the mispredicted path are simply thrown away).

Answer from Nathan Fellman

I'm guessing that you're referring to the CMOVcc instructions.

I don't know about older x86 processors, but modern ones (ever since they became speculative and out of order) implement conditional stores as:

old_value = mem[dest_address]
if (condition)
    mem[dest_address] = new_value
else
    mem[dest_address] = old_value

The condition part can be implemented in hardware as a 2-to-1 multiplexer that selects between the new value and the old one:

      cond
    |\ |
----| \|
new |  \
    |   |    dest
    |   |---------
    |   |     |
  __|  /      |
 |  | /       |
 |  |/        |
 |____________|

So there's no need to break speculation. A store does in fact take place; the condition only determines whether the data written is the old value or the new one.
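
In software, the same if-conversion can be written with cmov; a minimal sketch (registers are illustrative, and note that rewriting the old value is not safe if other threads may write the same location):

    mov     eax, [rdi]      ; old value
    test    esi, esi        ; evaluate the condition
    cmovnz  eax, edx        ; select the new value (EDX) if the condition holds
    mov     [rdi], eax      ; unconditional store of either old or new value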

Answer from Peter Cordes

Unlike ARM and many other RISCs, x86 doesn't have load-linked / store-conditional; architecturally it has stuff like lock add byte [rdi], 1 or lock cmpxchg [rdi], ecx for atomic RMW. See Is incrementing an int effectively atomic in specific cases? for some details of the semantics and CPU architecture.
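
For example (NASM-style Intel syntax; a minimal sketch, register choices are illustrative):

    lock add byte [rdi], 1      ; atomic increment of the byte at [rdi]
    lock cmpxchg [rdi], ecx     ; if [rdi] == EAX: store ECX and set ZF; else load [rdi] into EAX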

See also x86 equivalent for LWARX and STWCX - arbitrary atomic RMW operations can be synthesized with a CAS (lock cmpxchg) retry loop. Unlike LL/SC, CAS is susceptible to ABA problems, but it is the other major building block for atomic operations.
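
A minimal sketch of such a retry loop, implementing an atomic OR as the arbitrary RMW (RDI holds the address, ESI the bits to set; registers are illustrative):

        mov     eax, [rdi]          ; load the current value
    retry:
        mov     ecx, eax
        or      ecx, esi            ; compute the desired new value
        lock cmpxchg [rdi], ecx     ; if [rdi] still == EAX, store ECX (ZF=1)
        jnz     retry               ; on failure EAX was reloaded with the current [rdi]; try again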


Internally on modern x86 CPUs, this probably works by running a load uop that also "locks" that cache line. (Instead of arming a monitor so a later SC will fail, the "cache lock" delays responding to MESI requests until a store-unlock, preventing anything that would have made an SC fail on an LL/SC machine.)

Taking a cache lock on just that line in MESI Modified state (instead of the traditional bus lock) depends on the memory being cacheable, and on the access being aligned or at least not split across a cache-line boundary.


x86's cmov instruction only has one form, with a register destination, not memory: cmovcc reg, reg/mem. Even with a memory source, it's an unconditional load feeding an ALU select operation, so it will segfault on a bad address even if the condition is false. (Unlike ARM predicated instructions, where the whole instruction is NOPed on a false condition.)
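
A minimal sketch of that always-load behavior (registers illustrative):

    test    esi, esi
    cmovz   eax, [rdi]      ; [rdi] is loaded even when ESI != 0, so a bad RDI faults either way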

I guess you could say lock cmpxchg [mem], reg is a conditional store, but the only possible condition is whether the old contents of memory match AL/AX/EAX/RAX. https://www.felixcloutier.com/x86/cmpxchg

rep stosb/w/d/q is also a conditional store, if you arrange for RCX to be 0 or 1 (e.g. xor ecx,ecx / set FLAGS / setcc cl); microcode branching isn't branch-predicted so it's a bit different from normal branching.
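
A sketch of that trick (the condition and registers are illustrative):

    xor     ecx, ecx
    cmp     esi, edx        ; set FLAGS from whatever condition you want
    setb    cl              ; RCX = 1 if the condition is true, else 0
    rep stosb               ; store AL to [rdi] RCX times, i.e. 0 or 1 bytes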

AVX vmaskmovps or AVX-512 masked stores are truly conditional stores, based on a mask condition. My answer on another Q&A about cmov discusses the conditional-load equivalents of these, along with the fact that cmov is not a conditional load, it's an ALU select that needs all 3 inputs (FLAGS and 2 integers).
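
For example (a sketch; the compare that builds the mask is illustrative):

    vmaskmovps [rdi], ymm1, ymm0    ; AVX: store elements of ymm0 where ymm1's sign bit is set
    vpcmpgtd   k1, zmm1, zmm2       ; AVX-512: build a mask register from a vector compare
    vmovdqu32  [rdi]{k1}, zmm0      ; store only elements whose k1 bit is set; masked elements can't fault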

Conditional stores are rare in most ISAs other than the SC part of a LL/SC pair. 32-bit ARM is an exception to the rule; see Why are conditionally executed instructions not present in later ARM instruction sets? for why AArch64 dropped it.


AVX and AVX-512 masked stores do not stall the pipeline. See https://agner.org/optimize/ and https://uops.info/ for some performance numbers, plus Intel's optimization manual. They suppress faults on masked elements. Store-forwarding from them if you reload before they commit to L1d might stall that load, but not the whole pipeline.


Intel APX (Advanced Performance Extensions) adds REX2 and EVEX prefixes for legacy integer instructions like sub, and new encodings of cmov that actually do suppress the load (and any fault) when the condition is false, plus a conditional-store version. These use the mnemonic CFCMOVcc, Conditionally Faulting CMOV. Intel finally decided to make an extension that requires 64-bit mode, using some of the encoding space freed up by removing BCD and other opcodes.
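
A sketch of what that might look like, assuming an APX-capable CPU and an assembler that supports the mnemonics (Intel syntax; registers illustrative):

    cfcmovz rax, [rdi]      ; load only if ZF=1; a false condition suppresses the load and any fault
    cfcmovz [rdi], rax      ; conditional-store form: write RAX only if ZF=1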

Presumably the hardware handles conditional load/store similar to AVX-512 masking.