As I understand it, when a CPU speculatively executes a piece of code, it "backs up" the register state before switching to the speculative branch, so that if the prediction turns out to be wrong (rendering the branch useless), the register state can be safely restored without damaging the "state".
So, my question is: can a speculatively executed CPU branch contain opcodes that access RAM?
I mean, accessing RAM isn't an "atomic" operation: one simple opcode reading from memory can cause an actual RAM access if the data isn't currently in the CPU cache, which can turn out to be an extremely time-consuming operation from the CPU's perspective.
And if such access is indeed allowed in a speculative branch, is it only for read operations? Because I can only assume that reverting a write operation, depending on its size, might turn out to be extremely slow and tricky if a branch is discarded and a "rollback" is performed. And surely read/write operations are supported, at least to some extent, since the registers themselves, on some CPUs, are physically located in the CPU cache, as I understand it.
So, maybe a more precise formulation would be: what are the limitations of a speculatively executed piece of code?
The cardinal rules of speculative out-of-order (OoO) execution are:

1. Preserve the illusion of instructions running one at a time, in program order.
2. Keep speculation contained to things that can be rolled back if mis-speculation is detected, and that can't be observed by other cores to hold a wrong value. Physical registers and the back-end itself, yes; cache, no, because cache is coherent with other cores, so stores must not commit to it until they're known to be non-speculative.
OoO exec is normally implemented by treating everything as speculative until retirement. Every load or store could fault, every FP instruction could raise an FP exception. Branches are special (compared to exceptions) only in that branch mispredicts are not rare, so a special mechanism to handle early detection and roll-back for branch misses is helpful.
Yes, cacheable loads can be executed speculatively and OoO because they have no side effects.
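For example, in C++ terms (a sketch, assuming the compiler keeps this as an actual branch rather than converting it to a cmov):

    // A load on a predicted path can start before the branch condition is known.
    // If the predictor guesses "taken", the CPU may issue the load of *p (possibly
    // missing in cache and going all the way to DRAM) while `cond` is still being
    // computed. On a mispredict the result is simply thrown away; the only lasting
    // effect is that the line may now be in cache.
    int speculative_load_example(bool cond, const int* p) {
        int result = 0;
        if (cond) {          // predicted before `cond` is actually known
            result = *p;     // cacheable load: no side effects, safe to run speculatively
        }
        return result;
    }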
Store instructions can also be executed speculatively, thanks to the store buffer. The actual execution of a store just writes the address and data into the store buffer. (Related: Size of store buffers on Intel hardware? What exactly is a store buffer? gets more technical than this, with more x86 focus. This answer, I think, is applicable to most ISAs.)
Commit to L1d cache happens some time after the store instruction retires from the re-order buffer (ROB), i.e. when the store is known to be non-speculative, the associated store-buffer entry "graduates" and becomes eligible to commit to cache and become globally visible. A store buffer decouples execution from anything other cores can see, and also insulates this core from cache-miss stores so it's a very useful feature even on in-order CPUs.
Before a store-buffer entry "graduates", it can just be discarded along with the ROB entry that points to it, when rolling back on mis-speculation.
(This is why even strongly-ordered hardware memory models still allow StoreLoad reordering https://preshing.com/20120930/weak-vs-strong-memory-models/ - it's nearly essential for good performance not to make later loads wait for earlier stores to actually commit.)
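A sketch of the classic StoreLoad litmus test in C++ with std::atomic (the variable names are just the conventional ones for this test):

    #include <atomic>
    #include <thread>

    // With release/acquire (or relaxed) ordering, each thread's load can be
    // satisfied before its own earlier store has committed from the store buffer,
    // so r1 == 0 && r2 == 0 is a permitted outcome even on strongly-ordered x86.
    // Making both stores and loads seq_cst is what forbids it (on x86 that costs
    // an mfence or an xchg for the stores).
    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() { x.store(1, std::memory_order_release); r1 = y.load(std::memory_order_acquire); }
    void t2() { y.store(1, std::memory_order_release); r2 = x.load(std::memory_order_acquire); }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        // Possible (r1, r2): (1,1), (1,0), (0,1), and also (0,0) via StoreLoad reordering.
    }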
The store buffer is effectively a circular buffer: entries are allocated by the front-end (during the alloc/rename pipeline stage(s)) and released upon commit of the store to L1d cache (which is kept coherent with other cores via MESI).
Strongly-ordered memory models like x86 can be implemented by doing commit from the store buffer to L1d in order. Entries were allocated in program order, so the store buffer can basically be a circular buffer in hardware. Weakly-ordered ISAs can look at younger entries if the head of the store buffer is for a cache line that isn't ready yet.
Some ISAs (especially weakly ordered) also do merging of store buffer entries to create a single 8-byte commit to L1d out of a pair of 32-bit stores, for example.
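To make that allocate / discard / commit lifecycle concrete, here's a toy software model (my own sketch, not a description of any real design; a std::deque stands in for a fixed-size circular buffer, and merging isn't modelled):

    #include <cstddef>
    #include <cstdint>
    #include <deque>

    // One store-buffer entry: filled in by the store uop when it executes,
    // allowed to commit to L1d only after the store retires from the ROB.
    struct StoreEntry {
        std::uint64_t addr = 0;
        std::uint64_t data = 0;
        bool executed = false;   // address + data have been written by the store uop
        bool retired  = false;   // ROB has retired the store: known non-speculative
    };

    struct StoreBuffer {
        std::deque<StoreEntry> entries;

        // Alloc/rename stage reserves an entry in program order.
        StoreEntry& allocate() { entries.emplace_back(); return entries.back(); }

        // On mis-speculation, every entry younger than the bad branch is discarded.
        void rollback_to(std::size_t keep) {
            if (keep < entries.size()) entries.resize(keep);
        }

        // Commit the oldest entries to cache once they're retired
        // (x86-style in-order commit from the head).
        template <class CommitToL1d>
        void try_commit(CommitToL1d&& commit) {
            while (!entries.empty() && entries.front().executed && entries.front().retired) {
                commit(entries.front().addr, entries.front().data);
                entries.pop_front();
            }
        }
    };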
Reading cacheable memory regions is assumed to have no side effects and can be done speculatively by OoO exec, hardware prefetch, or whatever. Mis-speculation can "pollute" caches and waste some bandwidth by touching cache lines that the true path of execution wouldn't (and maybe even triggering speculative page-walks for TLB misses), but that's the only downside.¹
MMIO regions (where reads do have side-effects, e.g. making a network card or SATA controller do something) need to be marked as uncacheable so the CPU knows that speculative reads from that physical address are not allowed. If you get this wrong, your system will be unstable - my answer there covers a lot of the same details you're asking about for speculative loads.
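For example, a read with a side effect (the device, address, and register layout here are invented purely for illustration):

    #include <cstdint>

    // Hypothetical memory-mapped UART receive-data register, mapped uncacheable.
    volatile std::uint32_t* const uart_rx =
        reinterpret_cast<volatile std::uint32_t*>(0x40001000);

    std::uint8_t read_byte() {
        // Each read pops one byte from the device's RX FIFO, which is a side effect.
        // If the CPU were allowed to issue this load speculatively and then discard
        // the result on a mispredict, received bytes would silently disappear.
        return static_cast<std::uint8_t>(*uart_rx);
    }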
High performance CPUs have a load buffer with multiple entries to track in-flight loads, including ones that miss in L1d cache. (Allowing hit-under-miss and miss-under-miss even on in-order CPUs, stalling only if/when an instruction tries to read a load-result register that isn't ready yet.)
In an OoO exec CPU, it also allows OoO exec when one load address is ready before another. When data eventually arrives, instructions waiting for inputs from the load result become ready to run (if their other input was also ready). So the load buffer entries have to be wired up to the scheduler (called the reservation station in some CPUs).
See also About the RIDL vulnerabilities and the "replaying" of loads for more about how Intel CPUs specifically handle uops that are waiting by aggressively trying to start them on the cycle when data might be arriving from L2 for an L2 hit.
Footnote 1: This downside, combined with a timing side-channel for detecting / reading micro-architectural state (cache line hot or cold) into architectural state (register value), is what enables Spectre. (https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)#Mechanism)
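The well-known Spectre v1 gadget shape looks roughly like this (illustrative names and sizes; the *512 stride just spreads the possible secret byte values over distinct cache lines):

    #include <cstddef>
    #include <cstdint>

    std::size_t array1_size = 16;
    std::uint8_t array1[16];
    std::uint8_t array2[256 * 512];
    volatile std::uint8_t sink;

    void victim(std::size_t x) {
        // If this branch is mispredicted with an out-of-bounds x, the CPU may
        // speculatively load array1[x] (a secret byte) and use it to index array2,
        // pulling in a cache line whose address depends on the secret. Architectural
        // state is rolled back, but the hot cache line remains, and an attacker can
        // find it with a timing probe.
        if (x < array1_size) {
            sink = array2[array1[x] * 512];
        }
    }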
Understanding Meltdown as well is very useful for understanding the details of how Intel CPUs choose to handle fault-suppression for speculative loads that turn out to be on the wrong path. http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/
Yes, memory-destination instructions can be speculated too, by decoding them into logically separate load / ALU / store operations, if you're talking about modern x86 that decodes instructions to uops. The load works like a normal load, and the store puts the ALU result in the store buffer. All 3 of the operations can be scheduled normally by the out-of-order back-end, just as if you'd written separate instructions.
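For example, a plain (non-atomic) memory-destination increment, which a compiler will typically turn into something like add dword ptr [rdi], 1 on x86-64:

    // The single x86 instruction decodes into load + add + store uops; each can be
    // scheduled and executed speculatively, like separately written instructions.
    void plain_increment(int* p) {
        *p += 1;
    }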
If you mean atomic RMW, then that can't really be speculative. Cache is globally visible (share requests can come at any time) and there's no way to roll it back (well, except whatever Intel does for transactional memory...). You must not ever put a wrong value in cache. See Can num++ be atomic for 'int num'? for more about how atomic RMWs are handled, especially on modern x86, by delaying response to share / invalidate requests for that line between the load and the store-commit.
However, that doesn't mean that lock add [rdi], eax serializes the whole pipeline: Are loads and stores the only instructions that gets reordered? shows that speculative OoO exec of other independent instructions can happen around an atomic RMW (vs. what happens with an execution barrier like lfence that drains the ROB).

Many RISC ISAs only provide atomic RMW via load-linked / store-conditional instructions, not a single atomic RMW instruction.
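For contrast, an atomic increment sketched with std::atomic: on x86-64 fetch_add typically compiles to lock add dword ptr [rdi], 1 (when the result is unused), while pre-LSE AArch64 and many other RISC ISAs use an ldxr / stxr retry loop:

    #include <atomic>

    // An atomic RMW can't expose a speculative value to other cores; the core has
    // to keep the line exclusively owned from the load until the store commits
    // (or retry, with LL/SC, if another core interfered).
    void atomic_increment(std::atomic<int>& x) {
        x.fetch_add(1, std::memory_order_relaxed);
    }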
Huh? That's a false premise (registers are not stored in the cache), and the logic doesn't make sense. Cache has to be correct at all times because another core could ask you to share it at any moment. Unlike registers, which are private to this core.
Register files are built out of SRAM like cache, but they're separate. There are a few microcontrollers with SRAM memory (not cache) on board, where the registers are memory-mapped using the early bytes of that space (e.g. AVR). But none of that seems at all relevant to out-of-order execution; cache lines that are caching memory are definitely not the same ones that are being used for something completely different, like holding register values.
It's also not really plausible that a high-performance CPU that's spending the transistor budget to do speculative execution at all would combine cache with register file; then they'd compete for read/write ports. One large cache with the sum total read and write ports is much more expensive (area and power) than a tiny fast register file (many read/write ports) and a small (like 32kiB) L1d cache with a couple read ports and 1 write port. For the same reason we use split L1 caches, and have multi-level caches instead of just one big private cache per core in modern CPUs. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?