What are the costs of failed store-to-load forwarding on recent x86 architectures?
In particular, I mean store-to-load forwarding that fails because the load partly overlaps an earlier store, or because the load or the earlier store crosses some alignment boundary that causes the forwarding to fail.
Certainly there is a latency cost: how big is it? Is there also a throughput cost, e.g., does a failed store-to-load forwarding use additional resources that are then unavailable to other loads and stores, or even other non-memory operations?
Is there a difference when all the parts of the load come from the store buffer, versus the case where it's a mix of the store buffer and L1?
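To make the partial-overlap case concrete, here is a minimal sketch of the kind of loop I have in mind. All names (`g_buf`, `chain_forwarded`, `chain_partial`, `ns_per_iter`) are hypothetical, and it assumes a little-endian x86 target and that the compiler really emits one store and one load per iteration, which should be checked in the disassembly. `std::atomic_signal_fence` is used here only as a compiler barrier.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstring>

// Global buffer (hypothetical name) so the accesses go through memory.
alignas(64) unsigned char g_buf[64];

// Compiler-only barrier: keeps the store and the load from being merged.
inline void compiler_barrier() {
    std::atomic_signal_fence(std::memory_order_seq_cst);
}

// 8-byte store followed by an 8-byte load of the same bytes.
// The load is fully covered by the store, so forwarding can succeed.
std::uint64_t chain_forwarded(std::uint64_t iters) {
    std::memset(g_buf, 0, sizeof g_buf);
    std::uint64_t v = 1;
    for (std::uint64_t i = 0; i < iters; ++i) {
        std::memcpy(g_buf, &v, sizeof v);   // 8-byte store
        compiler_barrier();
        std::memcpy(&v, g_buf, sizeof v);   // 8-byte load
        compiler_barrier();
        v += 1;                             // next store depends on this load
    }
    return v;
}

// 4-byte store followed by an 8-byte load starting at the same address.
// The load only partly overlaps the store, which is the failing case asked about.
std::uint64_t chain_partial(std::uint64_t iters) {
    std::memset(g_buf, 0, sizeof g_buf);
    std::uint64_t v = 1;
    for (std::uint64_t i = 0; i < iters; ++i) {
        std::uint32_t lo = static_cast<std::uint32_t>(v);
        std::memcpy(g_buf, &lo, sizeof lo); // 4-byte store
        compiler_barrier();
        std::memcpy(&v, g_buf, sizeof v);   // 8-byte load, upper half not in the store
        compiler_barrier();
        v += 1;
    }
    return v;
}

// Wall-clock time per iteration in nanoseconds for one of the chains above.
double ns_per_iter(std::uint64_t (*fn)(std::uint64_t), std::uint64_t iters) {
    assert(iters > 0);
    auto t0 = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = fn(iters); // keep the result alive
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(iters);
}
```

Comparing `ns_per_iter(chain_forwarded, n)` with `ns_per_iter(chain_partial, n)` for a large `n` would give a rough estimate of the latency difference, because each iteration's store depends on the previous load. It says nothing about throughput, which would need independent store/load pairs rather than one dependency chain.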
It is not really a full answer, but still evidence that the penalty is visible.
MSVC 2022 benchmark, compiled with /std:c++latest.
CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz 2.21 GHz
I interpret the results as follows: