Core A writes a value x to its store buffer, waits for the invalidate acks, and then flushes x to the cache. Does it wait for just one ack, or for all of them? And how does it know how many acks to expect from the other CPUs?
When does a CPU flush a value from the store buffer to the L1 cache?
Asked by Pengcheng
It isn't clear to me what you mean by "invalid ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line.
In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores, since stores in the store buffer are not yet globally visible. A store only becomes globally visible when it commits to L1 at some point after it has retired. At this point¹ the cache controller will make an RFO (request for ownership) for the associated line if the line isn't already in the cache in an owned state. Once ownership is obtained and the store is written to L1, the store is globally visible. The L1 cache controller doesn't need to know how many other invalidations are in flight, because they are mediated by higher-level components in the system as part of the MESI protocol: when the controller gets the line in the E state, it is guaranteed to be the exclusive owner.
In short, invalidations from other cores have little effect on stores in the store buffer², since those stores become globally visible at a single point based on an RFO request. It is loads that have already executed that are more likely to be affected by invalidation activity on another core, especially on strongly ordered platforms such as x86, which doesn't allow visible load-load reordering. The so-called MOB (memory order buffer) on x86, for example, is responsible for tracking whether invalidations potentially break the ordering rules.
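To make the commit sequence concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions: `ToyCore`, `snoop_invalidate`, `commit_oldest` and the 64-byte line size are hypothetical names and values chosen for illustration, the RFO is reduced to a counter, and store-to-load forwarding is not modeled. It shows that an invalidation touches only the L1 copy, while the buffered store simply requests ownership again when it commits.

```python
from collections import deque
from dataclasses import dataclass, field

LINE_SIZE = 64  # assumed cache line size in bytes


def line_of(addr):
    """Map a byte address to its cache line number."""
    return addr // LINE_SIZE


@dataclass
class ToyCore:
    """Toy model of one core: an in-order store buffer in front of a private L1."""
    owned_lines: set = field(default_factory=set)       # lines held in E/M state
    l1: dict = field(default_factory=dict)              # address -> globally visible value
    store_buffer: deque = field(default_factory=deque)  # pending (address, value) pairs
    rfo_count: int = 0                                   # number of RFOs issued

    def store(self, addr, value):
        # A retired store waits in the buffer; other cores cannot see it yet.
        self.store_buffer.append((addr, value))

    def snoop_invalidate(self, line):
        # Another core took ownership of the line. Only the L1 copy is affected;
        # pending entries in the store buffer are left alone.
        self.owned_lines.discard(line)

    def commit_oldest(self):
        # Commit the most senior store: obtain ownership if needed, then write L1.
        addr, value = self.store_buffer.popleft()
        line = line_of(addr)
        if line not in self.owned_lines:
            self.rfo_count += 1           # stands in for one RFO round trip
            self.owned_lines.add(line)
        self.l1[addr] = value             # the store is now globally visible


if __name__ == "__main__":
    core = ToyCore()
    core.store(0x100, 1)
    core.snoop_invalidate(line_of(0x100))  # arrives while the store is still buffered
    core.commit_oldest()
    print(core.l1, core.rfo_count)
```

In this model the core never counts invalidations; it only cares whether it owns the line at the moment the oldest store commits.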
RFO Response
Perhaps the "acks" you were talking about are the responses from other cores to the writing core's request to obtain or upgrade its ownership of the line so that it can write to it: i.e., invalidating copies of the line in the other CPUs and so on.
This is commonly known as issuing an RFO, which, when successful, leaves the line in the E state in the requesting core.
Most CPUs are layered, with a variety of different agents working together to ensure coherency. In practice, this means that a core doesn't need to wait for up to N-1 "acks" from the other N-1 cores on an N-core system, but rather for just a single reply from a higher-level component, which is itself in charge of sending requests to and collecting responses from the other cores.
One example could be a single-socket multi-core CPU with private L1 and L2 caches and a shared L3. A core might send its RFO down to the L3, which might send invalidate requests to all other cores, wait for their responses, and then acknowledge the RFO to the requesting core. Alternatively, the L3 may store some bits which indicate which cores could possibly have a copy of the line, and then it only needs to send the requests to those cores (the role the L3 takes in that case is sometimes referred to as a snoop filter).
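The snoop-filter arrangement described above can be sketched as a small Python model. This is an illustration under assumptions, not a description of any real design: `ToyL3`, its methods and the reply string are hypothetical, and message passing is reduced to counters. The point it shows is that the L3 sends invalidations only to recorded sharers and collects their acks itself, so the requesting core sees exactly one reply.

```python
class ToyL3:
    """Toy shared L3 that tracks possible sharers of each line (a snoop filter)."""

    def __init__(self):
        self.sharers = {}             # line -> set of core ids that may hold a copy
        self.invalidations_sent = 0   # invalidate requests sent to other cores
        self.acks_collected = 0       # acks gathered by the L3, not by the requester

    def read(self, core, line):
        # A read brings the line into `core`'s private caches; record the sharer.
        self.sharers.setdefault(line, set()).add(core)

    def rfo(self, core, line):
        """Handle a request for ownership and return the single reply the core sees."""
        targets = self.sharers.get(line, set()) - {core}
        for _other in sorted(targets):
            self.invalidations_sent += 1   # invalidate request to one sharer
            self.acks_collected += 1       # its ack comes back to the L3
        self.sharers[line] = {core}        # the requester is now the only holder
        return "ownership granted"


if __name__ == "__main__":
    l3 = ToyL3()
    l3.read(1, line=5)
    l3.read(3, line=5)
    reply = l3.rfo(0, line=5)   # core 0 wants to write line 5
    print(reply, l3.invalidations_sent, l3.acks_collected)
```

However many cores the system has, core 0 never needs to know the sharer count: only the two recorded sharers are contacted, and the bookkeeping stays inside the L3.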
Since all communication between agents passes through the L3, it is able to keep everything consistent. In the case of a multi-socket system, things get more complicated: the L3 on the local socket may again get the request and may pass it over to the other socket to do the same type of invalidation there. Again there might be a snoop filter, or other mechanisms may exist, and the behavior may even be configurable!
For example, in Intel's Broadwell Xeon architecture, there are fully four different configurable snoop modes:
... with different performance tradeoffs:
The rest of that document goes into some detail about how the various modes work.
So I guess the short answer is "it's complicated and depends on the detailed design and possibly even user-configurable settings".
¹ Or potentially at some earlier point, since an optimized implementation might "look ahead" in the store buffer and issue RFOs (so-called "RFO prefetches") for upcoming stores even before they become the most senior store.
² Invalidations may, however, complicate the RFO prefetches mentioned in the first footnote, since there is a window in which the line can be "stolen back" by another core, making the RFO prefetch wasted work. A sophisticated implementation may have a predictor that varies the RFO prefetch aggressiveness based on monitoring whether this occurs.