What (bad) can happen if I don't issue _mm_sfence() after _mm_clflushopt()?


I'm evicting a memory range from the CPU cache before freeing the memory. Ideally I would like to just abandon these cache lines without writing them back to memory, because no one is going to use those values, and whoever obtains that memory range again (after malloc()/new/_mm_malloc() etc.) will first fill it with new values. As this question suggests, there currently seems to be no way to achieve that ideal on x86_64.

Therefore I'm doing _mm_clflushopt(). As I understand it, after _mm_clflushopt() I need to call _mm_sfence() to make its non-temporal stores visible to other cores/processors. But in this specific case I don't need its stores.
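Concretely, the flushing I'm doing looks roughly like this (a minimal sketch; the flush_range helper and the hard-coded 64-byte line size are just for illustration):

    #include <immintrin.h>   // _mm_clflushopt, _mm_sfence (compile with -mclflushopt)
    #include <stdint.h>
    #include <stddef.h>

    // Flush every cache line covering [p, p + len) back to memory before the
    // buffer is freed. Assumes 64-byte cache lines.
    static void flush_range(const void *p, size_t len)
    {
        uintptr_t start = (uintptr_t)p & ~(uintptr_t)63;   // round down to a line boundary
        uintptr_t end   = (uintptr_t)p + len;
        for (uintptr_t addr = start; addr < end; addr += 64)
            _mm_clflushopt((void *)addr);
        _mm_sfence();   // the fence I'm asking about
    }

After this I free the buffer.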

So if I just don't call _mm_sfence(), can something bad happen? E.g. if some other core/processor manages to allocate that memory range again quickly enough and starts filling it with new data, can the new data get overwritten by the old cache lines still being flushed by the current core?

EDIT: the quick subsequent allocation is unlikely; I'm just describing this case because I need the program to be correct there too.


1 Answer

Answered by Peter Cordes:

clflushopt is a terrible idea for this use case. Evicting lines from the cache before overwriting them is the opposite of what you want. If they're hot in cache, you avoid an RFO (read-for-ownership).

If you're using NT stores, they will evict any lines that were still hot, so it's not worth spending cycles doing clflushopt first.
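For illustration, an NT-store fill might look like the following sketch (the nt_fill name is made up; it assumes the destination is 16-byte aligned and the length is a multiple of 16 bytes):

    #include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_set1_epi8, _mm_sfence
    #include <stddef.h>

    // Fill dst with a byte value using non-temporal (streaming) stores,
    // which go around the cache and evict any lines that were still hot.
    static void nt_fill(void *dst, char value, size_t len)
    {
        __m128i v = _mm_set1_epi8(value);
        char *p = (char *)dst;
        for (size_t i = 0; i < len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), v);
        _mm_sfence();   // order the NT stores before any later "publish" store
    }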

If you're not using NT stores, you're completely shooting yourself in the foot by guaranteeing the worst case. See Enhanced REP MOVSB for memcpy for more about writing to memory, and RFO vs. no-RFO stores. (e.g. rep movsb can do no-RFO stores on Intel at least, but still leave the data hot in cache.) And keep in mind that an L3 hit can satisfy an RFO faster than going to DRAM.

If you're about to write a buffer with regular stores (that will RFO), you might prefetchw on it to get it into Exclusive state in your L1D before you're ready to actually write.
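A sketch of that idea, using GCC/Clang's __builtin_prefetch with the write hint (which can compile to prefetchw where available; the 64-byte stride and the helper name are assumptions):

    #include <stddef.h>

    // Warm a buffer for writing: request each line in Exclusive/Modified state
    // ahead of the actual stores. Assumes 64-byte cache lines.
    static void prefetch_for_write(const void *p, size_t len)
    {
        const char *c = (const char *)p;
        for (size_t i = 0; i < len; i += 64)
            __builtin_prefetch(c + i, /*rw=*/1, /*locality=*/3);
    }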

It's possible that clwb (Cache-Line Write Back (without evicting)) would be useful here, but I think prefetchw will always be at least as good as that, if not better (especially on AMD where MOESI cache coherency can transfer dirty lines between caches, so you could get a line into your L1D that's still dirty, and be able to replace that data without ever sending the old data to DRAM.)

Ideally, malloc will give you memory that's still hot in the L1D cache of the current core. If you're finding that a lot of the time, you're getting buffers that are still dirty and in L1D or L2 on another core, then look into a malloc with per-thread pools or some kind of NUMA-like thread awareness.

As I understand it, after _mm_clflushopt() I need to call _mm_sfence() to make its non-temporal stores visible to other cores/processors.

No, don't think of clflushopt as a store. It's not making any new data globally visible, so it doesn't interact with the global ordering of memory operations.

sfence makes your thread's later stores wait until the flushed data has made it all the way to DRAM or to memory-mapped non-volatile storage.

If you're flushing lines that are backed by regular DRAM, you only need sfence before a store that will initiate a non-coherent DMA operation that will read DRAM contents without checking cache. Since other CPU cores do always go through cache, sfence is not useful or necessary for you. (Even if clflushopt was a good idea in the first place.)
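That DMA case would look something like this sketch, where start_dma_read is a hypothetical driver call for a device that reads DRAM without snooping caches; the point is only the ordering of the flushes, the fence, and the hand-off:

    #include <immintrin.h>   // _mm_clflushopt, _mm_sfence
    #include <stdint.h>
    #include <stddef.h>

    void start_dma_read(const void *buf, size_t len);   // hypothetical: kicks off non-coherent DMA

    static void hand_off_to_device(const void *buf, size_t len)
    {
        // Write back every line of the buffer so DRAM holds the current data.
        for (uintptr_t a = (uintptr_t)buf & ~(uintptr_t)63; a < (uintptr_t)buf + len; a += 64)
            _mm_clflushopt((void *)a);
        _mm_sfence();                 // order the flushes before the store that starts the DMA
        start_dma_read(buf, len);     // only now may the device read DRAM directly
    }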


Even if you were talking about actual NT stores, other cores will eventually see your stores without sfence. You only need sfence if you need to make sure they see your NT stores before they see some later stores. I explained this in "Make previous memory stores visible to subsequent memory loads".
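For example (a sketch with made-up payload/ready names), sfence matters when you publish NT-stored data through an ordinary flag:

    #include <emmintrin.h>    // _mm_stream_si32, _mm_sfence
    #include <stdatomic.h>

    static int payload[1024];
    static atomic_int ready;

    void producer(void)
    {
        for (int i = 0; i < 1024; i++)
            _mm_stream_si32(&payload[i], i);   // NT stores are weakly ordered
        _mm_sfence();                          // drain them before the flag becomes visible
        atomic_store_explicit(&ready, 1, memory_order_release);
        // A consumer that sees ready == 1 is now guaranteed to see the payload.
    }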

can something bad happen?

No, clflushopt doesn't affect cache coherency. It just triggers write-back (& eviction) without making later stores/loads wait for it.

You could clflushopt memory allocated and in use by another thread without affecting correctness.