I'm evicting a memory range from CPU cache before freeing the memory. Ideally I would like to just abandon these cache lines without writing them back to memory, because no one is going to use those values, and whoever obtains that memory range again (after malloc()/new/_mm_malloc() etc.) will first fill the memory with new values. As this question suggests, there currently seems to be no way to achieve that ideal on x86_64.
Therefore I'm doing _mm_clflushopt(). As I understand it, after _mm_clflushopt() I need to call _mm_sfence() to make its non-temporal stores visible to other cores/processors. But in this specific case I don't need its stores.

So if I just don't call _mm_sfence(), can something bad happen? E.g. if some other core/processor manages to allocate that memory range again quickly enough and starts filling it with new data, can it happen that the new data gets concurrently overwritten by the old cache lines being flushed by the current core?
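Roughly, the code in question looks like this (a minimal sketch; buf, len and the 64-byte line size are placeholders):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>

// Minimal sketch of the current code. Needs a compiler/target with
// CLFLUSHOPT support (e.g. -mclflushopt).
static void flush_range_and_free(void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)63;  // align down to a 64-byte line
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += 64)
        _mm_clflushopt((void *)p);   // write back + evict each line
    _mm_sfence();                    // the fence this question asks about omitting
    free(buf);
}
```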
EDIT: the quick subsequent allocation is unlikely; I'm describing this case because I need the program to be correct there too.
clflushopt is a terrible idea for this use-case. Evicting lines from the cache before overwriting them is the opposite of what you want: if they're hot in cache, you avoid an RFO (read-for-ownership).

If you're using NT stores, they will evict any lines that were still hot, so it's not worth spending cycles doing clflushopt first.

If not, you're completely shooting yourself in the foot by guaranteeing the worst case. See Enhanced REP MOVSB for memcpy for more about writing to memory and about RFO vs. no-RFO stores. (E.g. rep movsb can do no-RFO stores, on Intel at least, but still leave the data hot in cache.) And keep in mind that an L3 hit can satisfy an RFO faster than going to DRAM.

If you're about to write a buffer with regular stores (which will RFO), you might prefetchw on it to get it into Exclusive state in your L1D before you're ready to actually write.
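A rough sketch of that idea, assuming GCC/Clang: __builtin_prefetch with rw=1 requests a write-prefetch, which can compile to an actual prefetchw instruction on targets that support it (e.g. with -mprfchw). The buffer, the fill loop, and the 64-byte line size are just illustrative.

```c
#include <stddef.h>
#include <stdint.h>

// Sketch: ask for write-ownership of lines a little ahead of where we're
// about to store. rw=1 means "prefetch for write"; locality=3 means keep it
// in all cache levels. This is a hint, not a guarantee.
static void fill_buffer(uint8_t *buf, size_t len, uint8_t value)
{
    const size_t line = 64;                 // assumed cache-line size
    for (size_t i = 0; i < len; i += line) {
        if (i + 8 * line < len)
            __builtin_prefetch(buf + i + 8 * line, 1, 3);  // prefetch ~8 lines ahead
        size_t stop = (i + line < len) ? i + line : len;
        for (size_t j = i; j < stop; ++j)
            buf[j] = value;                 // regular (RFO) stores
    }
}
```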
It's possible that clwb (Cache-Line Write Back, without eviction) would be useful here, but I think prefetchw will always be at least as good as that, if not better (especially on AMD, where MOESI cache coherency can transfer dirty lines between caches, so you could get a line into your L1D that's still dirty and be able to replace that data without ever sending the old data to DRAM).

Ideally, malloc will give you memory that's still hot in the L1D cache of the current core. If you're finding that a lot of the time you're getting buffers that are still dirty and in L1D or L2 on another core, then look into a malloc with per-thread pools or some kind of NUMA-like thread awareness.
No, don't think of clflushopt as a store. It's not making any new data globally visible, so it doesn't interact with the global ordering of memory operations.

sfence makes your thread's later stores wait until the flushed data has been written back all the way to DRAM or to memory-mapped non-volatile storage.
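For contrast, a minimal sketch of the situation where sfence after clflushopt does matter: a later store must not take effect before the flushed lines have actually left the cache, e.g. committing data in persistent memory or ringing a doorbell that starts non-coherent DMA. The record/doorbell names and the 64-byte line size are assumptions for illustration.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Sketch: order clflushopt write-backs before a later "go" store.
// Compile with a target that has CLFLUSHOPT (e.g. -mclflushopt).
static void commit_record(char *record, size_t len, volatile uint64_t *doorbell)
{
    for (size_t off = 0; off < len; off += 64)   // 64 = assumed cache-line size
        _mm_clflushopt(record + off);            // queue a write-back + evict per line
    _mm_sfence();      // the store below must not be reordered before the flushes
    *doorbell = 1;     // e.g. tell a DMA device or mark the record durable
}
```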
If you're flushing lines that are backed by regular DRAM, you only need sfence before a store that will initiate a non-coherent DMA operation that reads DRAM contents without checking cache. Since other CPU cores always go through cache, sfence is not useful or necessary for you (even if clflushopt were a good idea in the first place).

Even if you were talking about actual NT stores, other cores will eventually see your stores without sfence. You only need sfence if you need to make sure they see your NT stores before they see some later stores. I explained this in Make previous memory stores visible to subsequent memory loads.
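A minimal sketch of that NT-store case, just to make the ordering concrete: the sfence is only there so the flag can't become visible before the weakly-ordered NT data. The buf/ready names are illustrative.

```c
#include <immintrin.h>
#include <stdatomic.h>
#include <stddef.h>

// Sketch: publish data written with NT (streaming) stores to another thread.
// Without the sfence, the 'ready' flag could be seen before the NT stores.
static void publish(int *buf, size_t n, atomic_int *ready)
{
    for (size_t i = 0; i < n; ++i)
        _mm_stream_si32(&buf[i], (int)i);   // weakly-ordered NT stores, bypass cache
    _mm_sfence();                           // order the NT stores before the flag store
    atomic_store_explicit(ready, 1, memory_order_release);
}
```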
No, clflushopt doesn't affect cache coherency. It just triggers write-back (& eviction) without making later stores/loads wait for it. You could clflushopt memory allocated and in use by another thread without affecting correctness.