I am trying to use some of the uncore hardware counters, such as skx_unc_imc0-5::UNC_M_WPQ_INSERTS. It is supposed to count the number of allocations into the Write Pending Queue (WPQ). The machine has two Intel Xeon Gold 5218 CPUs (Cascade Lake architecture), with two memory controllers per CPU. The Linux kernel version is 5.4.0-3-amd64. I have the following simple loop and am reading this counter for it. Array elements are 64 bytes in size, equal to a cache line.
for(int i=0; i < 1000000; i++){
array[i].value=2;
}
For this loop, when I map the memory to the DRAM NUMA node, the counter reports around 150,000, which plausibly makes sense: there are 6 channels in total across the 2 memory controllers in front of this NUMA node, and the DRAM DIMMs are used in interleaved mode. I believe each channel has its own WPQ, so skx_unc_imc0 sees about 1/6 of all the stores. There are skx_unc_imc0-5 counters, which I got from papi_native_avail, supposedly one per channel.
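The 1/6 estimate can be written out as a small calculation. This is only a rough sketch: the helper name expected_inserts_per_channel is hypothetical, and it assumes that every written cache line reaches the memory controller exactly once and that interleaving spreads lines evenly over the channels.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: expected WPQ inserts seen by ONE channel.
 * Assumptions (not verified on the machine in question):
 *   - each written 64-byte line is written back to memory exactly once
 *   - lines are interleaved uniformly over `channels` channels */
uint64_t expected_inserts_per_channel(uint64_t lines_written, unsigned channels)
{
    assert(channels > 0);
    return lines_written / channels;
}
```

For the loop above, expected_inserts_per_channel(1000000, 6) gives roughly 166,000, which is in the same range as the observed value of around 150,000. Lines still sitting in the cache when the counter is read would be one possible reason for the measured value being somewhat lower.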
The unexpected result comes when, instead of mapping to the DRAM NUMA node, I map the program to non-volatile memory (NVM), which is presented as a separate NUMA node on the same socket. There are 6 NVM DIMMs per socket, which form one interleaved region. So when writing to NVM, 6 channels should similarly be used, each with the same single WPQ in front of it, which should again receive 1/6 of the write inserts.
But UNC_M_WPQ_INSERTS returns only around 1,000 on NV memory. I don't understand why; I expected it to report a similar count of around 150,000 inserts into the WPQ.
Am I interpreting or understanding something wrong? Or are there two different WPQs per channel, depending on whether the write goes to DRAM or NVM? Or what else could be the explanation?
It turns out that UNC_M_WPQ_INSERTS counts allocations into the Write Pending Queue only for writes to DRAM. Intel has added a corresponding hardware counter for persistent memory, UNC_M_PMM_WPQ_INSERTS, which counts write requests allocated in the PMM Write Pending Queue for Intel® Optane™ DC persistent memory. However, no such native event shows up in papi_native_avail, which means it cannot be monitored with PAPI yet. In Linux 5.4, some of the PMM counters can be found directly in perf list uncore, such as unc_m_pmm_bandwidth.write - Intel Optane DC persistent memory bandwidth write (MB/sec), derived from unc_m_pmm_wpq_inserts, unit: uncore_imc. This implies that even though UNC_M_PMM_WPQ_INSERTS is not directly listed in perf list as an event, it should exist on the machine. As described here, the EventCode for this counter is 0xE7, so it can be used with perf as a raw hardware event descriptor as follows:
perf stat -e uncore_imc/event=0xe7/
However, it seems that it does not support event modifiers for restricting counting to user space with perf. After pinning the thread to the same socket as the NVM NUMA node, for a program that basically only runs the loop described in the question, the result of perf kind of makes sense. So far this seems to be the best guess.
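Since PAPI does not expose the event, the same raw event could in principle be opened from inside the program with the perf_event_open system call. The following is a minimal sketch, not a tested measurement setup. The helper names are hypothetical, and the config layout (event in bits 0-7, umask in bits 8-15) is an assumption that should be checked against /sys/bus/event_source/devices/uncore_imc_0/format/ on the target machine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: build the raw config value for an uncore_imc event.
 * Assumed layout: event = config bits 0-7, umask = config bits 8-15. */
uint64_t imc_raw_config(uint8_t event, uint8_t umask)
{
    return (uint64_t)event | ((uint64_t)umask << 8);
}

/* Read the dynamic PMU type id from a sysfs "type" file,
 * e.g. /sys/bus/event_source/devices/uncore_imc_0/type.
 * Returns -1 if the file is missing or unreadable. */
int read_pmu_type(const char *path)
{
    FILE *f = fopen(path, "r");
    int type = -1;
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}

/* Open one counting event on the given CPU. Uncore PMUs are not tied to a
 * task, so pid is -1 and a concrete cpu is passed. The exclude_* bits are
 * left unset because uncore counters count at all privilege levels.
 * Returns a file descriptor, or -1 on failure (errno is set). */
int open_imc_counter(int pmu_type, uint64_t config, int cpu)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = (uint32_t)pmu_type;
    attr.config = config;
    return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}
```

A caller would pass imc_raw_config(0xe7, 0) as the config, choose a CPU on the socket that hosts the NVM NUMA node, and read a uint64_t from the returned descriptor with read() before and after the loop. Each uncore_imc_N device has its own type file, so covering all channels would mean opening one event per device and summing the results. Opening such events typically requires root or a permissive perf_event_paranoid setting.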