PAPI performance counter issues on an AMD Opteron 6172


I've been trying to analyze certain applications (written in C) with performance counters on an AMD Opteron 6172 processor running Red Hat Enterprise Linux Workstation release 6.2 (Santiago).

I'm using PAPI v4.1.3.0, which maps PAPI_TOT_CYC (total cycles) to the AMD native event CPU_CLK_UNHALTED and PAPI_L1_DCA (L1 data cache accesses) to DATA_CACHE_ACCESSES.

The problem I've been experiencing is that in some cases the number of cache accesses is higher than the total number of cycles. To my understanding a cache access does not halt the CPU, so the access count should fit within the total cycle count. Also, when I divide the total cycles by the clock frequency of the Opteron 6172, I get a pretty accurate estimate of the execution time. That makes me think the total cycle count is OK and the problem has to be with the counting of data cache accesses.
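The two sanity checks described above can be sketched in plain C. This is only an illustration: the helper names are hypothetical, the 2.1 GHz value is an assumed nominal clock for the Opteron 6172, and the counter values would come from whatever PAPI reported for PAPI_TOT_CYC and PAPI_L1_DCA.

```c
#include <stdint.h>

/* Assumed nominal core clock of the Opteron 6172 (2.1 GHz).
   Adjust this if frequency scaling is active on the machine. */
#define ASSUMED_CLOCK_HZ 2.1e9

/* Estimate execution time in seconds from an unhalted-cycle count
   (e.g. the value read for PAPI_TOT_CYC). */
double cycles_to_seconds(uint64_t cycles, double clock_hz) {
    return (double)cycles / clock_hz;
}

/* Average number of data cache access increments per unhalted cycle
   (e.g. PAPI_L1_DCA divided by PAPI_TOT_CYC).
   Returns 0.0 when no cycles were counted, to avoid dividing by zero. */
double accesses_per_cycle(uint64_t accesses, uint64_t cycles) {
    if (cycles == 0) {
        return 0.0;
    }
    return (double)accesses / (double)cycles;
}
```

A ratio above 1.0 from `accesses_per_cycle` is the situation described in the question; on its own it does not show that either counter is wrong.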

I've initialized everything according to the PAPI examples and don't get any errors whatsoever. Any help, or an explanation of why this can occur, is greatly appreciated. Thanks in advance.

http://support.amd.com/us/Processor_TechDocs/31116.pdf

  • CPU_CLK_UNHALTED

The number of clocks that the CPU is not in a halted state (due to STPCLK or a HLT instruction). Note: this event allows system idle time to be automatically factored out from IPC (or CPI) measurements, providing the OS halts the CPU when going idle. If the OS goes into an idle loop rather than halting, such calculations are influenced by the IPC of the idle loop.

  • DATA_CACHE_ACCESSES

The number of accesses to the data cache for load and store references. This may include certain microcode scratchpad accesses, although these are generally rare. Each increment represents an eight-byte access, although the instruction may only be accessing a portion of that. This event is a speculative event.


There is 1 answer

dx_mrt On

Ok, here are my guesses:

  1. A cache access can turn into a main-memory access if the data isn't in the cache, which may stall the CPU. Try measuring last-level cache (LLC) misses; each LLC miss implies one access to main memory.

  2. Are any other programs executing at the same time? If so, they may be stalling the processor or generating the cache misses you're measuring.

  3. I'm pretty sure the core can issue one load and one store instruction per clock cycle, so seeing up to 2 cache accesses per clock cycle isn't that weird...
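To put rough numbers on guess 3, here is a small C sketch that combines it with the event description quoted in the question ("each increment represents an eight-byte access"). It assumes, without confirmation from the AMD documentation, that an access wider than eight bytes is counted once per eight-byte chunk. The function names are hypothetical, and alignment and speculative accesses are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: an access of `bytes` bytes bumps DATA_CACHE_ACCESSES once
   per eight-byte chunk it touches, rounded up. A 4-byte load still counts
   as one increment, per the event description. */
uint64_t dca_increments(size_t bytes) {
    return (uint64_t)((bytes + 7) / 8);
}

/* Hypothetical upper bound on counter increments per cycle, given how many
   memory instructions issue per cycle and how wide each access is. */
uint64_t max_increments_per_cycle(unsigned issued_per_cycle,
                                  size_t bytes_per_access) {
    return (uint64_t)issued_per_cycle * dca_increments(bytes_per_access);
}
```

Under these assumptions, one load plus one store per cycle gives `max_increments_per_cycle(2, 8)`, i.e. 2 increments per cycle, and the same pair of 16-byte SSE accesses would give 4, so a count above the cycle count is plausible.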

Hope it was helpful...