Can we measure successful store-forwarding with Intel's performance counters?


Is it possible to measure the number of successful store-forwarding operations using the performance counters on recent Intel x86 chips?

I see events for ld_blocks.store_forward, which measure failed store-forwarding, but it's not clear to me whether the successful case can be measured.


There are 2 answers

Hadi Brais (best answer)

There is no documented event that counts successful store-forwarding operations. However, I have experimentally determined a set of undocumented events for that purpose on Haswell and Broadwell. In particular, any event with event code 0x02 and an odd umask (such as 0x01) seems to count successful store forwarding very accurately: the counts are as expected and the standard deviation is practically zero. I think you can use the same events on later (and even earlier) microarchitectures. Again, none of these events are documented.
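For anyone who wants to try this with Linux `perf`, raw core events are written as `r` followed by a hex number, with the umask in bits 8-15 and the event code in bits 0-7. Here is a minimal sketch of that packing. The helper names and the `./a.out` workload are hypothetical, and because the event is undocumented, what it counts on a given CPU is an assumption you would need to verify yourself.

```python
def raw_event_config(event: int, umask: int) -> int:
    """Pack an 8-bit event code and an 8-bit umask into a raw perf config value."""
    if not 0 <= event <= 0xFF:
        raise ValueError("event code must fit in 8 bits")
    if not 0 <= umask <= 0xFF:
        raise ValueError("umask must fit in 8 bits")
    # umask occupies bits 8-15, the event code occupies bits 0-7
    return (umask << 8) | event


def perf_raw_name(event: int, umask: int) -> str:
    """Format the packed value in perf's rNNN raw-event syntax (hex)."""
    return f"r{raw_event_config(event, umask):x}"


def perf_stat_command(event: int, umask: int, workload: list) -> list:
    """Build an argv list for `perf stat` on the given workload (hypothetical helper)."""
    return ["perf", "stat", "-e", perf_raw_name(event, umask), "--", *workload]


if __name__ == "__main__":
    # Event 0x02 with umask 0x01: the undocumented candidate described above.
    print(" ".join(perf_stat_command(0x02, 0x01, ["./a.out"])))
```

The same event can also be spelled `cpu/event=0x02,umask=0x01/` on the `perf` command line, which is easier to read than the packed form.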

Peter Cordes

I don't see anything more than you did for SKL, but older uarches may have more details:

For Core2 (what Intel confusingly calls the Core microarchitecture), the optimization manual documents (in B.7 EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE):

B.7.5.2 4K Aliasing and Store Forwarding Block Detection

  1. Loads Blocked by Overlapping Store Rate: LOAD_BLOCK.OVERLAP_STORE/CPU_CLK_UNHALTED.CORE

4K aliasing and store forwarding block are two different scenarios in which loads are blocked by preceding stores due to different reasons. Both scenarios are detected by the same event: LOAD_BLOCK.OVERLAP_STORE. A high value for “Loads Blocked by Overlapping Store Rate” indicates that either 4K aliasing or store forwarding block may affect performance.

This may count both stalled and successful store-forwarding. (It also counts 4K aliasing, so you need to avoid that or subtract it.)

B.7.5.3 Load Block by Preceding Stores

  1. Loads Blocked by Unknown Store Address Rate: LOAD_BLOCK.STA / CPU_CLK_UNHALTED.CORE

A high value for “Loads Blocked by Unknown Store Address Rate” indicates that loads are frequently blocked by preceding stores with unknown address and implies performance penalty.

  1. Loads Blocked by Unknown Store Data Rate: LOAD_BLOCK.STD / CPU_CLK_UNHALTED.CORE

A high value for “Loads Blocked by Unknown Store Data Rate” indicates that loads are frequently blocked by preceding stores with unknown data and implies performance penalty.

These last two counters would appear to count successful store forwarding, but only in cases where the load actually had to wait after detecting the (possible) overlap.
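The three ratios quoted from the manual are each just an event count divided by unhalted core cycles. A minimal sketch of that arithmetic, assuming you have already collected the four Core2 counts into a dict keyed by event name (the function name and the dict layout are hypothetical):

```python
def load_block_rates(counts: dict) -> dict:
    """Compute the three 'loads blocked' ratios from raw Core2 event counts."""
    cycles = counts["CPU_CLK_UNHALTED.CORE"]
    if cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.CORE must be positive")
    return {
        # Loads Blocked by Overlapping Store Rate (also includes 4K aliasing)
        "overlap_store_rate": counts["LOAD_BLOCK.OVERLAP_STORE"] / cycles,
        # Loads Blocked by Unknown Store Address Rate
        "unknown_store_address_rate": counts["LOAD_BLOCK.STA"] / cycles,
        # Loads Blocked by Unknown Store Data Rate
        "unknown_store_data_rate": counts["LOAD_BLOCK.STD"] / cycles,
    }
```

Since these are per-cycle rates, they show how often loads were blocked relative to run time. They do not give a count of successful forwards.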