Unique matrix transpose problem: contradictory reports from cachegrind and perf

This question concerns an algorithm that transposes a matrix of complex values, struct complex {double real = 0.0; double imag = 0.0;};. Owing to a special data layout, there is a stride of n*n between consecutive rows, which means that loading a subsequent row evicts the previously loaded row from the cache. All runs were done with a single thread.
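
To make the access pattern concrete, here is a minimal sketch of what I mean by the layout (the accessor and its indexing are my own illustration, not the real container):

#include <cstddef>

struct complex { double real = 0.0; double imag = 0.0; };

// Hypothetical view of the data layout: elements within a row are contiguous,
// but consecutive rows are n*n elements apart, so touching the next row
// evicts the cache lines holding the previous one.
struct strided_matrix
{
    complex* data;
    std::size_t n;

    complex& operator()(std::size_t i, std::size_t j) const
    {
        return data[i * n * n + j]; // stride of n*n complex values between rows
    }
};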

I'm trying to understand why my 'optimized' transpose function, which uses 2D blocking, performs badly (follow-up to: 2D blocking with unique matrix transpose problem), so I'm using performance counters and a cache simulator to find out what's going wrong.

According to my analysis, if n = 500 is the matrix size, b = 4 my block size and c = 4 the number of complex elements per 64-byte cache line, then for the naive algorithm,

// Reads of mat_in are row-wise (contiguous); writes to mat_out are
// column-wise, i.e. one strided row per element (n1 == n3 == n here).
for (auto i1 = std::size_t{}; i1 < n1; ++i1)
{
    for (auto i3 = std::size_t{}; i3 < n3; ++i3)
    {
        mat_out(i3, i1) = mat_in(i1, i3);
    }
}

Number of cache-references: (read) n*n + (write) n*n
Number of cache-misses: (read) n*n / c + (write) n*n
Rate of misses: 62.5 %.
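
Spelling the arithmetic out: with 16-byte elements and 64-byte lines, c = 4 elements share a cache line. The reads of mat_in walk along a row and miss once per line (n*n / c misses), whereas every write to mat_out lands in a different, soon-to-be-evicted row, so all n*n writes miss. Hence (n*n/4 + n*n) / (2*n*n) = 1.25 / 2 = 62.5 %.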

Sure enough, cachegrind reports essentially the same rate:

==21470== Cachegrind, a cache and branch-prediction profiler
==21470== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21470== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21470== Command: ./benchmark/benchmarking_transpose_vslices_dir2_naive 500
==21470== 
--21470-- warning: L3 cache found, using its data for the LL simulation.
--21470-- warning: specified LL cache: line_size 64  assoc 12  total_size 9,437,184
--21470-- warning: simulated LL cache: line_size 64  assoc 18  total_size 9,437,184
==21470== 
==21470== I   refs:      30,130,879,636
==21470== I1  misses:             7,666
==21470== LLi misses:             6,286
==21470== I1  miss rate:           0.00%
==21470== LLi miss rate:           0.00%
==21470== 
==21470== D   refs:      13,285,386,487  (6,705,198,115 rd   + 6,580,188,372 wr)
==21470== D1  misses:     8,177,337,186  (1,626,402,679 rd   + 6,550,934,507 wr)
==21470== LLd misses:     3,301,064,720  (1,625,156,375 rd   + 1,675,908,345 wr)
==21470== D1  miss rate:           61.6% (         24.3%     +          99.6%  )
==21470== LLd miss rate:           24.8% (         24.2%     +          25.5%  )
==21470== 
==21470== LL refs:        8,177,344,852  (1,626,410,345 rd   + 6,550,934,507 wr)
==21470== LL misses:      3,301,071,006  (1,625,162,661 rd   + 1,675,908,345 wr)
==21470== LL miss rate:             7.6% (          4.4%     +          25.5%  )

Now for the implementation with blocking, I expect the per-block numbers given after the code below.

Hint: the following code omits the remainder loops. The container intermediate_result, sized b x b, is used (as suggested by @JérômeRichard) to prevent cache thrashing.

for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
{
    for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
    {
        // Read the b x b block of mat_in row-wise (contiguous) and store it
        // transposed in the small intermediate_result buffer.
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                intermediate_result(i3, i1) = mat_in(bi1 + i1, bi3 + i3);
            }
        }

        // Write the already-transposed block from the buffer to mat_out,
        // contiguously within each destination row.
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            #pragma omp simd safelen(8)
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                mat_out(bi3 + i1, bi1 + i3) = intermediate_result(i1, i3);
            }
        }
    }
}
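
For reference, intermediate_result is nothing exotic, just a small contiguous block_size x block_size scratch buffer. A minimal sketch of how it could be declared (the actual code may differ):

// Hypothetical declaration (needs <vector>): a contiguous
// block_size x block_size scratch buffer that fits easily in L1.
std::vector<complex> scratch(block_size * block_size);

// Accessor with the same call syntax as used in the loops above.
auto intermediate_result = [&](std::size_t i, std::size_t j) -> complex& {
    return scratch[i * block_size + j];
};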

Number of cache-references (per b x b block): (read) b*b + (write) b*b
Number of cache-misses (per block): (read) b*b / c + (write) b*b / c
Rate of misses: 25 %.
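
In other words, per b x b block with b = c = 4: the reads are contiguous within each row of mat_in and the writes are contiguous within each row of mat_out, so both sides miss only once per cache line, b*b / c = 4 times each. That is 8 misses out of 2*b*b = 32 accesses, i.e. 25 %.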

Once again, cachegrind gives me the following report:

==21473== Cachegrind, a cache and branch-prediction profiler
==21473== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21473== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21473== Command: ./benchmark/benchmarking_transpose_vslices_dir2_best 500 4
==21473== 
--21473-- warning: L3 cache found, using its data for the LL simulation.
--21473-- warning: specified LL cache: line_size 64  assoc 12  total_size 9,437,184
--21473-- warning: simulated LL cache: line_size 64  assoc 18  total_size 9,437,184
==21473== 
==21473== I   refs:      157,135,137,350
==21473== I1  misses:             11,057
==21473== LLi misses:              9,604
==21473== I1  miss rate:            0.00%
==21473== LLi miss rate:            0.00%
==21473== 
==21473== D   refs:       43,995,141,079  (29,709,076,051 rd   + 14,286,065,028 wr)
==21473== D1  misses:      3,307,834,114  ( 1,631,898,173 rd   +  1,675,935,941 wr)
==21473== LLd misses:      3,301,066,570  ( 1,625,157,620 rd   +  1,675,908,950 wr)
==21473== D1  miss rate:             7.5% (           5.5%     +           11.7%  )
==21473== LLd miss rate:             7.5% (           5.5%     +           11.7%  )
==21473== 
==21473== LL refs:         3,307,845,171  ( 1,631,909,230 rd   +  1,675,935,941 wr)
==21473== LL misses:       3,301,076,174  ( 1,625,167,224 rd   +  1,675,908,950 wr)
==21473== LL miss rate:              1.6% (           0.9%     +           11.7%  )

At this point I cannot explain the discrepancy between my prediction (25 %) and cachegrind's report (7.5 % D1 miss rate), except to speculate that it might be due to prefetching.

Now, profiling the same naive implementation with perf stat -d, I get:

Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_naive 500':

         91.122,33 msec task-clock                #    0,933 CPUs utilized          
           870.939      context-switches          #    0,010 M/sec                  
                17      cpu-migrations            #    0,000 K/sec                  
        50.807.083      page-faults               #    0,558 M/sec                  
   354.169.268.894      cycles                    #    3,887 GHz                    
   217.031.159.494      instructions              #    0,61  insn per cycle         
    34.980.334.095      branches                  #  383,883 M/sec                  
       148.578.378      branch-misses             #    0,42% of all branches        
    58.473.530.591      L1-dcache-loads           #  641,704 M/sec                  
    12.636.479.302      L1-dcache-load-misses     #   21,61% of all L1-dcache hits  
       440.543.654      LLC-loads                 #    4,835 M/sec                  
       276.733.102      LLC-load-misses           #   62,82% of all LL-cache hits   

      97,705649040 seconds time elapsed

      45,526653000 seconds user
      47,295247000 seconds sys

When I do the same for the implementation with 2D-blocking, I get:

Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_best 500 4':

         79.865,16 msec task-clock                #    0,932 CPUs utilized          
           766.200      context-switches          #    0,010 M/sec                  
                12      cpu-migrations            #    0,000 K/sec                  
        50.807.088      page-faults               #    0,636 M/sec                  
   310.452.015.452      cycles                    #    3,887 GHz                    
   343.399.743.845      instructions              #    1,11  insn per cycle         
    51.889.725.247      branches                  #  649,717 M/sec                  
       133.541.902      branch-misses             #    0,26% of all branches        
    81.279.037.114      L1-dcache-loads           # 1017,703 M/sec                  
     7.722.318.725      L1-dcache-load-misses     #    9,50% of all L1-dcache hits  
       399.149.174      LLC-loads                 #    4,998 M/sec                  
       123.134.807      LLC-load-misses           #   30,85% of all LL-cache hits   

      85,660207381 seconds time elapsed

      34,524170000 seconds user
      46,884443000 seconds sys

Questions:

  1. Why do cachegrind and perf report such different figures for the L1D and the LLC?
  2. Why is the L3 cache-miss rate (according to perf) still so bad in the case of the blocking algorithm? This gets noticeably worse once I start using 6 cores.

Any tips on how to detect cache-thrashing will also be appreciated.
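
For context, one experiment I was planning in that direction (just a sketch, in case it helps frame an answer): re-run the simulation with an artificially high L1D associativity via cachegrind's --D1 option, keeping the LL parameters from the warning lines above; if the D1 misses drop sharply, that would point to conflict misses (thrashing) rather than capacity misses.

valgrind --tool=cachegrind --D1=32768,64,64 --LL=9437184,18,64 ./benchmark/benchmarking_transpose_vslices_dir2_best 500 4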

Thanks in advance for your time and help; I'm glad to provide additional information upon request.

Additional Info:

The processor used for testing here is the (Coffee Lake) Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz.

  1. CPU with 6 cores operating at 2.80 GHz - 4.00 GHz
  2. L1 6x 32 KiB 8-way set associative (64 sets)
  3. L2 6x 256 KiB 4-way set associative (1024 sets)
  4. shared L3 9 MiB 12-way set associative (12288 sets)