This question is about an algorithm that transposes a matrix of complex values, struct complex { double real = 0.0; double imag = 0.0; };. Owing to a special data layout, there is a stride-n*n access between the rows, which means that loading a subsequent row causes the eviction of the previously loaded row. All runs have been done using 1 thread only.
I'm trying to understand why my "optimized" transpose function, which uses 2D blocking, performs badly (follow-up to: 2D blocking with unique matrix transpose problem), so I'm using performance counters and cache simulators to get a reading on what's going wrong.
According to my analysis, with n = 500 being the size of the matrix, b = 4 my block size and c = 4 my cache-line size in complex elements (64 bytes per line / 16 bytes per element), we have for the naive algorithm,
for (auto i1 = std::size_t{}; i1 < n1; ++i1)
{
    for (auto i3 = std::size_t{}; i3 < n3; ++i3)
    {
        mat_out(i3, i1) = mat_in(i1, i3);
    }
}
Number of cache-references: n*n (read) + n*n (write)
Number of cache-misses: n*n / c (read) + n*n (write)
Rate of misses: 62.5 %.
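Spelled out as a quick self-check (treating c = 4 complex elements per 64-byte line is my assumption, from sizeof(complex) == 16):

#include <cstdio>

int main()
{
    constexpr double n = 500.0; // matrix dimension
    constexpr double c = 4.0;   // complex elements per 64-byte cache line

    // naive transpose: sequential reads miss once per cache line,
    // strided writes miss on every element
    constexpr double refs   = n * n + n * n;
    constexpr double misses = n * n / c + n * n;

    std::printf("expected D1 miss rate: %.1f %%\n", 100.0 * misses / refs); // 62.5 %
}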
Sure enough, I'm getting essentially the same numbers from cachegrind:
==21470== Cachegrind, a cache and branch-prediction profiler
==21470== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21470== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21470== Command: ./benchmark/benchmarking_transpose_vslices_dir2_naive 500
==21470==
--21470-- warning: L3 cache found, using its data for the LL simulation.
--21470-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21470-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21470==
==21470== I refs: 30,130,879,636
==21470== I1 misses: 7,666
==21470== LLi misses: 6,286
==21470== I1 miss rate: 0.00%
==21470== LLi miss rate: 0.00%
==21470==
==21470== D refs: 13,285,386,487 (6,705,198,115 rd + 6,580,188,372 wr)
==21470== D1 misses: 8,177,337,186 (1,626,402,679 rd + 6,550,934,507 wr)
==21470== LLd misses: 3,301,064,720 (1,625,156,375 rd + 1,675,908,345 wr)
==21470== D1 miss rate: 61.6% ( 24.3% + 99.6% )
==21470== LLd miss rate: 24.8% ( 24.2% + 25.5% )
==21470==
==21470== LL refs: 8,177,344,852 (1,626,410,345 rd + 6,550,934,507 wr)
==21470== LL misses: 3,301,071,006 (1,625,162,661 rd + 1,675,908,345 wr)
==21470== LL miss rate: 7.6% ( 4.4% + 25.5% )
Now for the implementation with blocking, I expect the following.
Hint: The following code is shown without the remainder loops. The container intermediate_result, sized b x b (as per the suggestion by @JérômeRichard), is used in order to prevent cache-thrashing.
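For reference, intermediate_result is nothing fancy; a minimal sketch of the kind of row-major b x b scratch buffer I mean (simplified, not my exact container) looks like this:

#include <cstddef>
#include <vector>

// b x b scratch buffer with (row, column) access, small enough to stay in L1;
// uses the complex struct defined above
struct block_buffer
{
    std::size_t b;
    std::vector<complex> data;

    explicit block_buffer(std::size_t block_size)
        : b(block_size), data(block_size * block_size) {}

    complex&       operator()(std::size_t i, std::size_t j)       { return data[i * b + j]; }
    const complex& operator()(std::size_t i, std::size_t j) const { return data[i * b + j]; }
};

// usage: block_buffer intermediate_result(block_size);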
for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
{
    for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
    {
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                intermediate_result(i3, i1) = mat_in(bi1 + i1, bi3 + i3);
            }
        }
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            #pragma omp simd safelen(8)
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                mat_out(bi3 + i1, bi1 + i3) = intermediate_result(i1, i3);
            }
        }
    }
}
Number of cache-references (per block): b*b (read) + b*b (write)
Number of cache-misses (per block): b*b / c (read) + b*b / c (write)
Rate of misses: 25 %.
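Same kind of self-check as above, this time per b x b block (ignoring the accesses to intermediate_result itself, which I assume stays resident in L1):

#include <cstdio>

int main()
{
    constexpr double b = 4.0; // block size
    constexpr double c = 4.0; // complex elements per 64-byte cache line

    // per b x b block: both the reads from mat_in and the writes to mat_out
    // should now miss only once per cache line
    constexpr double refs   = b * b + b * b;
    constexpr double misses = b * b / c + b * b / c;

    std::printf("expected D1 miss rate: %.1f %%\n", 100.0 * misses / refs); // 25.0 %
}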
Once again, cachegrind gives me the following report:
==21473== Cachegrind, a cache and branch-prediction profiler
==21473== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21473== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21473== Command: ./benchmark/benchmarking_transpose_vslices_dir2_best 500 4
==21473==
--21473-- warning: L3 cache found, using its data for the LL simulation.
--21473-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21473-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21473==
==21473== I refs: 157,135,137,350
==21473== I1 misses: 11,057
==21473== LLi misses: 9,604
==21473== I1 miss rate: 0.00%
==21473== LLi miss rate: 0.00%
==21473==
==21473== D refs: 43,995,141,079 (29,709,076,051 rd + 14,286,065,028 wr)
==21473== D1 misses: 3,307,834,114 ( 1,631,898,173 rd + 1,675,935,941 wr)
==21473== LLd misses: 3,301,066,570 ( 1,625,157,620 rd + 1,675,908,950 wr)
==21473== D1 miss rate: 7.5% ( 5.5% + 11.7% )
==21473== LLd miss rate: 7.5% ( 5.5% + 11.7% )
==21473==
==21473== LL refs: 3,307,845,171 ( 1,631,909,230 rd + 1,675,935,941 wr)
==21473== LL misses: 3,301,076,174 ( 1,625,167,224 rd + 1,675,908,950 wr)
==21473== LL miss rate: 1.6% ( 0.9% + 11.7% )
I cannot explain this discrepancy at this point, except to speculate that this might be because of prefetching.
Now, when I profile the same naive implementation with perf stat (using the -d option), I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_naive 500':
91.122,33 msec task-clock # 0,933 CPUs utilized
870.939 context-switches # 0,010 M/sec
17 cpu-migrations # 0,000 K/sec
50.807.083 page-faults # 0,558 M/sec
354.169.268.894 cycles # 3,887 GHz
217.031.159.494 instructions # 0,61 insn per cycle
34.980.334.095 branches # 383,883 M/sec
148.578.378 branch-misses # 0,42% of all branches
58.473.530.591 L1-dcache-loads # 641,704 M/sec
12.636.479.302 L1-dcache-load-misses # 21,61% of all L1-dcache hits
440.543.654 LLC-loads # 4,835 M/sec
276.733.102 LLC-load-misses # 62,82% of all LL-cache hits
97,705649040 seconds time elapsed
45,526653000 seconds user
47,295247000 seconds sys
When I do the same for the implementation with 2D-blocking, I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_best 500 4':
79.865,16 msec task-clock # 0,932 CPUs utilized
766.200 context-switches # 0,010 M/sec
12 cpu-migrations # 0,000 K/sec
50.807.088 page-faults # 0,636 M/sec
310.452.015.452 cycles # 3,887 GHz
343.399.743.845 instructions # 1,11 insn per cycle
51.889.725.247 branches # 649,717 M/sec
133.541.902 branch-misses # 0,26% of all branches
81.279.037.114 L1-dcache-loads # 1017,703 M/sec
7.722.318.725 L1-dcache-load-misses # 9,50% of all L1-dcache hits
399.149.174 LLC-loads # 4,998 M/sec
123.134.807 LLC-load-misses # 30,85% of all LL-cache hits
85,660207381 seconds time elapsed
34,524170000 seconds user
46,884443000 seconds sys
Questions:
- Why is there a strong difference in the output here for L1D and LLC?
- Why are we seeing such a bad L3 cache-miss rate (according to perf) in the case of the blocking algorithm? This is obviously exacerbated when I start using 6 cores.
Any tips on how to detect cache-thrashing will also be appreciated.
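To make that last request concrete: the only by-hand check I know is to map row start addresses to cache-set indices, along the lines of the sketch below (the stride value and the plain address-bit indexing model are my assumptions). If there is a tool-based way to see thrashing directly, that's what I'm after.

#include <cstdio>

// Rough sketch: map the start address of each "row" (separated by the stride-n*n
// layout) to an L1 set index, assuming 64-byte lines and 64 sets (32 KiB, 8-way).
int main()
{
    constexpr unsigned long long line_size        = 64;
    constexpr unsigned long long n_sets           = 64;                      // 32 KiB / 8 ways / 64 B
    constexpr unsigned long long row_stride_bytes = 500ull * 500ull * 16ull; // n*n complex elements (my assumption)

    for (unsigned long long row = 0; row < 8; ++row)
    {
        const unsigned long long addr = row * row_stride_bytes;
        std::printf("row %llu -> L1 set %llu\n", row, (addr / line_size) % n_sets);
    }
}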
Thanks in advance for your time and help; I'm happy to provide additional information upon request.
Additional Info:
The processor used for testing here is the (Coffee Lake) Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz.
- CPU with 6 cores operating at 2.80 GHz - 4.00 GHz
- L1 6x 32 KiB 8-way set associative (64 sets)
- L2 6x 256 KiB 4-way set associative (1024 sets)
- shared L3 9 MiB 12-way set associative (12288 sets)
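For scale, my back-of-the-envelope working-set numbers (just arithmetic on the sizes above; sizeof(complex) == 16 is my assumption):

#include <cstdio>

int main()
{
    constexpr double n          = 500.0;
    constexpr double elem_bytes = 16.0;  // sizeof(complex): two doubles
    constexpr double mib        = 1024.0 * 1024.0;

    const double one_matrix = n * n * elem_bytes / mib;              // data actually touched per matrix
    std::printf("touched per matrix : %.2f MiB\n", one_matrix);      // ~3.81 MiB
    std::printf("input + output     : %.2f MiB\n", 2.0 * one_matrix); // ~7.63 MiB vs the 9 MiB L3
}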