I'm trying to understand how the hardware cache works, so I wrote and ran this test program:
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE 64
#define L1_WAYS   8
#define L1_SETS   64
#define L1_LINES  512

// 32 KB buffer, the size of the L1 data cache
uint8_t data[L1_LINES * LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;   // __rdtscp takes an unsigned int *
    register uint64_t t1, t2;

    printf("data: %p\n", data);
    //_mm_clflush(data);     // toggle this line between the two runs

    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2lu, cycles: %lu\n", i, t2);
    }
}
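As an aside on the measurement itself: __rdtscp only waits for earlier instructions to finish; the timed load can still be reordered past the second timestamp read. A more careful timing helper would fence around the load. This is a minimal sketch of that idea (the time_load name is mine, and it is not what the runs below used):

/* Time a single byte load, with lfence on both sides so the load
 * cannot drift outside the measured window. */
static inline uint64_t time_load(volatile uint8_t *p)
{
    uint64_t t1, t2;
    _mm_lfence();   /* wait for all earlier instructions to complete */
    t1 = __rdtsc();
    _mm_lfence();   /* keep the load from starting before t1 is read */
    (void)*p;       /* the access being measured */
    _mm_lfence();   /* force the load to complete before reading t2 */
    t2 = __rdtsc();
    return t2 - t1;
}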
I ran the code with and without the _mm_clflush call, and the results show that with _mm_clflush the first memory access is actually faster.

with _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 280
i = 1, cycles: 84
i = 2, cycles: 91
i = 3, cycles: 77
i = 4, cycles: 91
w/o _mm_clflush:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 3899
i = 1, cycles: 91
i = 2, cycles: 105
i = 3, cycles: 77
i = 4, cycles: 84
It just does not make sense: you flush the cache line, yet the first access gets faster? Could anyone explain why this happens? Thanks
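One experiment that would separate the two effects (a sketch, under the assumption that the 3899 cycles is the cost of the first touch of the page, i.e. the demand-paging fault and TLB fill, which the clflush instruction happens to pay on the program's behalf): warm the page translation before the timed loop without warming the cache line being timed, by placing something like this just before the loop:

    /* Sketch: warm the page translation without warming the timed line.
     * Touch a different cache line in the same 4 KB page, then flush it,
     * so the TLB entry is hot but data[0..15] is still uncached.
     * If the first timed access now takes ~280 instead of ~3899 cycles,
     * the big number was the first-touch page fault / TLB fill, not DRAM. */
    junk = *(volatile uint8_t *)&data[LINE_SIZE];  /* same page, other line */
    _mm_clflush(&data[LINE_SIZE]);                 /* evict it again */
    _mm_mfence();                                  /* let the flush finish */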
----------------Further experiment-------------------
Let's assume the 3899 cycles is caused by a TLB miss. To test my understanding of cache hits and misses, I slightly modified the code to compare the memory access time for an L1 cache hit against an L1 cache miss. This time, the code strides by the cache line size (64 bytes), so each access touches a new cache line.
    *data = 1;
    _mm_clflush(data);

    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2lu, cycles: %lu\n", i, t2);
    }

    // Invalidate and flush the cache line that contains data[0]
    // from all levels of the cache hierarchy.
    _mm_clflush(data);

    printf("accessing 16 bytes in different cache lines:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i * LINE_SIZE];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2lu, cycles: %lu\n", i, t2);
    }
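One caveat about this snippet before looking at the numbers: _mm_clflush(data) flushes only the single 64-byte line containing data[0]. The other 15 lines in the strided loop are never flushed; they start out cold only because they have never been touched, so anything the hardware prefetcher pulled in during the first loop stays cached. To guarantee every access starts uncached, each line would have to be flushed, along these lines (a sketch):

    /* Sketch: flush every line the strided loop will touch, not just
     * the first one, so prefetched lines cannot survive into the test. */
    for (i = 0; i < 16; i++)
        _mm_clflush(&data[i * LINE_SIZE]);
    _mm_mfence();   /* wait for the flushes before starting the timings */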
My computer has an 8-way set-associative L1 data cache with 64 sets, 32 KB in total. If I access memory every 64 bytes, each access touches a different cache line, so every access should be a cache miss. But it seems a lot of the cache lines have already been cached:
$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i = 0, cycles: 273
i = 1, cycles: 70
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 70
i = 5, cycles: 70
i = 6, cycles: 70
i = 7, cycles: 70
i = 8, cycles: 70
i = 9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i = 0, cycles: 301
i = 1, cycles: 133
i = 2, cycles: 70
i = 3, cycles: 70
i = 4, cycles: 147
i = 5, cycles: 56
i = 6, cycles: 70
i = 7, cycles: 63
i = 8, cycles: 70
i = 9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105
Is this caused by the prefetcher? Or is there something wrong with my understanding? Thanks
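Intel's hardware prefetchers follow a fixed +64-byte stride easily, so the sequential second loop is an easy pattern for them. One way to test the prefetch hypothesis (a sketch; the particular permutation below is just an example) is to visit the same 16 lines in a scrambled order so the stride is no longer predictable:

    /* Visit the same 16 lines in a non-sequential order; if the latencies
     * jump back up, the low numbers in the sequential run came from the
     * hardware prefetcher rather than from lines being cached beforehand. */
    static const int order[16] = { 11, 3, 14, 0, 8, 5, 12, 1,
                                   15, 6, 9, 2, 13, 4, 10, 7 };
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[order[i] * LINE_SIZE];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2lu, cycles: %lu\n", i, t2);
    }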
I modified the code by adding a write before _mm_clflush(data), and it shows that clflush does flush the cache line. I ran the modified code on my computer (an Intel(R) Core(TM) i5-8500 CPU) and got the results below: across several tries, the first access to data that was flushed to memory beforehand has a noticeably higher latency than when it was not flushed.
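Based on that description, the modification amounts to something like this sketch (the USE_CLFLUSH guard is my own way of expressing the two runs; the write puts the line in the Modified state, so clflush must write it back to memory before invalidating it):

    /* Sketch of the write-then-flush comparison described above. */
    *data = 1;               /* bring the line in, in Modified state */
#ifdef USE_CLFLUSH
    _mm_clflush(data);       /* write back and invalidate the line */
    _mm_mfence();            /* make sure the flush has completed */
#endif
    t1 = __rdtscp(&junk);
    junk = *(volatile uint8_t *)data;   /* first timed access */
    t2 = __rdtscp(&junk) - t1;
    printf("first access: %lu cycles\n", t2);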
without clflush:
with clflush: