Does _mm_clflush really flush the cache?

I'm trying to understand how the hardware cache works by writing and running a test program:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64

#define L1_WAYS     8
#define L1_SETS     64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;  // __rdtscp writes its aux output through an unsigned int *
    register uint64_t t1, t2;

    printf("data: %p\n", data);

    //_mm_clflush(data);
    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}

I ran the code with and without the _mm_clflush, yet the results show that with _mm_clflush the first memory access is actually faster.

with _mm_clflush:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 280
i =  1, cycles: 84
i =  2, cycles: 91
i =  3, cycles: 77
i =  4, cycles: 91

w/o _mm_clflush:

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 3899
i =  1, cycles: 91
i =  2, cycles: 105
i =  3, cycles: 77
i =  4, cycles: 84

It just does not make sense: you flush the cache line, yet the access gets faster? Could anyone explain why this happens? Thanks

----------------Further experiment-------------------

Let's assume the 3899 cycles are caused by a TLB miss. To verify my understanding of cache hits and misses, I slightly modified the code to compare the memory access time of an L1 cache hit with that of an L1 cache miss.

This time, the second loop strides by the cache line size (64 bytes), so each iteration accesses the first byte of the next cache line.

*data = 1;
_mm_clflush(data);
printf("accessing 16 bytes in a cache line:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}

// Invalidate and flush the cache line that contains data from all levels of the cache hierarchy.
_mm_clflush(data);
printf("accessing 16 bytes in different cache lines:\n");
for (i = 0; i < 16; i++) {
    t1 = __rdtscp(&junk);
    addr = &data[i*LINE_SIZE];
    junk = *addr;
    t2 = __rdtscp(&junk) - t1;
    printf("i = %2d, cycles: %ld\n", i, t2);
}

My computer has an 8-way set-associative L1 data cache with 64 sets, 32 KB in total. If I access memory at a 64-byte stride, every access should miss in the cache. But it seems many of the lines are already cached (a set-index sanity check follows the output below):

$ ./l1
data: 0x700c00
accessing 16 bytes in a cache line:
i =  0, cycles: 273
i =  1, cycles: 70
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 70
i =  5, cycles: 70
i =  6, cycles: 70
i =  7, cycles: 70
i =  8, cycles: 70
i =  9, cycles: 70
i = 10, cycles: 77
i = 11, cycles: 70
i = 12, cycles: 70
i = 13, cycles: 70
i = 14, cycles: 70
i = 15, cycles: 140
accessing 16 bytes in different cache lines:
i =  0, cycles: 301
i =  1, cycles: 133
i =  2, cycles: 70
i =  3, cycles: 70
i =  4, cycles: 147
i =  5, cycles: 56
i =  6, cycles: 70
i =  7, cycles: 63
i =  8, cycles: 70
i =  9, cycles: 63
i = 10, cycles: 70
i = 11, cycles: 112
i = 12, cycles: 147
i = 13, cycles: 119
i = 14, cycles: 56
i = 15, cycles: 105
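
As a sanity check on the geometry, here is a small sketch of mine (not part of the original program, assuming the 64-set, 64-byte-line L1 described above): the set index is bits [11:6] of the address, so a 64-byte stride over contiguous lines lands in a different set on each of the 16 accesses, and conflict misses cannot explain the hits.

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64
#define L1_SETS   64

int main(void)
{
    static uint8_t data[16 * LINE_SIZE];
    uintptr_t a;
    unsigned i;

    for (i = 0; i < 16; i++) {
        a = (uintptr_t)&data[i * LINE_SIZE];
        // set index = (address / line size) mod number of sets
        printf("i = %2u, set = %2u\n", i, (unsigned)((a >> 6) % L1_SETS));
    }
    return 0;
}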

Is this caused by the prefetcher? Or is there anything wrong with my understanding? Thanks

There are 2 answers

Answer by Luo Yongping:

I modified the code by adding a write before _mm_clflush(data), and it shows that clflush does flush the cache line. The modified code:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINE_SIZE   64
#define L1_LINES    512

// 32K memory for filling in L1 cache
uint8_t data[L1_LINES*LINE_SIZE];

int main()
{
    volatile uint8_t *addr;
    register uint64_t i;
    unsigned int junk = 0;
    register uint64_t t1, t2;

    printf("data: %p\n", (void *)data);
    data[0] = 1; // write before clflush
    //_mm_clflush(data);

    printf("accessing 16 bytes in a cache line:\n");
    for (i = 0; i < 16; i++) {
        t1 = __rdtscp(&junk);
        addr = &data[i];
        junk = *addr;
        t2 = __rdtscp(&junk) - t1;
        printf("i = %2d, cycles: %ld\n", i, t2);
    }
}

I ran the modified code on my computer (Intel(R) Core(TM) i5-8500 CPU) and got the results below. Across several runs, the first access to data that had been flushed shows a clearly higher latency than the first access when nothing was flushed.

without clflush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 64
i =  1, cycles: 46
i =  2, cycles: 49
i =  3, cycles: 48
i =  4, cycles: 46

with clflush:

data: 0000000000407980
accessing 16 bytes in a cache line:
i =  0, cycles: 214
i =  1, cycles: 41
i =  2, cycles: 40
i =  3, cycles: 42
i =  4, cycles: 40

Answer by Hadi Brais:

Without clflush, the first load takes about 3899 cycles, which is about the amount of time it takes to handle a minor page fault. rdtscp serializes the load operations, thereby ensuring that all later loads to the same line hit in the L1 cache. Now when you add clflush just before the loop, the page fault is triggered and handled outside the loop. When the page fault handler returns and clflush gets re-executed, the target cache line gets flushed. On Intel processors, rdtscp ensures that the line is flushed before the first load in the loop is issued. Therefore, the first load misses in the cache hierarchy and its latency will be about that of a memory access. Just like the previous case, later loads are serialized by rdtscp and so they all hit in the L1D.
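
One way to check the page-fault part of this explanation is a sketch of mine (not from the answer): read the process's minor-fault count with getrusage around the first touch of the array; ru_minflt should go up by one if the load demand-faults.

#include <stdio.h>
#include <stdint.h>
#include <sys/resource.h>

#define LINE_SIZE 64
#define L1_LINES  512

uint8_t data[L1_LINES * LINE_SIZE];

int main(void)
{
    struct rusage ru1, ru2;
    volatile uint8_t junk;

    getrusage(RUSAGE_SELF, &ru1);
    junk = data[0];               // first touch of the page
    getrusage(RUSAGE_SELF, &ru2);

    // A demand fault on the first load shows up as one extra minor fault.
    printf("minor faults around first touch: %ld\n",
           ru2.ru_minflt - ru1.ru_minflt);
    (void)junk;
    return 0;
}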

The measured L1D hit latencies are too high, though, even if we consider the overhead of rdtscp. Did you compile with -O3?
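
On the measurement overhead: one common alternative pattern (a sketch, not the only valid arrangement; on Intel CPUs lfence is dispatch-serializing and waits for prior instructions to complete locally) brackets the timed load with rdtsc plus explicit fences instead of relying on rdtscp alone. Using the same variables as the question's loop body:

_mm_lfence();     // finish everything in flight before starting the clock
t1 = __rdtsc();
_mm_lfence();     // keep the load from starting before rdtsc completes
junk = *addr;     // the load being timed
_mm_lfence();     // make sure the load has completed before stopping the clock
t2 = __rdtsc() - t1;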

I wasn't able to reproduce your results (i.e., the minor page fault) with gcc 5.5.0 on Linux 4.4.0-154 when the cache line is allocated statically, but only when I use mmap. If you tell me your compiler version and kernel version, maybe I can investigate this further.
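
For anyone trying to reproduce the fault this way, here is a minimal mmap sketch (mine; the size and flags are illustrative): memory from an anonymous mapping is demand-paged, so the very first touch of each page takes a minor fault.

#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define LINE_SIZE 64
#define L1_LINES  512

int main(void)
{
    // Anonymous mappings are demand-paged: no physical page is allocated
    // until the first access, so the first load takes a minor fault.
    uint8_t *data = mmap(NULL, L1_LINES * LINE_SIZE,
                         PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("data: %p\n", (void *)data);
    return 0;
}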

Regarding your second question, the way you're measuring load latency will not enable you to distinguish between an L1D hit and an L2 hit because the error in measurement could be as large as the difference in latencies. You can use the MEM_LOAD_UOPS_RETIRED.L1_HIT and MEM_LOAD_UOPS_RETIRED.L2_HIT performance counters to check. A sequential access pattern is pretty easily detectable by the L1 and L2 hardware prefetchers, so it's not surprising to get hits if you don't switch the prefetchers off.
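
For reference, a typical way to read those counters with perf (the MEM_LOAD_UOPS_RETIRED spelling applies to some microarchitectures; newer cores name them differently, e.g. mem_load_retired.l1_hit, so check perf list first), plus the documented Intel prefetcher-control MSR 0x1A4 if you want the hardware prefetchers off while measuring:

$ perf list | grep -i l1_hit
$ perf stat -e mem_load_uops_retired.l1_hit,mem_load_uops_retired.l2_hit ./l1

$ sudo modprobe msr
$ sudo wrmsr -a 0x1a4 0xf    # disable all four HW prefetchers (MSR 0x1A4, bits 0-3)
$ sudo wrmsr -a 0x1a4 0x0    # re-enable them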