(I know that there have been a few somewhat related questions asked in the past, but I wasn't able to find a question regarding L1d cache misses and HyperThreading/SMT.)

After reading for a couple of days about some super interesting stuff like False Sharing and the MESI/MOESI cache coherence protocols, I decided to write a small "benchmark" in C (see below) to test False Sharing in action.

I basically have an array of 8 doubles, so that it fits in one cache line, and two threads incrementing adjacent array positions.
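(To spell out the layout assumption: with the usual 64-byte cache line, 8 doubles fill exactly one line. Below is a tiny sketch of how that could be made explicit; the _Alignas qualifier and the shared_line name are just for illustration, the actual benchmark further down uses a plain stack array.)

#define CACHE_LINE 64   /* assumed cache line size */

/* 8 doubles = 64 bytes, i.e. exactly one cache line */
_Static_assert(sizeof(double[8]) == CACHE_LINE, "expected 64 bytes");

/* aligning to the line size guarantees the array sits entirely in ONE line;
   without it the array is still contiguous but could in principle straddle
   a line boundary (illustration only, not used in the benchmark below) */
_Alignas(CACHE_LINE) static double shared_line[8];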

At this point I should state that I am using a Ryzen 5 3600, whose topology can be seen here.

I create two threads and then pin them to two different logical cores, and each accesses and updates its own array position, i.e. Thread A updates array[0] and Thread B updates array[1].
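(Side note on the pinning: in the code below I actually put both chosen CPUs into a single cpu_set_t and let both threads use that mask, so it is the scheduler that ends up placing one thread on each of the two allowed CPUs. A stricter version would pin each thread to exactly one logical CPU, roughly like the sketch below; pin_to_cpu is a hypothetical helper, not part of the benchmark.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single logical CPU.
   Returns 0 on success, otherwise an error number from pthread_setaffinity_np(). */
static int pin_to_cpu(int cpu_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);   /* the mask contains exactly one logical CPU */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Each worker would then call, e.g., pin_to_cpu(0) and pin_to_cpu(6) (or 11) at the top of work().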

When I run the code using Hardware Threads #0 and #6, which belong to the same core and (as shown in the topology diagram) share the L1d cache, the execution time is ~5 seconds.

When I use Threads #0 and #11, which don't have any caches in common, it takes ~9.5 seconds to complete. That time difference is expected, because in this case there is "Cache Line Ping-Pong" going on.

However, and this is what bugs me, when I am using Threads #0 and #11, there are fewer L1d cache misses than when running with Threads #0 and #6.

My guess is that when using Threads #0 and #11, which don't share any caches, then when one thread updates the contents of the shared cache line, the copy of the line in the other core gets invalidated, according to the MESI/MOESI protocol. So even though there's a ping-pong going on, there aren't that many cache misses (compared to when running with Threads #0 and #6), just a bunch of invalidations and cache line transfers between the cores.

So, when using Threads #0 and #6 that have a common L1d cache, why are there more cache misses?

(Threads #0 and #6 also have a common L2 cache, but I don't think that matters here: when the cache line gets invalidated, it has to be fetched either from main memory (MESI) or from another core's cache (MOESI), so it seems impossible for the L2 to even have the data needed, let alone be asked for it.)

Of course, when one thread writes into the L1d cache line, the line gets "dirty", but why does that matter? Shouldn't the other thread that resides on the same physical core have no problem reading the new "dirty" value?

TL;DR: When testing False Sharing, there are approximately 3x more L1d cache misses when using two sibling threads (threads that belong to the same physical core) than when using threads that belong to two different physical cores (2.34% vs. 0.75% miss rate, 396M vs. 118M absolute misses). Why is that happening?

(All statistics, like the L1d cache misses, are measured using the perf tool on Linux.)

Also, a minor secondary question: why are the sibling threads' IDs paired 6 apart, i.e. Thread 0's sibling is Thread 6, and in general Thread i's sibling is Thread i+6? Does that help in any way? I've noticed this on both Intel and AMD CPUs.
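(For anyone who wants to check the pairing on their own machine, the kernel exposes it under /sys; here is a quick sketch that just prints what Linux reports, assuming the usual topology files are present.)

#include <stdio.h>

/* Print each logical CPU's SMT sibling list as reported by the kernel,
   e.g. "cpu0: 0,6" on my Ryzen 5 3600. Stops at the first CPU that
   doesn't exist. */
int main(void)
{
    for (int cpu = 0; ; cpu++)
    {
        char path[128], line[64];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;
        if (fgets(line, sizeof(line), f))
            printf("cpu%d: %s", cpu, line);  /* line already ends in '\n' */
        fclose(f);
    }
    return 0;
}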

I am super interested in Computer Architecture and I am still learning, so some of the above might be wrong, so sorry for that.

So, this is my code. Just creating two threads, binding them to specific logical cores and then hitting adjacent cache line locations.

#define _GNU_SOURCE

#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/random.h>
#include <time.h>
#include <pthread.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

struct timespec tstart, tend;
static cpu_set_t cpuset;


typedef struct arg_s
{
       int index;
       double *array_ptr;
} arg_t;

void *work(void *arg)
{
    // both threads share one affinity mask containing both chosen CPUs,
    // so the scheduler decides which thread lands on which of the two
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    int array_index = ((arg_t*)arg)->index;
    double *ptr = ((arg_t*)arg)->array_ptr;

    for(unsigned long i=0; i<1000000000; i++)
    {
            // it doesn't matter which of these is used,
            // as long as we are hitting adjacent positions
            ptr[array_index]++;
            // ptr[array_index] += 1.0e5 * 4;
    }
    return NULL;
}

int main()
{
    pthread_t tid[2];

    srand48(time(NULL)); // drand48() below is seeded with srand48(), not srand()
    
    static int cpu0 = 0;
    static int cpu6 = 6; //change this to say 11 to run with threads 0 and 11

    CPU_ZERO(&cpuset);
    CPU_SET(cpu0, &cpuset);
    CPU_SET(cpu6, &cpuset);

    double array[8]; // 8 doubles = 64 bytes, the size of one cache line

    for(int i=0; i<8; i++)
            array[i] = drand48();

    arg_t *arg0 = malloc(sizeof(arg_t));
    arg_t *arg1 = malloc(sizeof(arg_t));

    arg0->index = 0; arg0->array_ptr = array;       
    arg1->index = 1; arg1->array_ptr = array;


    clock_gettime(CLOCK_REALTIME, &tstart);

    pthread_create(&tid[0], NULL, work, (void*)arg0);
    pthread_create(&tid[1], NULL, work, (void*)arg1);

    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    clock_gettime(CLOCK_REALTIME, &tend);

    // report the elapsed wall-clock time (previously measured but never printed)
    double elapsed = (tend.tv_sec - tstart.tv_sec)
                   + (tend.tv_nsec - tstart.tv_nsec) / 1e9;
    printf("elapsed: %f seconds\n", elapsed);

    free(arg0);
    free(arg1);

    return 0;
}

I am using GCC 10.2.0, compiling with gcc -pthread p.c -o p
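(A side note on the build: since there is no -O flag, every ptr[array_index]++ really is a load plus a store to the shared cache line. With -O2 the compiler would be allowed to keep the value in a register for the whole loop and store it only once, which would mostly hide the False Sharing. If someone wanted an optimized build, a variant like the sketch below, with the pointer marked volatile, should keep the per-iteration memory traffic; this is just an illustration, not what produced the numbers further down, and work_volatile is a made-up name.)

// sketch: same inner loop, but the volatile qualifier forces one load and one
// store to the array element on every iteration, even under -O2
// (affinity setup omitted; reuses the arg_t struct from the listing above)
void *work_volatile(void *arg)
{
    int array_index = ((arg_t*)arg)->index;
    volatile double *ptr = ((arg_t*)arg)->array_ptr;

    for (unsigned long i = 0; i < 1000000000UL; i++)
        ptr[array_index]++;

    return NULL;
}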

Then I run perf record ./p --cpu=0,6, or the same thing with --cpu=0,11, when using threads 0,6 and 0,11 respectively.

Then I run perf stat -d ./p --cpu=0,6, or the same with --cpu=0,11 for the other case.

Running with Threads 0 and 6:

Performance counter stats for './p --cpu=0,6':

           9437,29 msec task-clock                #    1,997 CPUs utilized          
                64      context-switches          #    0,007 K/sec                  
                 2      cpu-migrations            #    0,000 K/sec                  
               912      page-faults               #    0,097 K/sec                  
       39569031046      cycles                    #    4,193 GHz                      (75,00%)
        5925158870      stalled-cycles-frontend   #   14,97% frontend cycles idle     (75,00%)
        2300826705      stalled-cycles-backend    #    5,81% backend cycles idle      (75,00%)
       24052237511      instructions              #    0,61  insn per cycle         
                                                  #    0,25  stalled cycles per insn  (75,00%)
        2010923861      branches                  #  213,083 M/sec                    (75,00%)
            357725      branch-misses             #    0,02% of all branches          (75,03%)
       16930828846      L1-dcache-loads           # 1794,034 M/sec                    (74,99%)
         396121055      L1-dcache-load-misses     #    2,34% of all L1-dcache accesses  (74,96%)
   <not supported>     LLC-loads                                                   
   <not supported>     LLC-load-misses                                             

       4,725786281 seconds time elapsed

       9,429749000 seconds user
       0,000000000 seconds sys 

Running with Threads 0 and 11:

 Performance counter stats for './p --cpu=0,11':

          18693,31 msec task-clock                #    1,982 CPUs utilized          
               114      context-switches          #    0,006 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
               903      page-faults               #    0,048 K/sec                  
       78404951347      cycles                    #    4,194 GHz                      (74,97%)
        1763001213      stalled-cycles-frontend   #    2,25% frontend cycles idle     (74,98%)
       71054052070      stalled-cycles-backend    #   90,62% backend cycles idle      (74,98%)
       24055983565      instructions              #    0,31  insn per cycle         
                                                  #    2,95  stalled cycles per insn  (74,97%)
        2012326306      branches                  #  107,650 M/sec                    (74,96%)
            553278      branch-misses             #    0,03% of all branches          (75,07%)
       15715489973      L1-dcache-loads           #  840,701 M/sec                    (75,09%)
         118455010      L1-dcache-load-misses     #    0,75% of all L1-dcache accesses  (74,98%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

       9,430223356 seconds time elapsed

      18,675328000 seconds user
       0,000000000 seconds sys

1 Answer

Answer by rjhcnf:

Perf does not yet fully support detailed analysis of cache behavior on AMD (it does partially from kernel 6.1 onward). You can use AMD's uProf to investigate in more detail: https://www.amd.com/en/developer/uprof.html