I understand that in symmetric multiprocessor (SMP) systems, false sharing may occur because each core has its own cache, as in the following code from http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads:
double sum=0.0, sum_local[NUM_THREADS];
#pragma omp parallel num_threads(NUM_THREADS)
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;

    #pragma omp for
    for (i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];

    #pragma omp atomic
    sum += sum_local[me];
}
So my questions are:
- False sharing mainly comes from the fact that memory is accessed in chunks of a fixed number of bytes, and that each processor has its own cache. Each chunk is read and written as a unit. Is my understanding close to the truth?
- What about the memory access pattern on a GPU? Do all the stream processors share one cache, or does each have its own? Is false sharing also a concern in GPU computing?
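It's not that the cache line is read or written together; it's that when any CPU writes any byte in a cache line, the whole line is invalidated in the other CPUs' caches. In the code above, each thread's update to sum_local[me] therefore forces the other threads to reload the line, even though they touch different elements.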
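To make that concrete, here is a minimal sketch of the usual workaround: pad each thread's partial sum so that no two of them can land in the same cache line. The 64-byte line size, the `padded_double` type and the `dot_padded` function are my own assumptions for illustration, not something from the Intel article; check the actual line size of your CPU.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define NUM_THREADS 4
#define CACHE_LINE 64   /* assumed cache line size in bytes; hardware dependent */

/* Hypothetical helper type: one partial sum padded out to a full line.
   Consecutive values sit CACHE_LINE bytes apart, so two of them can
   never fall in the same line, whatever the array's alignment. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

/* Dot product of x and y (n elements) with per-thread padded partial sums. */
double dot_padded(const double *x, const double *y, int n)
{
    padded_double sum_local[NUM_THREADS];
    double sum = 0.0;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = 0;   /* stays 0 when compiled without OpenMP */
#ifdef _OPENMP
        me = omp_get_thread_num();
#endif
        assert(me < NUM_THREADS);
        sum_local[me].value = 0.0;

        #pragma omp for
        for (int i = 0; i < n; i++)
            sum_local[me].value += x[i] * y[i];   /* each thread writes its own line */

        #pragma omp atomic
        sum += sum_local[me].value;
    }
    return sum;
}
```

In this particular case a plain local variable inside the parallel region (or `reduction(+:sum)` on the loop) would avoid the shared array altogether, which is probably the simpler fix; the padding version just shows the layout idea directly.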
It depends on the GPU. At least for some NVIDIA GPUs, the L1 cache (for global memory) is NOT coherent, so you have other problems. With L1 disabled, you can have the problem in the L2 cache, which is coherent.