I am learning cache operations with regard to spatial locality. (My references so far are Principles of Parallel Programming by Lin and Snyder, this tutorial, and of course Wikipedia.)
Take the following example, compiled with gcc and run on Windows 7 Professional on an Intel Core 2 Duo CPU (L7500).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int *array;
    int length;
    int count;
    int range;
    int i;

    // generate an array of a million integers between 0 and 99
    length = 1000000;
    range = 100;
    array = calloc(length, sizeof(int));
    srand((unsigned)time(NULL));
    for (i = 0; i < length; i++)
    {
        array[i] = rand() % range;
        // printf("%d\n", array[i]);
    }

    // count the number of occurrences of 3 in the array
    count = 0;
    for (i = 0; i < length; i++)
    {
        if (array[i] == 3)
        {
            count++;
        }
    }
    printf("count = %6d\n", count);

    free(array);
    return 0;
}
Now in the second half of the routine, the entire array of integers is going to be read, so per spatial locality the CPU should load them into the cache ahead of the loop body that needs them. But how much of the array can/does/should it load into the cache at any one time during the loop? One cache line at a time (64 bytes / 4 bytes per int = 16 integers), larger blocks of it, or the entire array in one fell swoop?
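For concreteness, here is the sort of experiment I imagine would expose the cache-line granularity: read every int versus one int per line. This is only an illustrative sketch, not something I have measured; it assumes 64-byte lines, uses an array sized to exceed the L2 cache, and times with plain clock().

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LENGTH (16 * 1000000)   /* 64 MB of ints -- larger than the L2 cache */
#define INTS_PER_LINE 16        /* assumes 64-byte lines / 4-byte ints */

int main(void)
{
    int *array = calloc(LENGTH, sizeof(int));
    volatile long sink = 0;     /* keeps the compiler from deleting the loops */
    clock_t start;
    int i;

    if (array == NULL)
        return 1;

    /* read every element: all 16 ints in each cache line are consumed */
    start = clock();
    for (i = 0; i < LENGTH; i++)
        sink += array[i];
    printf("all ints:     %ld ticks\n", (long)(clock() - start));

    /* read one element per line: 1/16th of the reads, but the same
       number of cache lines still has to travel from RAM */
    start = clock();
    for (i = 0; i < LENGTH; i += INTS_PER_LINE)
        sink += array[i];
    printf("one per line: %ld ticks\n", (long)(clock() - start));

    free(array);
    return 0;
}

If memory traffic dominates, the second loop should come in much closer to the first than its 16x reduction in instructions would suggest.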
Also, from what I understand, the latency involved in loading data from RAM into the cache (or per the textbook, from non-local to local memory) can be much more significant than the time required to actually run the routine. True?
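One way I could test that is to traverse the same array twice, once in order and once in a scattered order, so the work per element is identical and only the access pattern changes. A hedged sketch (the co-prime stride trick here is just one way to scatter the visits; STRIDE and LENGTH are my own illustrative values):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LENGTH (16 * 1000000)   /* 64 MB of ints, well beyond the L2 cache */
#define STRIDE 15485863         /* prime and co-prime with LENGTH, so every
                                   element is visited exactly once */

int main(void)
{
    int *array = calloc(LENGTH, sizeof(int));
    volatile long sink = 0;     /* defeats dead-code elimination */
    size_t idx = 0;
    clock_t start;
    int i;

    if (array == NULL)
        return 1;

    /* sequential traversal: spatial locality and the hardware
       prefetcher keep the core fed */
    start = clock();
    for (i = 0; i < LENGTH; i++)
        sink += array[i];
    printf("sequential: %ld ticks\n", (long)(clock() - start));

    /* scattered traversal: same number of reads, but each jump is
       about 2 MB, so nearly every access misses the cache */
    start = clock();
    for (i = 0; i < LENGTH; i++) {
        idx = (idx + STRIDE) % LENGTH;
        sink += array[idx];
    }
    printf("scattered:  %ld ticks\n", (long)(clock() - start));

    free(array);
    return 0;
}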
Now say we move this code to a multiprocessor/multicore machine, and the counting portion of the code is changed to run in 4, 8, 16, etc. parallel threads (using pthreads), counting separate portions of the array, then adding the private counts together at the end. Could this cause multiple separate occurrences of RAM-to-cache latency, making the parallel version run slower than the serial one?
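The parallel version I have in mind would look roughly like this; the thread count, the chunk struct, and the count_threes helper are just illustrative names, and LENGTH is assumed to divide evenly by NTHREADS.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

#define LENGTH   1000000
#define NTHREADS 4              /* illustrative; match to the core count */

static int array[LENGTH];

struct chunk {
    int start;                  /* first index this thread scans */
    int end;                    /* one past the last index */
    int count;                  /* private result, read by main after join */
};

static void *count_threes(void *arg)
{
    struct chunk *c = arg;
    int i, n = 0;               /* accumulate in a local variable... */
    for (i = c->start; i < c->end; i++)
        if (array[i] == 3)
            n++;
    c->count = n;               /* ...and write once, avoiding false sharing
                                   between the adjacent chunk structs */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];
    int i, total = 0;

    srand((unsigned)time(NULL));
    for (i = 0; i < LENGTH; i++)
        array[i] = rand() % 100;

    /* contiguous slices, so each core streams through its own region */
    for (i = 0; i < NTHREADS; i++) {
        chunks[i].start = i * (LENGTH / NTHREADS);
        chunks[i].end   = (i + 1) * (LENGTH / NTHREADS);
        pthread_create(&tid[i], NULL, count_threes, &chunks[i]);
    }
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += chunks[i].count;
    }
    printf("count = %6d\n", total);
    return 0;
}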
Yes, memory speed and latency dominate many algorithms, and using the memory cache as efficiently as possible is necessary to speed them up.
Running in parallel can hurt your performance, though usually it does not; figuring out which case you are in requires a lot of testing and tweaking.
For example, take a quad-core chip connected to one bank of RAM. If the algorithm is limited by memory-read bandwidth, that is, the computation always finishes faster than RAM can deliver data, then running in parallel gains nothing and will probably slow things down.
But on a dual-socket system, each CPU can have its own memory controller and local RAM, so the algorithm can speed up when the threads are spread across both sockets.
Or the system might be upgraded from one bank of RAM to four, switching from a single-channel to a quad-channel configuration. At that point RAM bandwidth may exceed the computation speed, and the quad-core chip will benefit from running more threads.
In my opinion, running a thread per core will usually benefit you and will take advantage of system upgrades. Running a single thread may avoid a small amount of synchronization overhead, but will always limit the program in the future.
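If you want a thread per core without hard-coding the number, you can query it at run time. A small sketch, assuming a POSIX-like environment; _SC_NPROCESSORS_ONLN is a widely supported extension rather than strict POSIX, and on Windows you would use GetSystemInfo instead.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* number of cores currently online; fall back to one if unknown */
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores < 1)
        ncores = 1;
    printf("sizing the pool at %ld worker threads\n", ncores);
    /* ...create ncores threads as in the counting example above... */
    return 0;
}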