I coded the following toy benchmark.
#include <stdio.h>
#include <stdlib.h>

double getCPUTime(void);   /* timing helper (not shown), assumed to return seconds */

int main(void)
{
    int N = 1024 * 4096;               /* 4 MiB array */
    unsigned char *ary = malloc(N);
    ary[0] = 1;                        /* rest of the buffer is left uninitialized */
    int stride, i;
    double start, end;
    int sum;
    for (stride = 1; stride < N; ++stride) {
        start = getCPUTime();
        sum = 0;
        for (i = 0; i < N; i += stride) {
            sum += ary[i];
        }
        end = getCPUTime();
        /* average time per read for this stride */
        printf("stride %d time %f sum %d\n",
               stride, (end - start) / (N / stride), sum);
    }
    free(ary);
    return 0;
}
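getCPUTime() is just a timing helper and is not shown above; a minimal stand-in, assuming it only needs to return process CPU time in seconds, could be:

/* Possible stand-in for the timing helper, using POSIX clock_gettime(). */
#include <time.h>

double getCPUTime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}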
Basically, it iterates through an array with different strides. I then plotted the results:
(Results are smoothed)
When the stride is ~128, the loop touches few enough distinct bytes that all of the data being accessed fits in the L1 cache. Given the linearity of the accesses, future reads can presumably be predicted and prefetched.
My question is: why does the average time per read keep rising after that? My reasoning for stride ≈ 128 should also hold for larger strides.
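For concreteness, here is the back-of-the-envelope math behind my stride ≈ 128 claim (a rough sketch, assuming a typical 32 KiB L1 data cache):

/* Footprint math for stride = 128, assuming a 32 KiB L1 data cache. */
#include <stdio.h>

int main(void)
{
    int N = 1024 * 4096;        /* 4 MiB array, as in the benchmark      */
    int stride = 128;
    int reads = N / stride;     /* 32768 iterations of the inner loop    */
    int bytes = reads;          /* one byte per read -> 32 KiB touched   */
    printf("reads: %d, distinct bytes touched: %d KiB\n", reads, bytes / 1024);
    return 0;
}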
Thanks!
Is that the code you used? All it does is read data from 16 MB. I ran it on my PC, where 16 MB comes from RAM, and calculated MB/second: it was 993 at stride 2, dropping to 880 at stride 999. Measuring running time in microseconds, your time calculation produced 0.0040 at stride 2, increasing to 0.0045 at stride 999.
There are all sorts of reasons for speed reductions at larger strides, such as burst reading (a whole cache line is fetched even when only one byte of it is used), cache-line alignment, and different memory banks.
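To make the burst-reading point concrete, here is a rough sketch (assuming 64-byte cache lines; the exact line size depends on the CPU) that estimates how much memory traffic each stride generates per byte the loop actually uses:

/* Estimate memory traffic per useful byte, assuming 64-byte cache lines.
   Below the line size, each fetched line is still partially reused;
   at or above it, every single read drags in a full 64-byte line. */
#include <stdio.h>

#define LINE 64

int main(void)
{
    int N = 1024 * 4096;
    int stride;
    for (stride = 1; stride <= 1024; stride *= 2) {
        long reads = N / stride;                        /* useful bytes read      */
        long lines = (stride >= LINE) ? reads           /* one new line per read  */
                                      : (long)N / LINE; /* whole array streamed   */
        printf("stride %4d: %7ld reads, %6ld lines fetched, %3.0f bytes of traffic per read\n",
               stride, reads, lines, (double)lines * LINE / reads);
    }
    return 0;
}

Past the line size the per-read traffic levels off at 64 bytes, so any further slowdown at larger strides presumably comes from the other effects (prefetching, alignment, different memory banks) rather than raw traffic.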