Transposing a global 2D square matrix of 1 GB with a tiled (cache-aware) approach shows no performance gain over the normal transpose in single-threaded execution. I am not asking about speeding up the transpose with AVX/SSE (SIMD) or any cache-oblivious transpose algorithm (http://supertech.csail.mit.edu/papers/FrigoLePr12.pdf).
#include <stdio.h>
#include <sys/time.h>

#define SIZE 16384

float a[SIZE][SIZE], b[SIZE][SIZE];

void testNormalTranspose() {
    int i, j;
    b[0][9999] = 1.0;  /* touch b so the compiler cannot optimize the work away */
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            a[i][j] = b[j][i];
}

void testTiledTranspose() {
    int i, j;
    int blocksize = 16;
    b[0][9999] = 1.0;
    for (i = 0; i < SIZE; i += blocksize) {
        for (j = 0; j < SIZE; j += blocksize) {
            /* transpose one blocksize x blocksize tile */
            for (int ii = i; ii < i + blocksize; ++ii) {
                for (int jj = j; jj < j + blocksize; ++jj) {
                    a[ii][jj] = b[jj][ii];
                }
            }
        }
    }
}

int main()
{
    struct timeval t1, t2;
    /*
    gettimeofday(&t1, NULL);
    testNormalTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Normal transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    */
    gettimeofday(&t1, NULL);
    testTiledTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Tiled transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    printf("%f\n", a[9999][0]);  /* print a result so the transpose is not dead code */
    return 0;
}
Loop tiling helps when the data is being reused. If you use an element SIZE times, you are better off using it all SIZE times and only then proceeding to the next element.
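To make that concrete, here is a hypothetical case where every loaded element really is reused: a blocked matrix multiply. This is only an illustrative sketch, not code from the question; the N, BS, and matmul_tiled names are mine, and BS must divide N evenly.

#define N  1024   /* illustrative size; BS must divide N */
#define BS 32

static float A[N][N], B[N][N], C[N][N];  /* static arrays start zeroed */

/* For each (i, j, k) tile the working set is three BS x BS blocks,
   and every loaded element of A, B and C is reused BS times before
   the loops move on. This is the reuse that a transpose lacks. */
void matmul_tiled(void) {
    for (int i = 0; i < N; i += BS)
        for (int j = 0; j < N; j += BS)
            for (int k = 0; k < N; k += BS)
                for (int ii = i; ii < i + BS; ++ii)
                    for (int kk = k; kk < k + BS; ++kk) {
                        float aik = A[ii][kk];  /* reused across the whole jj loop */
                        for (int jj = j; jj < j + BS; ++jj)
                            C[ii][jj] += aik * B[kk][jj];
                    }
}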
Unfortunately, when transposing a 2D matrix you are not reusing any elements of either matrix a or matrix b. Even worse, since the loop mixes row and column access (i.e. a[i][j] = b[j][i]), you will never get unit-stride memory access on both arrays at the same time, only on one of them.
So in this case tiling is not that efficient, but you might still see some performance improvement even with this "random" memory access, provided that the memory footprint of the accessed tiles fits into the cache of your system. Basically, this means you have to choose the blocksize carefully, and the 16 you have chosen in the example might work better on one system and worse on another.
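One simple way to pick it is to sweep a few candidate block sizes and time each run. Here is a minimal sketch that reuses the question's globals a, b and SIZE; the tiled_transpose and sweep_block_sizes names and the candidate list are mine, and each candidate is assumed to divide SIZE evenly.

/* Tiled transpose with the block size as a parameter. */
void tiled_transpose(int blocksize) {
    for (int i = 0; i < SIZE; i += blocksize)
        for (int j = 0; j < SIZE; j += blocksize)
            for (int ii = i; ii < i + blocksize; ++ii)
                for (int jj = j; jj < j + blocksize; ++jj)
                    a[ii][jj] = b[jj][ii];
}

/* Time the tiled transpose for several candidate block sizes. */
void sweep_block_sizes(void) {
    static const int candidates[] = {4, 8, 16, 32, 64, 128};
    struct timeval t1, t2;
    for (size_t k = 0; k < sizeof candidates / sizeof candidates[0]; ++k) {
        gettimeofday(&t1, NULL);
        tiled_transpose(candidates[k]);
        gettimeofday(&t2, NULL);
        printf("blocksize %3d: %ld ms\n", candidates[k],
               (t2.tv_sec - t1.tv_sec) * 1000 +
               (t2.tv_usec - t1.tv_usec) / 1000);
    }
}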
Here are the results from my computer for different power-of-2 block sizes and SIZE 4096 (timing table not reproduced here): as you can see, the version with blocksize 8 works the best for me, and it almost doubles the performance. I also measured SIZE 4131 with power-of-3 block sizes (table likewise not reproduced).

Regarding the 16384 size issue: I cannot reproduce it, i.e. I still see the same gain for the big matrix. Just note that 16384 * 16384 * sizeof(float) is 1 GB per array, so a and b together take 2 GB, which might expose some system issues...
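If the 2 GB of static storage is what causes trouble on your system, one thing you could try (my suggestion, not part of the original test) is allocating the matrices on the heap, so that an allocation failure is at least detectable. A minimal sketch:

#include <stdio.h>
#include <stdlib.h>

#define SIZE 16384

int main(void) {
    /* 1 GB per matrix; calloc zero-fills, matching the static arrays' initial state */
    float (*a)[SIZE] = calloc(SIZE, sizeof *a);
    float (*b)[SIZE] = calloc(SIZE, sizeof *b);
    if (!a || !b) {
        fprintf(stderr, "could not allocate 2 GB for the matrices\n");
        return 1;
    }
    /* ... run the normal and tiled transpose benchmarks on a and b here ... */
    free(b);
    free(a);
    return 0;
}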