Greetings dear earthlings!
This is the overview:
I am doing a matrix multiplication where both inputs are dense, but I am only interested in some elements of the output (so the output is sparse). I have the two matrices and one 2D index array:
float *A: [M*K]
float *B: [N*K]
int *idx: [M*nSelection] (that is, for each row of A there are nSelection rows of B that I want to multiply it with)
and I want to compute:
out = A*B' // but of course only for the indices (i, j) where idx[i][k] = j for some k.
This is the crux of the matter: I want to keep the data in CPU cache while I am multiplying, so I process 16 floats at a time:
for (int kk = 0; kk < K; kk += KK) {  // KK = 16 floats = one cache line
    int begin = 0;
    int end = M;
    for (int i = begin; i < end; i++) {
        for (int j = 0; j < nSelection; j++) {
            int theid = idx[i * nSelection + j];
            tempOut[i * nSelection + j] +=
                blas_function_for_dotProduct(KK, A + i * K + kk, 1, B + theid * K + kk, 1);
        }
    }
}
The numbers I am working with are:
M = 2048; N=10240; K=4096; nSelection=100;
So the idea is that if I process the matrices in small pieces (along K, in chunks of KK = 16 floats = one 64-byte cache line) I should be able to load both matrices from memory only once. Matrix A is read sequentially, so apart from the first load of each row, all loads should be hits. Matrix B is read in a random order, but because I process the whole thing one 64-byte chunk (16 floats) per row at a time, those chunks only take N * 64 bytes = 640 KB. I have 6 MB of L3 cache, so it should all stay resident if the CPU uses LRU.
I use valgrind (cachegrind) to look at the cache misses, and this is what I get:
==3264== D refs: 2,383,524,642 (2,272,927,073 rd + 110,597,569 wr)
==3264== D1 misses: 114,096,428 ( 113,982,278 rd + 114,150 wr)
==3264== LLd misses: 95,822,173 ( 95,736,938 rd + 85,235 wr)
==3264== D1 miss rate: 4.7% ( 5.0% + 0.1% )
==3264== LLd miss rate: 4.0% ( 4.2% + 0.0% )
I also get the same result if the access is predictable (counting up).
I get the same result with N=512, but at N=256 I suddenly get cache hits (still with random access):
==16546== D refs: 2,383,525,914 (2,272,928,002 rd + 110,597,912 wr)
==16546== D1 misses: 114,096,557 ( 113,982,392 rd + 114,165 wr)
==16546== LLd misses: 1,372,862 ( 1,287,624 rd + 85,238 wr)
==16546== D1 miss rate: 4.7% ( 5.0% + 0.1% )
==16546== LLd miss rate: 0.0% ( 0.0% + 0.0% )
The thing is that I have 6 MB of cache on my mobile Core i7, and 512 * 16 * sizeof(float) is only 32 KB. So why am I getting so many misses? (I wrote most of this last night; today I found the answer.)
It's the associativity of the cache. My L3 is 12-way associative. I haven't done the calculations, but changing K to 4*1024+16 gives me hits all the way!