Lots of cache misses, sparse matrix multiplication


Greetings dear earthlings!

This is the overview:

I am doing a sparse-output matrix multiplication: I have two dense matrices, but I am only interested in some elements of the output. I have the two matrices and one 2D index array:

float *A;   // [M*K], row-major
float *B;   // [N*K], row-major
int   *idx; // [M*nSelection], flattened (that is, there are nSelection rows of B, per row of matrix A, that I want to multiply)

and I want to compute:

out = A*B' // but of course only for the indices (i,j) where idx[i][k] == j for some k
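
Spelled out as plain C, the target computation would be something like this (a minimal reference sketch, not my optimized version; out is an M×nSelection array holding just the selected dot products, matching the declarations above):

/* Reference (unblocked): out[i][j] = dot(row i of A, row idx[i][j] of B) */
void selective_matmul(const float *A, const float *B, const int *idx,
                      float *out, int M, int K, int nSelection)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < nSelection; j++) {
            int row = idx[i * nSelection + j];   /* selected row of B */
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[row * K + k];
            out[i * nSelection + j] = acc;
        }
    }
}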

This is the crux of the matter: I want to keep the data in the CPU cache while multiplying, so I multiply 16 floats at a time:

for (int kk = 0; kk < K; kk += KK) {   /* KK = 16 floats = one cache line */
    int begin = 0;
    int end = M;
    for (int i = begin; i < end; i++) {   /* walk A sequentially, row by row */
        for (int j = 0; j < numberToAccept; j++) {   /* numberToAccept == nSelection */
            int theid = idx[i * numberToAccept + j];  /* selected row of B */
            /* accumulate a KK-wide partial dot product into the output */
            tempOut[i * numberToAccept + j] +=
                blas_function_for_dotProduct(KK, A + i * K + kk, 1,
                                             B + theid * K + kk, 1);
        }
    }
}
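
For reference, blas_function_for_dotProduct above is a placeholder; with a real CBLAS implementation (e.g. OpenBLAS) it would map onto cblas_sdot. A self-contained sketch of one chunk pass under that assumption:

#include <cblas.h>   /* link with -lopenblas or -lblas */

enum { KK = 16 };    /* 16 floats = 64 bytes = one cache line */

/* One kk-chunk pass: add the KK-wide partial dot products into tempOut. */
void chunk_pass(const float *A, const float *B, const int *idx,
                float *tempOut, int M, int K, int nSelection, int kk)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < nSelection; j++) {
            int theid = idx[i * nSelection + j];
            tempOut[i * nSelection + j] +=
                cblas_sdot(KK, A + i * K + kk, 1, B + theid * K + kk, 1);
        }
    }
}

Calling chunk_pass for kk = 0, KK, 2*KK, ..., K-KK reproduces the loop above.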

The numbers I am working with are:

M = 2048; N = 10240; K = 4096; nSelection = 100;

So the idea is that if I traverse the matrices little piece by little piece (in the K direction, in chunks of KK = 16 floats = 1 cache line) I will be able to load both matrices only once. That is, matrix A is read sequentially, so other than the first load of each row, all loads should be hits. Matrix B is read in a random order, but because I process the whole thing in chunks of 64 bytes (16 floats), one chunk of it only requires N × 64 bytes = 640 KB. I have a 6 MB L3 (last-level) cache, so it should stay there if the CPU uses LRU.
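
A quick sanity check on that working set (assuming 64-byte cache lines and 4-byte floats):

    B, one kk chunk: 10,240 rows * 64 B/row = 655,360 B = 640 KB
    A, one kk chunk: streamed sequentially, ~1 line live per row

640 KB is well under 6 MB, so capacity alone should not evict the chunk of B.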

I use valgrind (cachegrind) to look at the cache misses, and this is what I get:

==3264== D   refs:      2,383,524,642  (2,272,927,073 rd   + 110,597,569 wr)
==3264== D1  misses:      114,096,428  (  113,982,278 rd   +     114,150 wr)
==3264== LLd misses:       95,822,173  (   95,736,938 rd   +      85,235 wr)
==3264== D1  miss rate:           4.7% (          5.0%     +         0.1%  )
==3264== LLd miss rate:           4.0% (          4.2%     +         0.0%  )
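
(For reproduction: the post doesn't show the exact command, but a typical cachegrind invocation would be something like the following, with ./myprog standing in for the actual binary and --LL optionally pinned to the 6 MB, 12-way geometry:)

valgrind --tool=cachegrind ./myprog
valgrind --tool=cachegrind --LL=6291456,12,64 ./myprog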

I also get the same result if the access pattern is predictable (counting up).

I get the same result with N=512, but at N=256 I suddenly get cache hits (still with random access):

==16546== D   refs:      2,383,525,914  (2,272,928,002 rd   + 110,597,912 wr)
==16546== D1  misses:      114,096,557  (  113,982,392 rd   +     114,165 wr)
==16546== LLd misses:        1,372,862  (    1,287,624 rd   +      85,238 wr)
==16546== D1  miss rate:           4.7% (          5.0%     +         0.1%  )
==16546== LLd miss rate:           0.0% (          0.0%     +         0.0%  )

The thing is that I have 6 MB of cache on my mobile Core i7, and 512 * 16 * sizeof(float) is only 32 KB. So why am I getting so many misses? (I wrote most of this last night; today I found the answer.)


1 Answer

amir (accepted answer):

It's the associativity of the cache. My L3 is 12-way set associative. I haven't done the exact calculations, but the picture is: with K = 4096, the rows of B are exactly 16 KB apart, so the chunk at offset kk of every row maps to the same small group of cache sets, and only 12 lines fit in each set. Changing K to 4*1024+16 (padding each row by one cache line) staggers the rows across sets and gives me hits all the way!
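
To make the fix concrete, here is my reconstruction (a sketch, not the poster's code): pad the row stride of B by one cache line, so that offset kk of successive rows no longer piles onto the same cache sets.

#include <stdlib.h>

enum { K = 4096, PAD = 16, KSTRIDE = K + PAD };   /* floats; PAD = one 64 B line */

/* Rows are now 16 KB + 64 B apart instead of exactly 16 KB, so consecutive
   rows of B map to different cache sets instead of competing for one. */
float *alloc_padded(int rows)
{
    /* 64-byte alignment keeps each 16-float chunk within one cache line;
       the size is a multiple of 64, as aligned_alloc (C11) requires. */
    return (float *)aligned_alloc(64, (size_t)rows * KSTRIDE * sizeof(float));
}

/* Element (r, k) of B is then B[r * KSTRIDE + k]; the PAD floats at the
   end of each row are simply never read. */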