I need to write a matrix multiplication implementation that performs better than the naive method. Here is what I have done so far:
1- Removed false dependencies, which improved performance a lot.
2- Used a recursive approach.
The next thing I want to try is loop unrolling. The problem is that every time I use it, performance gets worse, and I can't find an explanation for it. I need help. Here is the code:
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++) {
        double sum = 0;
        #pragma unroll(5)
        for (k = 0; k < K; k++)
        {
            sum += A[i + k*LDA] * B[k + j*LDB];
        }
        C[i + j*LDC] = sum;
    }
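
For comparison, here is a minimal sketch of what a manual unroll of the k loop could look like, assuming the same column-major layout as the loop above. The function name matmul_unrolled4, the unroll factor of 4, and the use of four separate partial sums are illustrative assumptions, not part of the original code. The idea being sketched is that unrolling into a single `sum` still leaves every addition waiting on the previous one, whereas independent accumulators give the CPU separate chains to overlap. Whether this actually helps depends on the compiler, flags, and matrix sizes, and has not been measured here.

```c
#include <stddef.h>

/* Hypothetical helper: C (M x N) = A (M x K) * B (K x N).
   All matrices are column-major with leading dimensions LDA, LDB, LDC,
   matching the indexing used in the loop above. */
void matmul_unrolled4(size_t M, size_t N, size_t K,
                      const double *A, size_t LDA,
                      const double *B, size_t LDB,
                      double *C, size_t LDC)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            /* Four independent partial sums, so consecutive additions
               do not all depend on one accumulator. */
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            size_t k = 0;

            /* Main body: four k iterations per pass. */
            for (; k + 4 <= K; k += 4) {
                s0 += A[i + (k + 0) * LDA] * B[(k + 0) + j * LDB];
                s1 += A[i + (k + 1) * LDA] * B[(k + 1) + j * LDB];
                s2 += A[i + (k + 2) * LDA] * B[(k + 2) + j * LDB];
                s3 += A[i + (k + 3) * LDA] * B[(k + 3) + j * LDB];
            }

            double sum = (s0 + s1) + (s2 + s3);

            /* Remainder loop for when K is not a multiple of 4. */
            for (; k < K; k++)
                sum += A[i + k * LDA] * B[k + j * LDB];

            C[i + j * LDC] = sum;
        }
    }
}
```

Note that the partial sums change the order of the floating-point additions, so results can differ from the single-accumulator loop in the last bits. Also, A is still read with a stride of LDA in the inner loop, which unrolling does nothing to fix.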