I am looking for a way to speed up my program, which performs a lot of matrix multiplications, so I have replaced the CLAPACK f2c libraries with the MKL. Unfortunately, the performance results were not what I expected.
After investigation, I found that a block triangular matrix gives bad performance, mainly when I multiply it by its transpose.
To simplify the problem, I ran my tests with an identity matrix of size 5000 (5000*5000), which shows the same behavior.
Test | Matrix size | CLAPACK f2c (seconds) | MKL_GNU_THREAD (seconds) |
---|---|---|---|
Multiplication of an identity matrix by itself | 5000*5000 | 0.076536 | 1.090167 |
Multiplication of dense matrix by its transpose | 5000*5000 | 93.71569 | 1.113872 |
- The CLAPACK f2c multiplication of an identity matrix is about 14 times faster than the MKL one.
- For the dense matrix multiplication, the MKL is about 84 times faster than CLAPACK f2c.
Moreover, with the MKL, the time taken by dense*denseT and by the identity matrix multiplication is almost the same.
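To put these timings on a common scale, here is a small sketch that converts a run time into a nominal throughput. It assumes the usual count of 2*n^3 floating-point operations for a dense n*n product; the function name `gemm_gflops` is only illustrative. With the figures above it gives roughly 224 GFLOP/s for the MKL dense case and about 2.7 GFLOP/s for CLAPACK f2c, and the MKL rate on the identity matrix comes out nearly the same, which is consistent with the MKL doing the full dense work whatever the values are.

```c
#include <assert.h>

/* Nominal throughput of an n*n by n*n dense product, in GFLOP/s.
   Assumes 2*n^3 floating-point operations (one multiply and one add
   per inner-loop step); this is a convention, not a measurement. */
double gemm_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

For example, `gemm_gflops(5000, 1.113872)` corresponds to the MKL dense measurement in the table.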
So I looked in the CLAPACK f2c DGEMM for the optimization that applies to the multiplication of a sparse matrix, and I found a test on zero values:
/* Form C := alpha*A*B + beta*C. */
i__1 = *n;
for (j = 1; j <= i__1; ++j) {
    if (*beta == 0.) {
        i__2 = *m;
        for (i__ = 1; i__ <= i__2; ++i__) {
            c__[i__ + j * c_dim1] = 0.;
            /* L50: */
        }
    } else if (*beta != 1.) {
        i__2 = *m;
        for (i__ = 1; i__ <= i__2; ++i__) {
            c__[i__ + j * c_dim1] = *beta * c__[i__ + j * c_dim1];
            /* L60: */
        }
    }
    i__2 = *k;
    for (l = 1; l <= i__2; ++l) {
        if (b[l + j * b_dim1] != 0.) { /* HERE IS THE CONDITION */
            temp = *alpha * b[l + j * b_dim1];
            i__3 = *m;
            for (i__ = 1; i__ <= i__3; ++i__) {
                c__[i__ + j * c_dim1] += temp * a[i__ + l * a_dim1];
                /* L70: */
            }
        } /* END of the condition */
        /* L80: */
    }
    /* L90: */
}
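The effect of that test can be illustrated by counting the multiply-adds of the same loop nest with and without the skip. This is a minimal sketch, not the f2c code itself: it uses 0-based indices, fixes beta to 0, and the name `gemm_count` and the `skip_zeros` flag are made up for the illustration. For an n*n identity matrix the count drops from n^3 to n^2 when zeros are skipped, which matches the order of magnitude of the timings I measured.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major C := alpha*A*B (beta = 0), same loop order as the excerpt
   above. Returns the number of multiply-adds actually performed.
   skip_zeros != 0 enables the test on zero entries of B. */
long long gemm_count(int m, int n, int k, double alpha,
                     const double *a, int lda,
                     const double *b, int ldb,
                     double *c, int ldc, int skip_zeros)
{
    long long madds = 0;
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i)
            c[i + (size_t)j * ldc] = 0.0;
        for (int l = 0; l < k; ++l) {
            double blj = b[l + (size_t)j * ldb];
            if (skip_zeros && blj == 0.0)
                continue; /* whole column update of C is skipped */
            double temp = alpha * blj;
            for (int i = 0; i < m; ++i) {
                c[i + (size_t)j * ldc] += temp * a[i + (size_t)l * lda];
                ++madds;
            }
        }
    }
    return madds;
}
```

Both variants give the same C as long as A contains only finite values; if A holds Inf or NaN, skipping a zero in B changes the result, since 0 * NaN is NaN.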
When I removed this condition, I got the following results:
Test | Matrix size | CLAPACK f2c (seconds) | MKL_GNU_THREAD (seconds) |
---|---|---|---|
Multiplication of an identity matrix by itself | 5000*5000 | 93.210873 | 1.090167 |
Multiplication of dense matrix by its transpose | 5000*5000 | 93.71569 | 1.113872 |
- Here the dense and identity multiplications are very close in terms of performance, and the MKL is now the fastest in both cases.
- The MKL multiplication is faster than CLAPACK f2c, but only when both libraries process the same number of non-zero elements.
I have two hypotheses about these results:
- The optimization on zeros is not activated by default in the MKL.
- The MKL cannot see the 0 (double) values inside my sparse matrices.
Could you tell me why the MKL shows these performance issues? Do you have any tips to skip the multiplication on zero elements with dgemm?
I did a conversion to CSR and it shows better performance, but in this case why is lapacke_dgemm worse than f2c_dgemm?
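For reference, this is the kind of computation I mean by the CSR approach, written as a minimal plain C sketch rather than with the MKL sparse routines. The function name `csr_dense_mm` and the row-major layout of the dense operands are assumptions made for the example. Only the stored non-zero entries of the sparse matrix are visited, so the cost follows the number of non-zeros instead of the full matrix size.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n, row-major) := A (m x k, CSR with 0-based indices) * B (k x n, row-major).
   row_ptr has m + 1 entries; col_idx and val hold the non-zeros of A row by row. */
void csr_dense_mm(int m, int n,
                  const int *row_ptr, const int *col_idx, const double *val,
                  const double *b, double *c)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m; ++i) {
        double *crow = c + (size_t)i * n;
        for (int j = 0; j < n; ++j)
            crow[j] = 0.0;
        /* Visit only the stored entries of row i of A. */
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
            double v = val[p];
            const double *brow = b + (size_t)col_idx[p] * n;
            for (int j = 0; j < n; ++j)
                crow[j] += v * brow[j];
        }
    }
}
```

For an identity matrix of size n this performs n^2 multiply-adds, the same count as the f2c loop with its zero test.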
Thank you for your help :)
MKL_VERBOSE Intel(R) MKL 2021.0 Update 1 Product build 20201104 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.50GHz lp64 gnu_thread