Getting max FLOPS for dense matrix multiplication with the Xeon Phi Knights Landing


I recently started working with a Xeon Phi Knights Landing (KNL) 7250 computer (http://ark.intel.com/products/94035/Intel-Xeon-Phi-Processor-7250-16GB-1_40-GHz-68-core).

It has 68 cores with AVX-512. The base frequency is 1.4 GHz and the turbo frequency is 1.6 GHz. I don't know the all-core turbo frequency, since the quoted turbo frequency usually applies to only a single core.

Each Knights Landing core can do two 8-wide double-precision FMA operations per cycle. Since each FMA counts as two floating-point operations, that is 2*8*2 = 32 double-precision floating-point operations per cycle per core.

Therefore the theoretical peak for the whole chip is 32*68*1.4 = 3046.4 DP GFLOPS.

For a single core at turbo the peak is 32*1.6 = 51.2 DP GFLOPS.
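
For reference, here is that peak calculation as a tiny C program. It is just the arithmetic above; the two VPUs per core, the 8 doubles per 512-bit vector, and the 2 flops per FMA come from the KNL spec, and the frequencies are the ones quoted above:

#include <stdio.h>

int main(void)
{
    const int vpus_per_core   = 2;   /* two AVX-512 FMA units per KNL core */
    const int doubles_per_vec = 8;   /* 512 bits / 64-bit double           */
    const int flops_per_fma   = 2;   /* one multiply plus one add          */
    const int cores           = 68;

    double flops_per_cycle = vpus_per_core * doubles_per_vec * flops_per_fma; /* 32 */

    printf("chip peak @ 1.4 GHz base:  %.1f DP GFLOPS\n", flops_per_cycle * cores * 1.4);
    printf("core peak @ 1.6 GHz turbo: %.1f DP GFLOPS\n", flops_per_cycle * 1.6);
    return 0;
}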

Dense matrix multiplication is one of the few operations that can actually get close to peak FLOPS. The Intel MKL library provides optimized dense matrix multiplication functions. On a Sandy Bridge system I obtained better than 97% of peak with DGEMM. On Haswell I got about 90% of peak when I checked a few years ago, so it was clearly harder to reach peak with FMA at the time. However, with Knights Landing and MKL I get much less than 50% of peak.

I modified the dgemm_example.c file in the MKL examples directory to calculate the GFLOPS using 2.0*1E-9*n*n*n/time (see below).
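
One caveat with this kind of measurement: the first dgemm call also pays MKL's one-time cost of spinning up its thread pool and internal buffers, so timing a single call can understate the sustained rate. A sketch of how the timing could be hardened, warming up once and keeping the best of a few runs (same cblas_dgemm call as in the full example below, with A, B, C, m, n, k, alpha, beta set up the same way):

/* Warm-up call: absorbs MKL's one-time thread and buffer setup. */
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            m, n, k, alpha, A, k, B, n, beta, C, n);

int rep;
double best = 1e30;
for (rep = 0; rep < 3; rep++) {
    double t = -omp_get_wtime();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
    t += omp_get_wtime();
    if (t < best) best = t;   /* keep the fastest run */
}
printf("best GFLOPS %f\n", 2.0*1E-9*m*n*k/best);

Since beta is 0.0, repeated calls simply recompute C, so the result is unchanged.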

I have also tried export KMP_AFFINITY=scatter and export OMP_NUM_THREADS=68, but that does not appear to make a difference. However, KMP_AFFINITY=compact is significantly slower, and so is OMP_NUM_THREADS=1, so the default affinity already appears to be scatter-like and the threading is working.
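
To confirm what the runtime actually picked up, the standard OpenMP and MKL service functions can be queried directly; setting KMP_AFFINITY=verbose,scatter also makes the Intel OpenMP runtime print the thread-to-core binding at startup. A minimal check:

#include <stdio.h>
#include "mkl.h"
#include "omp.h"

int main(void)
{
    /* Number of threads the OpenMP runtime will hand out, and the
       number MKL plans to use for its internal parallelism. */
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    printf("mkl_get_max_threads() = %d\n", mkl_get_max_threads());
    return 0;
}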

The best I have seen is about 1301 GFLOPS, which is about 43% of peak. For one thread I have seen 38 GFLOPS, which is about 74% of the single-core peak. This tells me that MKL DGEMM is optimized for AVX-512: if it were using only 256-bit vectors, it could not exceed 50% of peak. On the other hand, for a single thread I would expect to get about 90% of peak.

The KNL MCDRAM can operate in three modes (cache, flat, and hybrid), which are set from the BIOS (http://www.anandtech.com/show/9794/a-few-notes-on-intels-knights-landing-and-mcdram-modes-from-sc15). I don't know which mode my (or rather my work's) KNL system is in. Could this have an impact on DGEMM?
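
For what it's worth, in flat mode the MCDRAM appears as a separate NUMA node (visible with numactl --hardware), so one can either run the unmodified binary under numactl --membind to place everything in MCDRAM, or allocate individual matrices there with the memkind library's hbwmalloc interface. A sketch for matrix A, assuming memkind is installed and linking with -lmemkind:

#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory allocator */

/* Put A in MCDRAM when it is exposed (flat/hybrid mode); fall back
   to the ordinary 64-byte-aligned allocation otherwise. Note the
   matching frees differ: hbw_free() vs. mkl_free(). */
double *A = NULL;
if (hbw_check_available() == 0)
    hbw_posix_memalign((void **)&A, 64, (size_t)m * k * sizeof(double));
else
    A = (double *)mkl_malloc(m * k * sizeof(double), 64);

In cache mode nothing needs to change, since MCDRAM transparently caches DDR.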

My question is: why are the FLOPS from DGEMM so low, and what can I do to improve them? Maybe I have not configured MKL optimally (I am using ICC 17.0).

source /opt/intel/mkl/bin/mklvars.sh intel64
icc -O3 -mkl dgemm_example.c 

Here is the code

#define min(x,y) (((x) < (y)) ? (x) : (y))

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
#include "omp.h"

int main()
{
    double *A, *B, *C;
    int m, n, k, i, j;
    double alpha, beta;

    printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n"
            " Intel(R) MKL function dgemm, where A, B, and  C are matrices and \n"
            " alpha and beta are double precision scalars\n\n");

    m = 30000, k = 30000, n = 30000;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
    alpha = 1.0; beta = 0.0;

    printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
    A = (double *)mkl_malloc( m*k*sizeof( double ), 64 );
    B = (double *)mkl_malloc( k*n*sizeof( double ), 64 );
    C = (double *)mkl_malloc( m*n*sizeof( double ), 64 );
    if (A == NULL || B == NULL || C == NULL) {
      printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
      mkl_free(A);
      mkl_free(B);
      mkl_free(C);
      return 1;
    }

    printf (" Intializing matrix data \n\n");
    for (i = 0; i < (m*k); i++) {
        A[i] = (double)(i+1);
    }

    for (i = 0; i < (k*n); i++) {
        B[i] = (double)(-i-1);
    }

    for (i = 0; i < (m*n); i++) {
        C[i] = 0.0;
    }

    printf (" Computing matrix product using Intel(R) MKL dgemm function via CBLAS interface \n\n");
    double dtime;
    dtime = -omp_get_wtime();

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 
                m, n, k, alpha, A, k, B, n, beta, C, n);
    dtime += omp_get_wtime();
    printf ("\n Computations completed.\n\n");
    printf ("time %f\n", dtime);
    printf ("GFLOPS %f\n", 2.0*1E-9*m*n*k/dtime);

    printf (" Top left corner of matrix A: \n");
    for (i=0; i<min(m,6); i++) {
      for (j=0; j<min(k,6); j++) {
        printf ("%12.0f", A[j+i*k]);
      }
      printf ("\n");
    }

    printf ("\n Top left corner of matrix B: \n");
    for (i=0; i<min(k,6); i++) {
      for (j=0; j<min(n,6); j++) {
        printf ("%12.0f", B[j+i*n]);
      }
      printf ("\n");
    }

    printf ("\n Top left corner of matrix C: \n");
    for (i=0; i<min(m,6); i++) {
      for (j=0; j<min(n,6); j++) {
        printf ("%12.5G", C[j+i*n]);
      }
      printf ("\n");
    }

    printf ("\n Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);

    printf (" Example completed. \n\n");
    return 0;
}