I am trying to optimize some matrix-matrix multiplication benchmark code that uses OpenMP on a MAESTRO processor. The MAESTRO has 49 cores arranged in a two-dimensional 7x7 grid, and each core has its own L1 and L2 cache. A layout of the board can be seen here: https://i.stack.imgur.com/RG0fC.png.
My main question is: can different data types (char vs. short vs. int, etc.) directly impact the performance of OpenMP code on NUMA-based processors? If so, is there a way to alleviate the impact? Below is my explanation of why I am asking.
I was given a set of benchmarks that a research group had used to measure the performance of a given processor. The benchmarks showed performance gains on other processors, but the group did not see the same kind of results when running them on the MAESTRO. Here is a snippet of the matrix multiplication benchmark from the base code I received:
Relevant macros from header file (MAESTRO is 64-bit):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <cblas.h>
#include <omp.h>
//set data types
#ifdef ARCH64
//64-bit architectures
#define INT8_TYPE char
#define INT16_TYPE short
#define INT32_TYPE int
#define INT64_TYPE long
#else
//32-bit architectures
#define INT8_TYPE char
#define INT16_TYPE short
#define INT32_TYPE long
#define INT64_TYPE long long
#endif
#define SPFP_TYPE float
#define DPFP_TYPE double
//setup timer
//us resolution
#define TIME_STRUCT struct timeval
#define TIME_GET(time) gettimeofday((time),NULL)
#define TIME_DOUBLE(time) (time).tv_sec+1E-6*(time).tv_usec
#define TIME_RUNTIME(start,end) TIME_DOUBLE(end)-TIME_DOUBLE(start)
//select random seed method
#ifdef FIXED_SEED
//fixed
#define SEED 376134299
#else
//based on system time
#define SEED time(NULL)
#endif
32-bit integer matrix multiplication benchmark:
double matrix_matrix_mult_int32(int size,int threads)
{
    //initialize index variables, random number generator, and timer
    int i,j,k;
    srand(SEED);
    TIME_STRUCT start,end;
    //allocate memory for matrices
    INT32_TYPE *A=malloc(sizeof(INT32_TYPE)*(size*size));
    INT32_TYPE *B=malloc(sizeof(INT32_TYPE)*(size*size));
    INT64_TYPE *C=malloc(sizeof(INT64_TYPE)*(size*size));
    //initialize input matrices to random numbers
    //initialize output matrix to zeros
    for(i=0;i<(size*size);i++)
    {
        A[i]=rand();
        B[i]=rand();
        C[i]=0;
    }
    //serial operation
    if(threads==1)
    {
        //start timer
        TIME_GET(&start);
        //computation
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //parallel operation
    else
    {
        //start timer
        TIME_GET(&start);
        //parallelize with OpenMP
        #pragma omp parallel for num_threads(threads) private(i,j,k)
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //free memory
    free(C);
    free(B);
    free(A);
    //compute and return runtime
    return TIME_RUNTIME(start,end);
}
Running the above benchmark serially gave better performance than running it with OpenMP. I was tasked with optimizing the benchmark to get better performance on the MAESTRO. Using the following code, I was able to get a performance increase:
double matrix_matrix_mult_int32(int size,int threads)
{
    //initialize index variables, random number generator, and timer
    int i,j,k;
    srand(SEED);
    TIME_STRUCT start,end;
    //allocate memory for matrices
    //set cache-homing attributes: A and B incoherent, C homed with ALLOC_HOME_TASK
    alloc_attr_t attrA = ALLOC_INIT;
    alloc_attr_t attrB = ALLOC_INIT;
    alloc_attr_t attrC = ALLOC_INIT;
    alloc_set_home(&attrA, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrB, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrC, ALLOC_HOME_TASK);
    INT32_TYPE *A=alloc_map(&attrA, sizeof(INT32_TYPE)*(size*size));
    INT32_TYPE *B=alloc_map(&attrB, sizeof(INT32_TYPE)*(size*size));
    INT64_TYPE *C=alloc_map(&attrC, sizeof(INT64_TYPE)*(size*size));
    //initialize matrices in parallel, flushing and invalidating the
    //input matrices in the local caches as they are written
    #pragma omp parallel for num_threads(threads) private(i)
    for(i=0;i<(size*size);i++)
    {
        A[i] = rand();
        B[i] = rand();
        C[i] = 0;
        tmc_mem_flush(&A[i], sizeof(A[i]));
        tmc_mem_flush(&B[i], sizeof(B[i]));
        tmc_mem_inv(&A[i], sizeof(A[i]));
        tmc_mem_inv(&B[i], sizeof(B[i]));
    }
    //serial operation
    if(threads==1)
    {
        //start timer
        TIME_GET(&start);
        //computation
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //parallel operation
    else
    {
        //start timer
        TIME_GET(&start);
        //parallelize with OpenMP, handing out rows dynamically
        #pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic)
        for(i=0;i<size;i++)
        {
            for(j=0;j<size;j++)
            {
                for(k=0;k<size;k++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //unmap memory
    alloc_unmap(C, sizeof(INT64_TYPE)*(size*size));
    alloc_unmap(B, sizeof(INT32_TYPE)*(size*size));
    alloc_unmap(A, sizeof(INT32_TYPE)*(size*size));
    //compute and return runtime
    return TIME_RUNTIME(start,end);
}
Making the caching of the two input arrays incoherent and using dynamic scheduling with OpenMP helped the parallel performance surpass the serial performance. This is my first experience with a NUMA architecture, so my 'optimizations' are light since I am still learning. Anyway, I tried using the same optimizations with the 8-bit integer version of the above code under all of the same conditions (number of threads and array sizes):
double matrix_matrix_mult_int8(int size,int threads)
{
    //initialize index variables, random number generator, and timer
    int i,j,k;
    srand(SEED);
    TIME_STRUCT start,end;
    //allocate memory for matrices
    //set cache-homing attributes: A and B incoherent, C homed with ALLOC_HOME_TASK
    alloc_attr_t attrA = ALLOC_INIT;
    alloc_attr_t attrB = ALLOC_INIT;
    alloc_attr_t attrC = ALLOC_INIT;
    alloc_set_home(&attrA, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrB, ALLOC_HOME_INCOHERENT);
    alloc_set_home(&attrC, ALLOC_HOME_TASK);
    INT8_TYPE *A=alloc_map(&attrA, sizeof(INT8_TYPE)*(size*size));
    INT8_TYPE *B=alloc_map(&attrB, sizeof(INT8_TYPE)*(size*size));
    INT16_TYPE *C=alloc_map(&attrC, sizeof(INT16_TYPE)*(size*size));
    //initialize matrices in parallel, flushing and invalidating the
    //input matrices in the local caches as they are written
    #pragma omp parallel for num_threads(threads) private(i)
    for(i=0;i<(size*size);i++)
    {
        A[i] = rand();
        B[i] = rand();
        C[i] = 0;
        tmc_mem_flush(&A[i], sizeof(A[i]));
        tmc_mem_flush(&B[i], sizeof(B[i]));
        tmc_mem_inv(&A[i], sizeof(A[i]));
        tmc_mem_inv(&B[i], sizeof(B[i]));
    }
    //serial operation
    if(threads==1)
    {
        //start timer
        TIME_GET(&start);
        //computation
        for(i=0;i<size;i++)
        {
            for(k=0;k<size;k++)
            {
                for(j=0;j<size;j++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //parallel operation
    else
    {
        //start timer
        TIME_GET(&start);
        //parallelize with OpenMP, handing out rows dynamically
        #pragma omp parallel for num_threads(threads) private(i,j,k) schedule(dynamic)
        for(i=0;i<size;i++)
        {
            for(j=0;j<size;j++)
            {
                for(k=0;k<size;k++)
                {
                    C[i*size+j]+=A[i*size+k]*B[k*size+j];
                }
            }
        }
        //end timer
        TIME_GET(&end);
    }
    //unmap memory
    alloc_unmap(C, sizeof(INT16_TYPE)*(size*size));
    alloc_unmap(B, sizeof(INT8_TYPE)*(size*size));
    alloc_unmap(A, sizeof(INT8_TYPE)*(size*size));
    //compute and return runtime
    return TIME_RUNTIME(start,end);
}
However, the 8-bit OpenMP version was slower than the 32-bit OpenMP version. Shouldn't the 8-bit version execute faster than the 32-bit version? What could be the cause of this discrepancy, and what are some possible things that could alleviate it? Could it be related to the data types of the arrays I'm using, or something else?
Two things come to mind.
The first is your 8-bit (one-byte) data type versus a 32-bit (four-byte) data type, and the compiler aligning data structures to N-byte boundaries. I think the default is typically 4-byte boundaries, especially when the build defaults to 32-bit. There is a compiler option to force the alignment boundary.
See: Why does compiler align N byte data types on N byte boundaries?
There might be extra operations to handle a one-byte data type, where the other three bytes have to be masked off to get the correct value, versus no masking at all for a native 32-bit (or 64-bit) data type.
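If alignment is the suspect, one experiment is to pad each row of the 8-bit matrices so that every row starts on an aligned address, at the cost of some wasted memory. This is only a minimal sketch assuming a 64-byte boundary; ALIGN, padded_stride, and alloc_padded_int8 are illustrative names I made up, not part of your benchmark or of any MAESTRO library, and the multiplication loops would have to index with the returned stride instead of size:
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdlib.h>
#include <string.h>

#define INT8_TYPE char   /* matches the 8-bit type from the benchmark header */
#define ALIGN 64         /* assumed boundary; adjust to the target's actual line/word size */

/* Round the logical row length up to the next multiple of ALIGN. */
static size_t padded_stride(size_t size)
{
    return (size + (ALIGN - 1)) & ~(size_t)(ALIGN - 1);
}

/* Allocate a size x size 8-bit matrix whose rows are padded so that every
 * row begins on an ALIGN-byte boundary. Element (i,j) is then indexed as
 * M[i*stride + j] instead of M[i*size + j]; the padding bytes are unused. */
static INT8_TYPE *alloc_padded_int8(size_t size, size_t *stride_out)
{
    size_t stride = padded_stride(size);
    INT8_TYPE *M = NULL;
    if (posix_memalign((void **)&M, ALIGN, stride * size) != 0)
        return NULL;
    memset(M, 0, stride * size);
    *stride_out = stride;
    return M;
}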
The other is processor and memory affinity: whether the parallel OpenMP code running on a given core is fetching or writing data in memory that is not directly attached to that core. Every hub the access has to traverse to reach distant memory adds to the run time. I'm not sure how much of this applies to your MAESTRO system, which I am unfamiliar with, but what I'm describing happens on late-model Intel four-socket systems connected via Intel QuickPath Interconnect (QPI). For example, if you are running on core 0 of CPU 0, fetching from the DRAM modules closest to that CPU will be fastest, versus accessing DRAM attached to CPU 3 over QPI, versus going through some hub or InfiniBand to reach DRAM on another blade or node, and so on. I know affinity can be handled with MPI, and I believe it can be with OpenMP as well, though perhaps not as cleanly. You might try researching "openmp cpu memory affinity".
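Assuming the MAESTRO toolchain ships an OpenMP 4.0 (or newer) runtime, which I have not verified, the standard affinity controls let you request thread binding without touching the kernels, e.g. by setting OMP_PLACES and OMP_PROC_BIND in the environment. A minimal sketch of verifying that from inside a program:
#include <stdio.h>
#include <omp.h>

/* Run with, for example:
 *   OMP_PLACES=cores OMP_PROC_BIND=close OMP_DISPLAY_ENV=true ./affinity_check
 * so that threads are bound close to the master thread's core and keep
 * re-using the data they touched during initialization (first-touch),
 * instead of migrating between the initialization and compute loops. */
int main(void)
{
    #pragma omp parallel proc_bind(close)
    {
        /* Each thread reports itself; with OMP_DISPLAY_ENV=true the runtime
         * also prints the binding settings it actually applied. */
        printf("thread %d of %d started\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}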