I am trying to distribute the work of multiplying two NxN matrices across 3 NVIDIA GPUs using 3 OpenMP threads. (The matrix values will get large, hence the long long data type.) However, I am having trouble putting the #pragma acc parallel loop in the correct place. I have tried some examples from the NVIDIA PDFs that were shared, but with no luck. I know that the innermost loop cannot be parallelized, but I would like each of the three threads to own a GPU and do a portion of the work. Note that the input and output matrices are defined as global variables because I kept running out of stack memory.
I have tried the code below, but I get compilation errors, all pointing to line 75, which is the #pragma acc parallel loop line:
[test@server ~]pgcc -acc -mp -ta=tesla:cc60 -Minfo=all -o testGPU matrixMultiplyopenmp.c
PGC-S-0035-Syntax error: Recovery attempted by replacing keyword for by keyword barrier (matrixMultiplyopenmp.c: 75)
PGC-S-0035-Syntax error: Recovery attempted by replacing acc by keyword enum (matrixMultiplyopenmp.c: 76)
PGC-S-0036-Syntax error: Recovery attempted by inserting ';' before keyword for (matrixMultiplyopenmp.c: 77)
PGC/x86-64 Linux 18.10-1: compilation completed with severe errors
The function is:
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
    // Get Nvidia device type
    acc_init(acc_device_nvidia);
    // Get Number of GPUs in system
    int num_gpus = acc_get_num_devices(acc_device_nvidia);
    //Set the number of OpenMP thread to the number of GPUs
    #pragma omp parallel num_threads(num_gpus)
    {
        //Get thread openMP number and set the GPU device to that number
        int threadNum = omp_get_thread_num();
        acc_set_device_num(threadNum, acc_device_nvidia);
        int row;
        int col;
        int key;
        #pragma omp for
        #pragma acc parallel loop
        for (row = 0; row < SIZE; row++)
            for (col = 0; col < SIZE; col++)
                for (key = 0; key < SIZE; key++)
                    matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
    }
}
As fisehara points out, you can't combine an OpenMP "for" directive and an OpenACC "parallel loop" directive on the same for loop. Instead, you need to manually decompose the work across the OpenMP threads. Example below.
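Here is a minimal sketch of that manual decomposition, not a tuned implementation. Each OpenMP thread binds to one GPU, computes its own block of rows, and runs a separate OpenACC parallel loop over just that block. The SIZE value, the fillInputs helper, and the choice of a contiguous row-block split are assumptions made for illustration. The sketch also writes each element from a local sum rather than accumulating into matrixProduct, and the _OPENMP/_OPENACC guards are only there so it still builds as serial code when those options are off. I have not run it on a multi-GPU system.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef _OPENACC
#include <openacc.h>
#endif

#define SIZE 64 /* illustrative size only; not the value from the question */

/* Globals, as in the question, so the arrays stay off the stack. */
static long long matrixA[SIZE][SIZE];
static long long matrixB[SIZE][SIZE];
static long long matrixProduct[SIZE][SIZE];

/* Hypothetical helper: A[i][k] = i + 1 and B[k][j] = j + 1,
   so the exact product is SIZE * (i + 1) * (j + 1). */
void fillInputs(void)
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++) {
            matrixA[i][j] = i + 1;
            matrixB[i][j] = j + 1;
        }
}

void multiplyMatrix(void)
{
    int devices = 0; /* stays 0 when built without OpenACC */
#ifdef _OPENACC
    devices = acc_get_num_devices(acc_device_nvidia);
#endif
    int num_threads = devices > 0 ? devices : 1;

    /* One OpenMP thread per GPU; note there is no "omp for" on the loop nest. */
    #pragma omp parallel num_threads(num_threads)
    {
        int tid = 0, nthreads = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
#ifdef _OPENACC
        if (devices > 0)
            acc_set_device_num(tid % devices, acc_device_nvidia);
#endif
        /* Manual decomposition: each thread takes a contiguous block of rows. */
        int chunk = (SIZE + nthreads - 1) / nthreads;
        int start = tid * chunk < SIZE ? tid * chunk : SIZE;
        int end = start + chunk < SIZE ? start + chunk : SIZE;
        int nrows = end - start;
        assert(nrows >= 0);

        if (nrows > 0) {
            /* Each device receives only its rows of A, all of B,
               and returns only its rows of the product. */
            #pragma acc parallel loop collapse(2) \
                copyin(matrixA[start:nrows][0:SIZE], matrixB[0:SIZE][0:SIZE]) \
                copyout(matrixProduct[start:nrows][0:SIZE])
            for (int row = start; row < end; row++) {
                for (int col = 0; col < SIZE; col++) {
                    long long sum = 0;
                    #pragma acc loop seq
                    for (int key = 0; key < SIZE; key++)
                        sum += matrixA[row][key] * matrixB[key][col];
                    matrixProduct[row][col] = sum;
                }
            }
        }
    }
}
```

Call fillInputs() and then multiplyMatrix() from main, and build with the same flags as in the question (pgcc -acc -mp -ta=tesla:cc60 -Minfo=all). The -Minfo output should then report one generated kernel for the row/col loops.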
Is there a reason why you want to use multiple GPUs here? Most likely the matrix multiply will fit onto a single GPU, so there's no need for the extra overhead of introducing host-side parallelization.
Also, I generally recommend using MPI+OpenACC for multi-GPU programming. Domain decomposition is naturally part of MPI but not inherent in OpenMP. MPI also gives you a one-to-one relationship between the host process and the accelerator, allows for scaling beyond a single node, and lets you take advantage of CUDA-aware MPI for direct GPU-to-GPU data transfers. For more info, do a web search for "MPI OpenACC" and you'll find several tutorials. Class #2 at https://developer.nvidia.com/openacc-advanced-course is a good resource.