MPI + CUDA AWARE, concurrents kernels and MPI_Sendrecv

465 views Asked by At

During my work, I've found a little problem. Now I'm using MVAPICH-GDR-2.05 and Open MPI 1.7.4 with CUDA 6.0.

I'm working on the exchange of non contiguous elements between GPUs (like the columns of a matrix), and I'm trying to run two kernel's (one for scatter and one for gather) and a communication with MPI_Sendrecv between two GPUs concurrently.

I've used the CUDA profiler (nvprof) to see what my program is doing, and I've seen strange things:

  • With Open MPI 1.7.4, I've 3 cuda streams works concurrently.
  • With MVAPICH-gdr-2.05, I've two concurrent kernel's and the MPI_Sendrecv is not concurrent with them.

Do you know why MPI_Sendrecv in MVAPICH does this?

This is my pseudocode:

// creation and initialization of streams
cudaStream_t stream1, stream2;
cudaStreamCreateWithFlags( stream1, cudaStreamNonBlocking )
cudaStreamCreateWithFlags( stream2, cudaStreamNonBlocking )

///////////////////////////////////////////////////////////////////

// 1) --> gather of the first chunk
gather_kernel <<< dimGrid, dimBlock, 0, stream1 >>> ( ... )
cudaStreamSynchronize(stream1)

// 2) --> gather of the second chunk
//    --> communication of the first chunk
gather_kernel <<< dimGrid, dimBlock, 0, stream1 >>> ( ... )
MPI_Sendrecv( ... )
cudaStreamSynchronize(stream1)

// 3) --> scatter of the chunk (ii)
//    --> gather of the chunk (ii+2)
//    --> communication of the chunk (ii+1)
// K is the number of chunk
for ( ii=0; ii<K-2; ii++ ){
    scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... )
    gather_kernel  <<< dimGrid, dimBlock, 0, stream1 >>> ( ... )
    MPI_Sendrecv( ... )
    cudaStreamSynchronize(stream2)
    cudaStreamSynchronize(stream1)
}

// 4) --> scatter of the penultimate chunk
//    --> communication of the last chunk
scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... )
MPI_Sendrecv( ... )
cudaStreamSynchronize(stream2)

// 5) --> scatter of the last chunk
scatter_kernel <<< dimGrid, dimBlock, 0, stream2 >>> ( ... )
cudaStreamSynchronize(stream2)

And these are the two profiler's screenshoot:

0

There are 0 answers