How to make multiple CUBLAS API calls (e.g. cublasDgemm) really execute concurrently in multiple cudaStreams

I want to make two CUBLAS API calls (e.g. cublasDgemm) really execute concurrently in two cudaStreams.

As we know, the CUBLAS API is asynchronous: level 3 routines like cublasDgemm don't block the host. I therefore expected the following calls (in the default cudaStream) to run concurrently:

cublasDgemm();
cublasDgemm();

BUT when I profile the program with the NVIDIA Visual Profiler, it shows that they run one after another.

Then I tried binding them to different cudaStreams. The pseudocode is:

// Assumed to be set up earlier: handle, dim, batch_count, and the
// arrays of device matrix pointers d_A, d_B, d_C.

// Create a stream for every DGEMM operation
cudaStream_t *streams = (cudaStream_t *) malloc(batch_count*sizeof(cudaStream_t));
for(int i=0; i<batch_count; i++)
    cudaStreamCreate(&streams[i]);

// Set matrix coefficients
double alpha = 1.0;
double beta  = 1.0;

// Launch each DGEMM operation in its own CUDA stream
for(int i=0; i<batch_count; i++){
    // Bind the handle to stream i; subsequent CUBLAS calls are issued there
    cublasSetStream(handle, streams[i]);

    // DGEMM: C = alpha*A*B + beta*C
    cublasDgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                dim, dim, dim,
                &alpha,
                d_A[i], dim,
                d_B[i], dim,
                &beta,
                d_C[i], dim);
}

With batch_count=5, the result shown by the NVIDIA Visual Profiler is:

[Figure: execution of multiple cublasDgemm routines in multiple streams]

The result shows that they still run one after another. How can I make multiple CUBLAS API calls run truly concurrently in multiple cudaStreams, like this:

[Figure: execution of multiple kernels in multiple streams, running truly concurrently]

Does anybody have any idea? Thanks.

There is 1 answer

Answered by Yangsong Zhang

First, thanks to @Robert Crovella for his comments.

Based on @Robert Crovella's help and my own research, multiple CUBLAS API calls (e.g. cublasDgemm) can run concurrently in some special cases, but in most cases they cannot.

CASE 1: When I execute cublasDgemm with large dims (m=n=k=1024*8) on a K40, the profiler shows the following result:
[Figure: cublasDgemm with dims M=N=K=1024*8]

CASE 2: When I execute cublasDgemm with small dims (m=n=k=64) on a K40, the profiler shows the following result:
[Figure: cublasDgemm with dims M=N=K=64]

CASE 3: BUT when I execute cublasDgemm with dims of m=n=k=256 on a K40, the profiler shows the following result:
[Figure: cublasDgemm with dims M=N=K=256]

The results of CASE 1 and CASE 2 show that CUBLAS API calls do not run concurrently with either large or small dims. In case 1, a single call already uses up the GPU resources, so there is no room left to run another routine. In case 2, it is the launch latency of the two kernels that makes it difficult to see any concurrency, presumably because each kernel is so short that it has finished before the next one is launched.
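The explanation for CASE 1 and CASE 2 can be made a little more concrete with a rough back-of-the-envelope estimate. The following is only a sketch in plain host-side C++ (no GPU needed). The function names are hypothetical, and the peak rate and launch latency are values the caller has to supply as assumptions. It counts the 2*dim^3 floating-point operations of a square DGEMM and converts them into a lower bound on the kernel run time, which can then be compared with the kernel launch latency.

```cpp
#include <cstdio>

// Floating-point operations in C = alpha*A*B + beta*C for square matrices
// with m = n = k = dim. Only the 2*dim^3 operations of the A*B product are
// counted; the lower-order alpha/beta scaling work is ignored.
double gemm_flops(double dim) {
    return 2.0 * dim * dim * dim;
}

// Lower bound, in microseconds, on the run time of one such GEMM if it ran
// at peak_gflops (GFLOP/s). Real kernels, especially small ones, are slower.
double min_kernel_time_us(double dim, double peak_gflops) {
    return gemm_flops(dim) / (peak_gflops * 1e9) * 1e6;
}

// True when the ideal run time is shorter than the assumed launch latency.
// In that regime one kernel may already be done before the next one starts,
// so little or no overlap would be visible in a profiler timeline.
bool ideal_time_below_latency(double dim, double peak_gflops,
                              double launch_latency_us) {
    return min_kernel_time_us(dim, peak_gflops) < launch_latency_us;
}

// Print the estimate for one problem size (helper for quick experiments).
void print_estimate(double dim, double peak_gflops, double launch_latency_us) {
    std::printf("dim=%.0f: ideal time %.3f us, below launch latency: %s\n",
                dim, min_kernel_time_us(dim, peak_gflops),
                ideal_time_below_latency(dim, peak_gflops, launch_latency_us)
                    ? "yes" : "no");
}
```

As a usage example, assume a nominal double-precision peak of about 1430 GFLOP/s for the K40 and a launch latency of about 5 microseconds (both are assumptions, not measurements). Calling print_estimate(64, 1430.0, 5.0) then gives an ideal run time well below one microsecond, which is shorter than the assumed launch latency. Calling print_estimate(8192, 1430.0, 5.0) gives several hundred milliseconds, which fits a kernel large enough to keep the whole GPU busy on its own. For dim=256 the ideal time is a few tens of microseconds, which lies between those two extremes.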