I am writing a simple multi-stream CUDA application. The following is the part of the code where I create the CUDA streams, the cuBLAS handle, and the cuDNN handle:
cudaSetDevice(0);
int num_streams = 1;
cudaStream_t streams[num_streams];
cudnnHandle_t mCudnnHandle[num_streams];
cublasHandle_t mCublasHandle[num_streams];
for (int ii = 0; ii < num_streams; ii++) {
    cudaStreamCreateWithFlags(&streams[ii], cudaStreamNonBlocking);
    cublasCreate(&mCublasHandle[ii]);
    cublasSetStream(mCublasHandle[ii], streams[ii]);
    cudnnCreate(&mCudnnHandle[ii]);
    cudnnSetStream(mCudnnHandle[ii], streams[ii]);
}
Now, my stream count is 1. But when I profile the executable of the above application using the NVIDIA Visual Profiler, I get the following:
For every stream I create, it creates 4 additional streams. I tested it with num_streams = 8, and the profiler showed 40 streams. This raised the following questions:
- Does cudnn internally create streams? If yes, then why?
- If it implicitly creates streams, then how can they be utilized?
- In such a case, does explicitly creating streams make any sense?
Yes.
Because it is a library, and it may need to organize CUDA concurrency. Streams are used to organize CUDA concurrency. If you want a detailed explanation of what exactly the streams are used for, the library internals are not documented.
Those streams are not intended for you to utilize separately/independently. They are for usage by the library, internal to the library routines.
You would still need to explicitly create any streams you need to manage CUDA concurrency outside of the library usage.
I would like to point out that this statement is a bit misleading:
"For every stream I create it creates additional 4 more streams."
What you are doing is going through a loop, and at each loop iteration you are creating a new handle. Your observation is tied to the number of handles you create, not the number of streams you create.