I looking for a way how to get rid of busy waiting in host thread in fallowing code (do not copy that code, it only shows an idea of my problem, it has many basic bugs):
cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
while (true) {
if (cudaStreamQuery(streams[sid])) == cudaSuccess) { //BUSY WAITING !!!!
cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
break;
}
sid = ++sid % S_N;
}
}
Is there a way to idle host thread and wait somehow to some stream to finish, and then prepare and run another stream?
EDIT: I added while(true) into the code, to emphasize busy waiting. Now I execute all the streams, and check which of them finished to run another new one. cudaStreamSynchronize
waits for particular stream to finish, but I want to wait for any of the streams which as a first finished the job.
EDIT2: I got rid of busy-waiting in fallowing way:
cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
sid = ++sid % S_N;
}
for (int i = 0; i < S_N; i++) {
cudaStreamSynchronize(streams[i]);
cudaStreamDestroy(streams[i]);
}
But it appears to be a little bit slower than the version with busy-waiting on host thread. I think it is because, now I statically distribute the jobs on streams, so when the one stream finishes work it is idle till each of the stream finishes the work. The previous version dynamically distributed the work to the first idle stream, so it was more efficient, but there was busy-waiting on the host thread.
My idea to solve that problem is to have one host thread per one stream. That host thread would invoke cudaStreamSynchronize to wait till the stream commands are completed. Unfortunately it is not possible in CUDA 3.2 since it allows only one host thread deal with one CUDA context, it means one host thread per one CUDA enabled GPU.
Hopefully, in CUDA 4.0 it will be possible: CUDA 4.0 RC news
EDIT: I have tested in CUDA 4.0 RC, using open mp. I created one host thread per cuda stream. And it started to work.