I am basically looking for a way to synchronize a stream from within the device. I want to avoid using cudaDeviceSynchronize(), as it would serialize execution of my kernel that I want to execute concurrently using streams;
More detailed description: I have written a kernel, that is a stabilized bi-conjugate gradient solver. I want to lunch this kernel concurrently on different data using streams.
This kernel uses cublas functions. They are called from within the kernel.
One of operations required by the solver is calculation of a dot product of two vectors. This can be done with cublasdot(). But as this call is synchronous, execution of kernels in different streams get serialized. Instead of calling a dot product function, I calculate the dot product using cublasspmv(), which is called asynchronously. The problem is that this function returns before the result is calculated. I want therefore to synchronize the stream from the device - I am looking for an equivalent of cudaStreamSynchronize() but callable from the device.
__device__ float _cDdot(cublasHandle_t & cublasHandle, const int n, real_t * x, real_t * y) {
      float *norm; norm = new float; 
      float alpha = 1.0f; float beta = 0.0f;
      cublasSgemv_v2(cublasHandle, CUBLAS_OP_N ,1 , n, &alpha, x, 1, y, 1, &beta, norm, 1);
      return *norm;
}
What can I do to make sure, that the result is calculated before the function returns? Of course insertion of cudaDeviceSynchronize() works, but as I mentioned, it serializes the execution of my kernel across streams.
 
                        
Probably if you read the programming guide dynamic parallelism section carefully (especially streams, events, and synchronization), you may get some ideas. Here's what I came up with:
There is an implicit NULL stream (on the device) associated with the execution sequence that calls your
_cDdotfunction (oddly named, IMHO, since you're working withfloatquantities in that case, i.e. usingSgemv). Therefore, any cuda kernel or API call issued after the call tocublasSgemv_v2in your function should wait until any cuda activity associated with thecublasSgemv_v2function is complete. If you insert an innocuous cuda API call, or else a dummy kernel call, after the call tocublasSgemv_v2, it should wait for that to be complete. This should give you the thread-level synchronization you are after. You might also be able to use acudaEventRecordcall followed by acudaStreamWaitEventcall.Here's an example to show the implicit stream synchronization approach:
compile with:
results:
Unfortunately I tried a completely empty
dummy_kernel; that did not work, unless I compiled with-G. So the compiler may be smart enough to optimize out a complete empty child kernel call.