Let's take the following code, where there is a parent and a child kernel. From the parent kernel we wish to start threadIdx.x child kernels in different streams to maximize parallel throughput. We then wait for those children with cudaDeviceSynchronize(), as the parent kernel needs to see the changes made to global memory.
Now let's say we also wish to start n parent kernels in streams, and between each set of n parent kernels that we start in parallel, we must also wait for results using cudaDeviceSynchronize(). How would this behave?
From this official introduction to Dynamic Parallelism by Nvidia, I would think that parent kernel[0] would only wait for the streams started within it. Is this correct? If not, what happens?
NOTE: I am aware that only so many streams can run at once (32 in my case), but this is more to maximize occupancy.
EDIT: a little code sample
__global__ void child_kernel (void) {}

__global__ void parent_kernel (void)
{
    if (blockIdx.x == 0)
    {
        // each thread of block 0 creates its own stream and launches a child
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child_kernel <<<1,10,0,s>>> ();
        cudaStreamDestroy(s);
    }
    // wait for the children so their global memory writes are visible
    cudaDeviceSynchronize();
}
for (int i=0; i<10; i++)
{
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    parent_kernel <<<10,10,0,s>>> ();
    cudaStreamDestroy(s);
}
cudaDeviceSynchronize();
Parent kernels will wait on any spawned child kernels to complete before the parent kernel completes. This is covered in the dynamic parallelism documentation:
Any other semantics should be inferrable from ordinary stream semantics, that is: an activity launched into a particular stream will not begin until all previous activity launched into that stream has completed. Similarly, there is no enforced ordering between activities launched into separate streams.
In your example (or indeed in any example), the parent kernel will wait until all child kernels launched from that parent kernel have completed, regardless of what streams are used or not used.
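As a minimal sketch of that guarantee (the flag buffer and the managed allocation are my own illustration, not part of your code): the host observes the child's write once the parent grid completes, even though the parent never synchronizes on the child explicitly:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void child_kernel (int *flag) { *flag = 1; }

__global__ void parent_kernel (int *flag)
{
    if (threadIdx.x == 0)
    {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child_kernel <<<1,1,0,s>>> (flag);
        cudaStreamDestroy(s);
    }
    // no device-side synchronization here: the parent grid is still not
    // considered complete until its child grid has completed
}

int main ()
{
    int *flag;
    cudaMallocManaged(&flag, sizeof(int));
    *flag = 0;
    parent_kernel <<<1,10>>> (flag);
    cudaDeviceSynchronize();        // waits for the parent, and therefore its child
    printf("flag = %d\n", *flag);   // prints flag = 1
    cudaFree(flag);
    return 0;
}

(As with your sample, this needs to be compiled with relocatable device code for CDP, e.g. nvcc -rdc=true with an architecture that supports dynamic parallelism.)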
It's not clear whether you were asking about this, but note that for the device code in your example, cudaDeviceSynchronize() only guarantees that the calling thread will wait for the child kernel to complete, and likewise only enforces results visibility ordering for that thread. If you wish other threads in the same block to be able to witness global memory results from the child kernel spawned by thread 0 (just to pick an example), then you would want to follow up the cudaDeviceSynchronize() operation in thread 0 with a __syncthreads() operation. After that __syncthreads(), other threads in the same block will have guaranteed visibility into the global results produced by the child kernel launched by thread 0 (or by the child kernel launched by any thread, followed by a cudaDeviceSynchronize() call, that precedes the aforementioned __syncthreads()).
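A minimal sketch of that pattern (the data buffer and sizes are my own illustration, not from the question, and it assumes a toolkit where device-side cudaDeviceSynchronize() is available):

#include <cuda_runtime.h>

__global__ void child_kernel (int *data)
{
    data[threadIdx.x] = threadIdx.x;      // child writes results to global memory
}

__global__ void parent_kernel (int *data)
{
    if (threadIdx.x == 0)
    {
        child_kernel <<<1,10>>> (data);
        cudaDeviceSynchronize();          // only thread 0 waits for the child and
                                          // gains visibility of its results
    }
    __syncthreads();                      // after this, every thread in the block
                                          // can safely read the child's results
    int my_value = data[threadIdx.x % 10];
    (void)my_value;                       // consume the result (placeholder)
}

int main ()
{
    int *data;
    cudaMalloc(&data, 10 * sizeof(int));
    parent_kernel <<<1,32>>> (data);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}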
A couple of other limits to be aware of in a CDP environment are the nesting limit and the pending launch limit.
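Both are exposed through the runtime and can be queried and raised from the host with cudaDeviceGetLimit()/cudaDeviceSetLimit() before the first CDP launch; a minimal sketch (the specific values set below are arbitrary examples, not recommendations):

#include <cstdio>
#include <cuda_runtime.h>

int main ()
{
    size_t sync_depth = 0, pending_launches = 0;
    cudaDeviceGetLimit(&sync_depth, cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending_launches, cudaLimitDevRuntimePendingLaunchCount);
    printf("sync depth: %zu, pending launch count: %zu\n",
           sync_depth, pending_launches);

    // raise the limits if the defaults are too small for your launch pattern
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);
    return 0;
}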