Example of dynamic parallelism:
__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n",
           iDepth, tid, blockIdx.x);

    // condition to stop recursive execution
    if (iSize == 1) return;

    // reduce block size to half
    int nthreads = iSize >> 1;

    // thread 0 launches child grid recursively
    if (tid == 0 && nthreads > 0) {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}
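For reference, here is a minimal host-side launcher that reproduces the run below. This is a sketch: the argument handling is my assumption, not necessarily what the original program does. Dynamic parallelism requires relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true nestedHelloWorld.cu -o nestedHelloWorld -lcudadevrt.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// kernel nestedHelloWorld defined above, in the same file

int main(int argc, char **argv) {
    int igrid = 1;                          // number of parent blocks
    if (argc > 1) igrid = atoi(argv[1]);    // assumed CLI argument
    int isize = 8;                          // threads per parent block

    printf("Execution Configuration: grid %d block %d\n", igrid, isize);
    nestedHelloWorld<<<igrid, isize>>>(isize, 0);

    cudaDeviceSynchronize();                // waits for parent AND children
    return 0;
}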
This is the output with one block (with two blocks, the entire parent grid finishes printing before any child output appears):
./nestedHelloWorld
Execution Configuration: grid 1 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0
Say I launch a child grid from one thread in a block, at threadIdx.x == 0. Can I assume that all other threads in the parent grid have also finished executing up to the point where I launched the child grid?
If so, how does this work? Everything I'm reading says only that a parent grid is not technically finished before its child grids are; nothing about guarantees for other parent threads that have not launched children.
No. You can make no assumptions about the state of other threads in the parent block or other blocks in the parent grid.
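If you need that guarantee within the parent block, you have to create it yourself with a barrier before the launch; nothing analogous exists across blocks of the parent grid. A minimal sketch (childScan and parentPhase are hypothetical names):

#include <cuda_runtime.h>

__global__ void childScan(int *data) {       // hypothetical child kernel
    data[threadIdx.x] += 1;
}

__global__ void parentPhase(int *data) {
    data[threadIdx.x] = threadIdx.x;         // phase 1: every thread writes
    __syncthreads();                         // barrier: THIS block's writes done
    if (threadIdx.x == 0) {
        // the child is guaranteed to see this block's phase-1 writes,
        // but threads in OTHER parent blocks may not have run at all
        childScan<<<1, 32>>>(data);
    }
}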
When a parent thread launches a child grid, it pushes work to the GPU at a higher priority than itself. On compute capability 3.5 - 5.x the GPU will schedule the highest priority work, but it will not pre-empt any running blocks. If the GPU is full, the compute work distributor will not be able to schedule the child blocks. As parent blocks complete, the child blocks will be distributed before any new parent blocks.

At this point the design could still deadlock. To avoid that, if the block that launched the work does a join operation (cudaDeviceSynchronize) and the child work has not completed, either because there was not sufficient room to schedule it or because it is still running, then the parent block (not grid) will pre-empt itself. This allows the child grid to make forward progress. The CDP scheduler will restore the parent block when the child grid has completed.
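A sketch of that join pattern (kernel names are hypothetical; device-side cudaDeviceSynchronize is the CDP join available on compute capability 3.5 - 5.x):

#include <cuda_runtime.h>

__global__ void childWork(float *data) {     // hypothetical child kernel
    data[threadIdx.x] *= 2.0f;
}

__global__ void parentJoin(float *data) {
    if (threadIdx.x == 0) {
        childWork<<<1, 64>>>(data);   // queued at higher priority than the parent
        cudaDeviceSynchronize();      // join: if the child could not be scheduled
                                      // or is still running, this parent BLOCK
                                      // pre-empts itself so the child can progress
    }
    __syncthreads();                  // afterwards the child's results are
                                      // visible to the whole parent block
}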
The parent grid will not be marked as completed until all blocks from the parent complete and all child grids complete.
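From the host's perspective, that means a single synchronize on the parent covers the whole nesting (a sketch reusing parentJoin from above):

int main() {
    float *data;
    cudaMalloc(&data, 64 * sizeof(float));
    parentJoin<<<1, 64>>>(data);
    cudaDeviceSynchronize();   // returns only after the parent grid AND
                               // every descendant child grid has completed
    cudaFree(data);
    return 0;
}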
Nsight VSE CUDA Trace and Visual Profiler have additional visualizers for tracing CDP grids. The video (but not the slides) from the GTC 2013 presentation Profiling and Optimizing CUDA Kernel Code with NVIDIA Nsight Visual Studio Edition provides the best documentation on CDP visualization. Start watching at 17:15.