Understanding Dynamic Parallelism in CUDA


Example of dynamic parallelism:

__global__ void nestedHelloWorld(int const iSize, int iDepth) {
    int tid = threadIdx.x;
    printf("Recursion=%d: Hello World from thread %d block %d\n", iDepth, tid, blockIdx.x);
    // condition to stop recursive execution
    if (iSize == 1) return;
    // halve the thread count for the child grid
    int nthreads = iSize >> 1;
    // thread 0 launches the child grid recursively
    if (tid == 0 && nthreads > 0) {
        nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
        printf("-------> nested execution depth: %d\n", iDepth);
    }
}
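For completeness, a host-side driver matching the sample output might look like the following. This is only a sketch: dynamic parallelism requires compute capability 3.5 or higher and relocatable device code (e.g. `nvcc -arch=sm_35 -rdc=true nestedHelloWorld.cu`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// nestedHelloWorld kernel as defined above

int main(void) {
    dim3 block(8);   // eight threads, matching the sample output below
    dim3 grid(1);    // a single parent block
    printf("Execution Configuration: grid %d block %d\n", grid.x, block.x);
    nestedHelloWorld<<<grid, block>>>(block.x, 0);
    cudaDeviceSynchronize();   // waits for the parent grid AND all child grids
    cudaDeviceReset();
    return 0;
}
```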

This prints the following with one block; with two blocks, the output suggests the entire parent grid finishes first:

./nestedHelloWorld Execution Configuration: grid 1 block 8
Recursion=0: Hello World from thread 0 block 0
Recursion=0: Hello World from thread 1 block 0
Recursion=0: Hello World from thread 2 block 0
Recursion=0: Hello World from thread 3 block 0
Recursion=0: Hello World from thread 4 block 0
Recursion=0: Hello World from thread 5 block 0
Recursion=0: Hello World from thread 6 block 0
Recursion=0: Hello World from thread 7 block 0
-------> nested execution depth: 1
Recursion=1: Hello World from thread 0 block 0
Recursion=1: Hello World from thread 1 block 0
Recursion=1: Hello World from thread 2 block 0
Recursion=1: Hello World from thread 3 block 0
-------> nested execution depth: 2
Recursion=2: Hello World from thread 0 block 0
Recursion=2: Hello World from thread 1 block 0
-------> nested execution depth: 3
Recursion=3: Hello World from thread 0 block 0

Say I launch a child grid from one thread in a block, at threadIdx.x == 0. Can I assume that all other threads in the parent grid have also finished executing up to the point where I launched the child grid?

If so, how does this work? All I'm reading is that a parent grid is not technically finished before a child grid finishes; nothing about guarantees for other parent threads that have not launched children.


There are 2 answers

Greg Smith (accepted answer):

Say I launch a child grid from one thread in a block, at threadIdx.x == 0. Can I assume that all other threads in the parent grid have also finished executing up to the point where I launched the child grid?

No. You can make no assumptions about the state of other threads in the parent block or other blocks in the parent grid.
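This is why, if a child grid is supposed to consume results produced by the other threads of the parent block, the parent must enforce that ordering itself, for example with __syncthreads() before the launch. A minimal sketch (childKernel and data are hypothetical):

```cuda
__global__ void childKernel(int *data);   // hypothetical child kernel

__global__ void parentKernel(int *data) {
    int tid = threadIdx.x;
    data[tid] = tid * tid;        // every parent thread writes its result

    // Without this barrier, thread 0 may launch the child while other
    // parent threads have not yet written their results to data[].
    __syncthreads();

    if (tid == 0) {
        // The child grid can now see all of the parent block's writes.
        childKernel<<<1, blockDim.x>>>(data);
    }
}
```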

If so, how does this work? All I'm reading is that a parent grid is not technically finished before a child grid finishes; nothing about guarantees for other parent threads that have not launched children.

When a parent thread launches a child grid, it pushes work to the GPU at a higher priority than its own. On compute capability 3.5 - 5.x, the GPU schedules the highest-priority work, but it does not pre-empt any running blocks. If the GPU is full, the compute work distributor cannot schedule the child blocks. As parent blocks complete, the child blocks are distributed before any new parent blocks. At this point the design could still deadlock: if the block that launched the work performs a join operation (cudaDeviceSynchronize) and the child work has not completed (either because there was insufficient room to schedule it or because it is still running), then the parent block (not the grid) pre-empts itself. This allows the child grid to make forward progress. The CDP scheduler restores the parent block when the child grid has completed.
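The join described above is the device-side cudaDeviceSynchronize() call. A minimal sketch of the pattern (child is a hypothetical kernel; note that device-side cudaDeviceSynchronize() belongs to the original CDP model for cc 3.5 - 5.x and has been deprecated in recent CUDA releases):

```cuda
__global__ void child(int *data);        // hypothetical child kernel

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        // Child work is pushed at a higher priority than the parent.
        child<<<1, 32>>>(data);
        // Join: if the child has not finished (or could not be scheduled),
        // this parent block pre-empts itself so the child can make forward
        // progress; the CDP scheduler restores the block afterwards.
        cudaDeviceSynchronize();
    }
    // Make the rest of the block wait for the join performed by thread 0.
    __syncthreads();
}
```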

The parent grid will not be marked as completed until all blocks from the parent complete and all child grids complete.

  • If the parent grid launches a child grid but does not join, it is possible that all parent blocks complete before the child blocks are scheduled.
  • If the parent grid joins, then it is likely that all child grids complete before the parent blocks complete.
  • If the parent grid launches more work than the GPU can execute concurrently, the answer is somewhere in between.

Nsight VSE CUDA Trace and Visual Profiler have additional visualizers for tracing CDP grids. The video (but not the slides) from the GTC 2013 presentation Profiling and Optimizing CUDA Kernel Code with NVIDIA Nsight Visual Studio Edition provides the best documentation on CDP visualization. Start watching at 17:15.

Avi Ginsburg:

No. All threads in a warp execute in lock-step, so if thread 0 has not yet reached the launch point, neither have threads 1-31 of that warp. Other warps in the block, and other blocks, may or may not have finished executing.