I've read that dynamic parallelism is supported in newer versions of CUDA, and that I can call thrust functions like thrust::exclusive_scan inside a kernel function with the thrust::device execution policy parameter.
__global__ void kernel(int* inarray, int n, int *result) {
    extern __shared__ int s[];
    int t = threadIdx.x;
    // each thread loads one element into shared memory
    s[t] = inarray[t];
    __syncthreads();
    thrust::exclusive_scan(thrust::device, s, s + n, result);
    __syncthreads();
}

int main() {
    // prep work
    kernel<<<1, n, n * sizeof(int)>>>(inarray, n, result);
}
What I'm confused about is:
- When calling a thrust function inside a kernel, does each thread call the function once, so that they all perform dynamic parallelism on the data?
- If they do, I only need one thread to call thrust, so I could just guard the call with an if on threadIdx; if not, how do the threads in a block communicate with each other that the call to thrust has already been made and that they should just ignore it? (This seems a little imaginary, since there wouldn't be a systematic way to ensure that from user code.)

To summarize, what exactly happens when I call thrust functions with the thrust::device parameter inside a kernel?
Every thread in your kernel that executes the thrust algorithm will execute a separate copy of your algorithm. The threads in your kernel do not cooperate on a single algorithm call.
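In terms of the kernel in the question, that means the scan line is reached by every one of the n threads in the block, and each thread performs its own complete scan of s, all writing to the same result array, rather than the threads cooperating on one scan:

    thrust::exclusive_scan(thrust::device, s, s + n, result);  // executed independently by each thread that reaches it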
If you have met all the requirements (HW/SW and compilation settings) for a CUDA dynamic parallelism (CDP) call, then each thread that encounters the thrust algorithm call will launch a CDP child kernel to perform the thrust algorithm (in that case, the threads in the CDP child kernel do cooperate). If not, each thread that encounters the thrust algorithm call will perform it as if you had specified thrust::seq instead of thrust::device.

If you prefer to avoid the CDP activity in an otherwise CDP-capable environment, you can specify thrust::seq instead.

If you intend, for example, that only a single copy of your thrust algorithm be executed, it will be necessary in your kernel code to ensure that only one thread calls it, for example:
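(a minimal sketch of that guard, reusing the names s, n, and result from the kernel in the question)

    if (threadIdx.x == 0)
        thrust::exclusive_scan(thrust::device, s, s + n, result);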
or similar.
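And if the goal is simply to avoid CDP launches altogether, the same guarded call could use thrust::seq, in which case the single calling thread performs the scan sequentially (a sketch combining the two suggestions above):

    if (threadIdx.x == 0)
        thrust::exclusive_scan(thrust::seq, s, s + n, result);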
Questions around synchronization before/after the call are no different from ordinary CUDA code. If you need all the threads in the block to wait for the thrust algorithm to complete, use e.g. __syncthreads() (and cudaDeviceSynchronize() in the CDP case); a sketch is shown below.

The information here may possibly be of interest as well.
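As an illustration of the synchronization point above, a sketch of how it might wrap the single-thread call, assuming the legacy CDP model in which device-side cudaDeviceSynchronize() is available:

    if (threadIdx.x == 0) {
        thrust::exclusive_scan(thrust::device, s, s + n, result);
        cudaDeviceSynchronize();  // CDP case: wait for the child kernel launched by thrust
    }
    __syncthreads();              // all threads in the block wait until result is ready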