How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?

I need the runParatron child kernels to finish completely before the next iteration of the for loop begins. Based on the results I am getting, I'm pretty sure that's not happening. For example, a print statement in runParatron executes AFTER the first "[" is printed, even though that "[" is printed after the for loop. I tried calling cudaDeviceSynchronize, but it wouldn't compile: the compiler said that host code can't be executed in device code and that cudaDeviceSynchronize is undefined in device code. Is there any way to wait until the child kernels are done? I see other posts, examples, and tutorials using cudaDeviceSynchronize within kernels, so perhaps I am missing something basic? Help would be greatly appreciated.

__global__ void runMLP(double* x, double* outputs, double* weights, activation_function* A_Fs, int* CIL, int layers, int bias, int* WLO, int* OLO) {

    if (CIL[0] > 511) {
        copyElements<<<CIL[0] / 32, 32>>>(outputs, x, CIL[0]);
        //I WOULD ALSO LIKE TO WAIT HERE
    }
    else
        for (int i = 0;i < CIL[0];i++) {
            outputs[i] = x[i];
        }

    for (int i = 1;i < layers;i++) {
        printf("----------------------Layer %d :: InputSize %d :: Layer weight offset %d :: Layer output offset %d----------------------\n", i, CIL[i-1], WLO[i-1], OLO[i]);
        runParatron<<<(CIL[i] / 32) + 1, 32>>>(outputs + OLO[i - 1], outputs + OLO[i], weights + WLO[i - 1], A_Fs[i], CIL[i - 1], CIL[i], bias);
        //cudaDeviceSynchronize(); //THIS IS WHERE I NEED TO WAIT UNTIL NEXT ITERATION
    }
    if (A_Fs[layers - 1] == SOFTMAX) {
        double* temp = outputs + OLO[layers - 1];
        printf("[");
        for (int i = 0;i < CIL[layers-1];i++) {
            printf("%f, ", temp[i]); // temp[i] is a double, so it needs %f rather than %d
        }
        printf("]\n");
        double denom = 0;
        for (int i = 0;i < CIL[layers - 1];i++) {
            denom += temp[i];
        }
        if (denom < DBL_MIN)
            denom = DBL_MIN;
        for (int i = 0;i < CIL[layers - 1];i++) {
            temp[i] /= denom;
        }
    }
}

For example, here is the output where the "[" comes before the child kernel output:

//All Cell: starting lines are produced from child kernel
[Cell: 0 :: weightOffset 0 :: AF 2 //As you can see, there is the "[" here when it should be
Cell: 1 :: weightOffset 6 :: AF 2
Cell: 2 :: weightOffset 12 :: AF 2
Cell: 3 :: weightOffset 18 :: AF 2
-502657059,  2118981138,  1645236453, ] //Down here!
There is 1 answer

yugi957 On

I added an atomic counter and incremented it by one at the end of each child kernel. Then I put a while loop after the child kernel launch that checks whether the counter has reached the number of increments I expect. This fixed it. Let me know if anyone needs code or clarification.
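
For anyone who wants to see roughly what this looks like, here is a minimal sketch of the counter approach, not my exact code. The names blocksDone, childKernel, parentKernel and runDemo are made up for illustration, and childKernel is only a stand-in for something like runParatron. I am assuming the counter is incremented once per child block and that the parent runs as a single thread. Be aware that, as far as I know, CUDA does not guarantee that a child grid starts running while its parent is still spinning, so this pattern can hang on some setups even though it worked for me.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical completion counter: every child block adds 1 when it is done.
__device__ unsigned int blocksDone = 0;

// Stand-in for a child kernel such as runParatron: doubles each element.
__global__ void childKernel(double* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0;
    __threadfence();   // publish this thread's write device-wide
    __syncthreads();   // wait until every thread in the block has published
    if (threadIdx.x == 0) atomicAdd(&blocksDone, 1u);  // one increment per block
}

// Parent kernel, assumed to be launched as <<<1, 1>>>.
__global__ void parentKernel(double* data, int n, int steps) {
    unsigned int blocks = (n + 31) / 32;
    for (int s = 0; s < steps; ++s) {
        atomicExch(&blocksDone, 0u);           // reset before each launch
        childKernel<<<blocks, 32>>>(data, n);
        // Spin until every child block has reported in.
        // atomicAdd with 0 is used here as an atomic read of the counter.
        while (atomicAdd(&blocksDone, 0u) < blocks) { }
        __threadfence();                       // order later reads after the wait
    }
}

// Host helper: fills n elements with 1.0, runs the parent, returns element 0.
// Error checking is omitted to keep the sketch short.
double runDemo(int n, int steps) {
    std::vector<double> host(n, 1.0);
    double* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(double));
    cudaMemcpy(dev, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    parentKernel<<<1, 1>>>(dev, n, steps);
    cudaDeviceSynchronize();                   // host-side call, always allowed
    cudaMemcpy(host.data(), dev, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host[0];
}
```

Dynamic parallelism needs relocatable device code, so this would be built with something like nvcc -rdc=true demo.cu -lcudadevrt. Calling runDemo(100, 3) from main should return 8.0 if each child launch really finishes before the next one starts, since every element starts at 1.0 and is doubled three times. If the layers do not need to be driven from inside a kernel, launching them one after another from the host in the same stream gives the same ordering without any spinning.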