Code running on two GPUs does not achieve concurrent execution and shows no meaningful speedup


I have code like this:

for (int i = 0; i < 2; i++)
{
    // initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image); // function that calls the kernels
}

Each iteration in the above loop is independent. I want to run them concurrently. So, I tried this:

for (int i = 0; i < num_devices; i++)
{
    cudaSetDevice(i);
    //initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image);
    // RunDll internally does the following:
    {
        RunBasicFBP_CUDA(parameters); //function that calls kernel 1

        xSegmentMetal(parameters); //CPU function

        RunBasicFP_CUDA(parameters);  //function that uses output of kernel 1 as input for kernel 2

        for (int idx_view = 0; idx_view < param.fbp.num_view; idx_view++)
        {
            for (int idx_bin = 1; idx_bin < param.fbp.num_bin-1; idx_bin++)
            {
                sino_diff[idx_view][idx_bin] = sino_org[idx_view][idx_bin] - sino_mask[idx_view][idx_bin];
            }
        }

        RunBasicFP_CUDA(parameters);
        if (some condition)
        {
            xInterpolateSinoLinear(parameters);  //CPU function
        }
        else
        {
            xInterpolateSinoPoly(parameters);  //CPU function
        }

        RunBasicFBP_CUDA(parameters);
    }
}

I am using two GTX 680 cards and I want to use both devices concurrently. With the above code I am not getting any speed-up: the processing time is almost the same as when running on a single GPU.

How can I achieve concurrent execution on the two available devices?


There are 2 answers

Answer by Tom (accepted):

In your comment you say:

RunDll has two kernels and they are being launched one by one. Kernels do have cudaThreadSynchronize()

Note that cudaThreadSynchronize() is equivalent to cudaDeviceSynchronize() (and the former is actually deprecated), which means that you will run on one GPU, synchronise, then run on the other GPU. Also note that cudaMemcpy() is a blocking routine; you would need the cudaMemcpyAsync() version to avoid all blocking (as pointed out by @JackOLantern in the comments).
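To illustrate the pattern, here is a minimal sketch (with a made-up kernel and placeholder buffers, since the contents of RunDll are not shown): each device gets its own stream and pinned host buffers, every copy and launch is issued asynchronously, and synchronization is deferred until work has been queued on both devices.

#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real work; RunDll's kernels are not shown.
__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 2) num_devices = 2;

    cudaStream_t stream[2];
    float *h_data[2], *d_data[2];

    // Per-device setup: one stream and one set of buffers per GPU.
    // Pinned (page-locked) host memory is required for truly asynchronous copies.
    for (int dev = 0; dev < num_devices; dev++)
    {
        cudaSetDevice(dev);
        cudaStreamCreate(&stream[dev]);
        cudaMallocHost(&h_data[dev], n * sizeof(float));
        cudaMalloc(&d_data[dev], n * sizeof(float));
    }

    // Issue all work without blocking: by the time the first GPU is busy,
    // the loop has already moved on and queued work on the second GPU.
    for (int dev = 0; dev < num_devices; dev++)
    {
        cudaSetDevice(dev);
        cudaMemcpyAsync(d_data[dev], h_data[dev], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[dev]);
        kernel<<<(n + 255) / 256, 256, 0, stream[dev]>>>(d_data[dev], n);
        cudaMemcpyAsync(h_data[dev], d_data[dev], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[dev]);
    }

    // Synchronize only after work has been queued on BOTH devices.
    for (int dev = 0; dev < num_devices; dev++)
    {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);
    }

    return 0;
}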

In general, you will need to post more details of what is inside RunDll(), since without that your question does not have enough information to give a definitive answer. Ideally, follow these guidelines.

Answer by Vitality:

In my answer to your previous post (Concurrently running two for loops with same number of loop cycles involving GPU and CPU tasks on two GPU), I pointed out that you would not have achieved a speedup of 2 when using 2 GPUs.

To explain why, let us consider the following code snippet

Kernel1<<<...,...>>>(...); // assume Kernel1 takes t1 seconds

// assume CPUFunction plus the two cudaMemcpy calls take tCPU seconds in total
cudaMemcpy(...,...,...,cudaMemcpyDeviceToHost); // copy the results of Kernel1 to the CPU
CPUFunction(...);
cudaMemcpy(...,...,...,cudaMemcpyHostToDevice); // copy the CPU results back for Kernel2

Kernel2<<<...,...>>>(...); // assume Kernel2 takes t2 seconds

It does not matter whether the synchronization is obtained by cudaDeviceSynchronize() or by the implicitly blocking cudaMemcpy() calls.

The cost of executing the above code snippet for both iterations of the for loop on one GPU only is

t1 + tCPU + t2 + t1 + tCPU + t2 = 2*t1 + 2*tCPU + 2*t2

In the case of 2 GPUs, if you were able to achieve perfect concurrency of the execution of Kernel1 and Kernel2 on the two different GPUs, then the cost of executing the above code snippet would be

t1 (concurrent execution of Kernel1 on the two GPUs) + 2*tCPU (you need two calls to the CPU function, one for each instance of the output of Kernel1, and they run one after the other on the host) + t2 (concurrent execution of Kernel2 on the two GPUs), i.e. t1 + 2*tCPU + t2 in total.

Accordingly, the speedup achieved by using two GPUs instead of one would be

(2*(t1 + tCPU + t2)) / (t1 + 2*tCPU + t2)

When tCPU is equal to zero, the speedup becomes 2.
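As an illustrative numerical example (the timings are made up, not measured): with t1 = t2 = 1 s and tCPU = 1 s, the speedup is (2*(1 + 1 + 1)) / (1 + 2*1 + 1) = 6/4 = 1.5, well short of 2.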

This is an expression of Amdahl's law: the serial CPU work limits the overall speedup, no matter how well the GPU work is parallelized.