I have code like this:
for (int i = 0; i < 2; i++)
{
    // initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image); // function that calls the kernel
}
Each iteration in the above loop is independent. I want to run them concurrently. So, I tried this:
for (int i = 0; i < num_devices; i++)
{
    cudaSetDevice(i);
    // initialization of memory and some variables
    ........
    ........
    RunDll(input image, output image);
    // RunDll performs the following:
    {
        RunBasicFBP_CUDA(parameters); // function that calls kernel 1
        xSegmentMetal(parameters);    // CPU function
        RunBasicFP_CUDA(parameters);  // function that uses the output of kernel 1 as input for kernel 2
        for (int idx_view = 0; idx_view < param.fbp.num_view; idx_view++)
        {
            for (int idx_bin = 1; idx_bin < param.fbp.num_bin - 1; idx_bin++)
            {
                sino_diff[idx_view][idx_bin] = sino_org[idx_view][idx_bin] - sino_mask[idx_view][idx_bin];
            }
        }
        RunBasicFP_CUDA(parameters);
        if (some condition)
        {
            xInterpolateSinoLinear(parameters); // CPU function
        }
        else
        {
            xInterpolateSinoPoly(parameters);   // CPU function
        }
        RunBasicFBP_CUDA(parameters);
    }
}
I am using two GTX 680 cards and I want to use the two devices concurrently. With the above code I am not getting any speed-up; the processing time is almost the same as when running on a single GPU.
How can I achieve concurrent execution on the two available devices?
In your comment you say:
Note that cudaThreadSynchronize() is equivalent to cudaDeviceSynchronize() (and the former is actually deprecated), which means that you will run on one GPU, synchronise, then run on the other GPU. Also note that cudaMemcpy() is a blocking routine; you would need the cudaMemcpyAsync() version to avoid all blocking (as pointed out by @JackOLantern in the comments).
In general, you will need to post more details of what is inside RunDll(), since without that your question does not have enough information to give a definitive answer. Ideally, follow these guidelines.
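To illustrate the dispatch-then-synchronise pattern being described, here is a minimal sketch. It is not the asker's RunDll(); the kernel kernel1, the buffer names, and the problem size n are placeholder assumptions standing in for the real workload. The key point is that the first loop only issues asynchronous work (pinned host memory plus cudaMemcpyAsync and a stream-bound kernel launch), and no synchronisation happens until a second loop, so both GPUs can run at the same time:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the work done inside RunDll().
__global__ void kernel1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1 << 20;
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 2) num_devices = 2; // fixed-size arrays below

    cudaStream_t stream[2];
    float *d_in[2], *d_out[2], *h_in[2], *h_out[2];

    // Phase 1: issue all work asynchronously on each device.
    // No synchronisation inside this loop, so the GPUs overlap.
    for (int dev = 0; dev < num_devices; dev++)
    {
        cudaSetDevice(dev);
        cudaStreamCreate(&stream[dev]);
        cudaMalloc(&d_in[dev],  n * sizeof(float));
        cudaMalloc(&d_out[dev], n * sizeof(float));
        cudaMallocHost(&h_in[dev],  n * sizeof(float)); // pinned: required for truly async copies
        cudaMallocHost(&h_out[dev], n * sizeof(float));

        cudaMemcpyAsync(d_in[dev], h_in[dev], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[dev]);
        kernel1<<<(n + 255) / 256, 256, 0, stream[dev]>>>(d_in[dev], d_out[dev], n);
        cudaMemcpyAsync(h_out[dev], d_out[dev], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[dev]);
    }

    // Phase 2: only now wait for each device to finish, then clean up.
    for (int dev = 0; dev < num_devices; dev++)
    {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);
        cudaStreamDestroy(stream[dev]);
        cudaFree(d_in[dev]);  cudaFree(d_out[dev]);
        cudaFreeHost(h_in[dev]); cudaFreeHost(h_out[dev]);
    }
    return 0;
}
```

One caveat for the code in the question: the CPU functions (xSegmentMetal, xInterpolateSinoLinear/Poly) run on the single host thread and will still serialise the iterations between kernel launches. A common way around that is one host thread (or OpenMP thread) per GPU, each calling cudaSetDevice() once and running its whole pipeline independently.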