I have code like this:

    for (int i = 0; i < 2; i++)
    {
        // initialization of memory and some variables
        ........
        ........
        RunDll(input image, output image);  // function that calls kernel
    }
Each iteration in the above loop is independent. I want to run them concurrently. So, I tried this:
    for (int i = 0; i < num_devices; i++)
    {
        cudaSetDevice(i);
        // initialization of memory and some variables
        ........
        ........
        RunDll(input image, output image);
        {
            RunBasicFBP_CUDA(parameters);  // function that calls kernel 1
            xSegmentMetal(parameters);     // CPU function
            RunBasicFP_CUDA(parameters);   // function that uses output of kernel 1 as input for kernel 2
            for (int idx_view = 0; idx_view < param.fbp.num_view; idx_view++)
            {
                for (int idx_bin = 1; idx_bin < param.fbp.num_bin - 1; idx_bin++)
                {
                    sino_diff[idx_view][idx_bin] = sino_org[idx_view][idx_bin] - sino_mask[idx_view][idx_bin];
                }
            }
            RunBasicFP_CUDA(parameters);
            if (some condition)
            {
                xInterpolateSinoLinear(parameters);  // CPU function
            }
            else
            {
                xInterpolateSinoPoly(parameters);    // CPU function
            }
            RunBasicFBP_CUDA(parameters);
        }
    }
I am using two GTX 680s and I want to use both devices concurrently. With the above code I am not getting any speed-up; the processing time is almost the same as when running on a single GPU.
How can I achieve concurrent execution on the two available devices?
In your comment you say:

Note that `cudaThreadSynchronize()` is equivalent to `cudaDeviceSynchronize()` (and the former is actually deprecated), which means that you will run on one GPU, synchronise, then run on the other GPU. Also note that `cudaMemcpy()` is a blocking routine; you would need the `cudaMemcpyAsync()` version to avoid all blocking (as pointed out by @JackOLantern in the comments).

In general, you will need to post more details of what is inside `RunDLL()`, since without that your question does not have enough information to give a definitive answer. Ideally follow these guidelines.
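To illustrate the issue-then-synchronise pattern described above, here is a minimal sketch: all copies and kernel launches are issued asynchronously on a per-device stream before anything waits, so both GPUs run at the same time. The kernel, buffer names, and size `N` are placeholders, not the internals of your `RunDll()`:

```cuda
#include <cuda_runtime.h>

// Stand-in for the real per-pixel work done by the kernels in RunDll().
__global__ void processImageKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main(void)
{
    const int N = 1 << 20;
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 2) num_devices = 2;

    float *h_in[2], *h_out[2], *d_in[2], *d_out[2];
    cudaStream_t stream[2];

    // Per-device setup: pinned host buffers (cudaMallocHost) are required
    // for cudaMemcpyAsync to be truly asynchronous.
    for (int i = 0; i < num_devices; i++)
    {
        cudaSetDevice(i);
        cudaStreamCreate(&stream[i]);
        cudaMallocHost(&h_in[i],  N * sizeof(float));
        cudaMallocHost(&h_out[i], N * sizeof(float));
        cudaMalloc(&d_in[i],  N * sizeof(float));
        cudaMalloc(&d_out[i], N * sizeof(float));
    }

    // Issue ALL work asynchronously before synchronising anything:
    // no blocking call separates the two devices' launches.
    for (int i = 0; i < num_devices; i++)
    {
        cudaSetDevice(i);
        cudaMemcpyAsync(d_in[i], h_in[i], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        processImageKernel<<<(N + 255) / 256, 256, 0, stream[i]>>>(d_in[i], d_out[i], N);
        cudaMemcpyAsync(h_out[i], d_out[i], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    // Only now wait; by this point both GPUs have been working in parallel.
    for (int i = 0; i < num_devices; i++)
    {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }
    return 0;
}
```

Note that this only overlaps the GPU portions; the CPU functions inside your loop (such as `xSegmentMetal()`) still execute serially on the host unless you also split the loop across host threads.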