I'm running a CUDA kernel on a system with 4 GPUs. I expected the kernels to be launched concurrently, but they are not: I measured the starting time of each kernel, and the second kernel starts only after the first one finishes its execution. So launching the kernel on 4 GPUs is no faster than on a single GPU.
How can I make them work concurrently?
This is my code:
cudaSetDevice(0);
GPU_kernel<<<gridDim, threadsPerBlock>>>(d_result_0, parameterA + (0 * rateA), parameterB + (0 * rateB));
cudaMemcpyAsync(h_result_0, d_result_0, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(1);
GPU_kernel<<<gridDim, threadsPerBlock>>>(d_result_1, parameterA + (1 * rateA), parameterB + (1 * rateB));
cudaMemcpyAsync(h_result_1, d_result_1, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(2);
GPU_kernel<<<gridDim, threadsPerBlock>>>(d_result_2, parameterA + (2 * rateA), parameterB + (2 * rateB));
cudaMemcpyAsync(h_result_2, d_result_2, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(3);
GPU_kernel<<<gridDim, threadsPerBlock>>>(d_result_3, parameterA + (3 * rateA), parameterB + (3 * rateB));
cudaMemcpyAsync(h_result_3, d_result_3, mem_size_result, cudaMemcpyDeviceToHost);
I have done some experiments on achieving concurrent execution on a cluster of 4 Kepler K20c GPUs. I have considered 8 test cases, whose corresponding codes along with the profiler timelines are reported below.

Test case #1 - "Breadth-first" approach - synchronous copy
- Code -
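The original listing did not survive here, so what follows is only a minimal sketch of the breadth-first/synchronous pattern; the kernel body, data type, and problem size are illustrative placeholders, not the code actually profiled:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h[NUM_GPUS], *d[NUM_GPUS];
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        h[g] = (float *)malloc(N * sizeof(float));        // pageable host memory
        cudaMalloc((void **)&d[g], N * sizeof(float));
    }
    // "Breadth-first": issue one stage on every GPU before the next stage.
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpy(d[g], h[g], N * sizeof(float), cudaMemcpyHostToDevice);  // blocks host
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>>(d[g], N); // async launch
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpy(h[g], d[g], N * sizeof(float), cudaMemcpyDeviceToHost);  // blocks host
    }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaFree(d[g]); free(h[g]); }
    return 0;
}
```

The blocking cudaMemcpy calls serialize on the host thread, while the asynchronous kernel launches are free to overlap across devices.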
- Profiler timeline -
As can be seen, the use of cudaMemcpy does not enable achieving concurrency in the copies, since the blocking calls serialize the host thread; concurrency is nevertheless achieved in kernel execution, because kernel launches are asynchronous.

Test case #2 - "Depth-first" approach - synchronous copy
- Code -
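Again, the listing was stripped; a sketch of the depth-first/synchronous pattern, with an illustrative placeholder kernel and sizes, might look like:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h[NUM_GPUS], *d[NUM_GPUS];
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        h[g] = (float *)malloc(N * sizeof(float));        // pageable host memory
        cudaMalloc((void **)&d[g], N * sizeof(float));
    }
    // "Depth-first": issue the full copy/kernel/copy sequence per GPU.
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpy(d[g], h[g], N * sizeof(float), cudaMemcpyHostToDevice);   // blocks host
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>>(d[g], N);
        // This blocking copy waits for the kernel above, so the next GPU's
        // work cannot be issued until this GPU is completely done.
        cudaMemcpy(h[g], d[g], N * sizeof(float), cudaMemcpyDeviceToHost);
    }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaFree(d[g]); free(h[g]); }
    return 0;
}
```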
- Profiler timeline -
This time, concurrency is achieved neither within memory copies nor within kernel executions: the blocking device-to-host cudaMemcpy issued for each GPU waits for that GPU's kernel to finish before the host can move on to the next device.
Test case #3 - "Depth-first" approach - asynchronous copy with streams
- Code -
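The listing is missing here as well; a sketch of the depth-first/asynchronous pattern with one purposely created stream per device (placeholder kernel and sizes) might be:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h[NUM_GPUS], *d[NUM_GPUS];
    cudaStream_t s[NUM_GPUS];
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMallocHost((void **)&h[g], N * sizeof(float));  // pinned: needed for true async copies
        cudaMalloc((void **)&d[g], N * sizeof(float));
        cudaStreamCreate(&s[g]);
    }
    // Every call below returns immediately, so this single loop fans the
    // whole copy/kernel/copy sequence out to all 4 GPUs.
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(d[g], h[g], N * sizeof(float), cudaMemcpyHostToDevice, s[g]);
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, s[g]>>>(d[g], N);
        cudaMemcpyAsync(h[g], d[g], N * sizeof(float), cudaMemcpyDeviceToHost, s[g]);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaStreamSynchronize(s[g]);
        cudaStreamDestroy(s[g]);
        cudaFree(d[g]);
        cudaFreeHost(h[g]);
    }
    return 0;
}
```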
- Profiler timeline -
Concurrency is achieved, as expected.
Test case #4 - "Depth-first" approach - asynchronous copy within default streams
- Code -
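Since the listing was stripped, here is a sketch of the same depth-first/asynchronous pattern issued into each device's default stream (placeholder kernel and sizes). Each device has its own default stream, and cudaMemcpyAsync on pinned memory returns immediately, so the single loop still fans out to all GPUs:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h[NUM_GPUS], *d[NUM_GPUS];
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMallocHost((void **)&h[g], N * sizeof(float));  // pinned host memory
        cudaMalloc((void **)&d[g], N * sizeof(float));
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        // Stream 0 here is the default stream of the current device.
        cudaMemcpyAsync(d[g], h[g], N * sizeof(float), cudaMemcpyHostToDevice, 0);
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>>(d[g], N);
        cudaMemcpyAsync(h[g], d[g], N * sizeof(float), cudaMemcpyDeviceToHost, 0);
    }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaDeviceSynchronize(); }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaFree(d[g]); cudaFreeHost(h[g]); }
    return 0;
}
```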
- Profiler timeline -
Despite using the default stream, concurrency is achieved.
Test case #5 - "Depth-first" approach - asynchronous copy within default stream and unique host cudaMallocHost-ed vector
- Code -
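The listing was stripped; a sketch of this variant, where a single cudaMallocHost-ed vector serves all devices through per-device offsets (placeholder kernel and sizes), might be:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h, *d[NUM_GPUS];
    // One pinned host vector shared by all devices, sliced by offsets.
    cudaMallocHost((void **)&h, (size_t)NUM_GPUS * N * sizeof(float));
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMalloc((void **)&d[g], N * sizeof(float));
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(d[g], h + (size_t)g * N, N * sizeof(float), cudaMemcpyHostToDevice, 0);
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>>(d[g], N);
        cudaMemcpyAsync(h + (size_t)g * N, d[g], N * sizeof(float), cudaMemcpyDeviceToHost, 0);
    }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaDeviceSynchronize(); }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaFree(d[g]); }
    cudaFreeHost(h);
    return 0;
}
```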
- Profiler timeline -
Concurrency is achieved once again.
Test case #6 - "Breadth-first" approach with asynchronous copy with streams
- Code -
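The listing is missing; a sketch of the breadth-first/asynchronous pattern with one stream per device (placeholder kernel and sizes) might look like:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h[NUM_GPUS], *d[NUM_GPUS];
    cudaStream_t s[NUM_GPUS];
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMallocHost((void **)&h[g], N * sizeof(float));  // pinned host memory
        cudaMalloc((void **)&d[g], N * sizeof(float));
        cudaStreamCreate(&s[g]);
    }
    // "Breadth-first": issue each stage on all GPUs before the next stage.
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(d[g], h[g], N * sizeof(float), cudaMemcpyHostToDevice, s[g]);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE, 0, s[g]>>>(d[g], N);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(h[g], d[g], N * sizeof(float), cudaMemcpyDeviceToHost, s[g]);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaStreamSynchronize(s[g]);
        cudaStreamDestroy(s[g]); cudaFree(d[g]); cudaFreeHost(h[g]);
    }
    return 0;
}
```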
- Profiler timeline -
Concurrency is achieved, as in the corresponding "depth-first" approach.
Test case #7 - "Breadth-first" approach - asynchronous copy within default streams
- Code -
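The listing is missing; a sketch of the breadth-first/asynchronous pattern using each device's default stream (placeholder kernel and sizes) might be:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h[NUM_GPUS], *d[NUM_GPUS];
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMallocHost((void **)&h[g], N * sizeof(float));  // pinned host memory
        cudaMalloc((void **)&d[g], N * sizeof(float));
    }
    // Breadth-first stages; stream 0 is the current device's default stream.
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(d[g], h[g], N * sizeof(float), cudaMemcpyHostToDevice, 0);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>>(d[g], N);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(h[g], d[g], N * sizeof(float), cudaMemcpyDeviceToHost, 0);
    }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaDeviceSynchronize(); }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaFree(d[g]); cudaFreeHost(h[g]); }
    return 0;
}
```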
- Profiler timeline -
Concurrency is achieved, as in the corresponding "depth-first" approach.
Test case #8 - "Breadth-first" approach - asynchronous copy within the default stream and unique host cudaMallocHost-ed vector
- Code -
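The listing is missing; a sketch of this last variant, breadth-first within the default streams and with a single cudaMallocHost-ed host vector sliced by offsets (placeholder kernel and sizes), might be:

```cuda
#include <cuda_runtime.h>

#define NUM_GPUS  4
#define N         (1 << 24)   // elements per GPU (illustrative)
#define BLOCKSIZE 256

// Placeholder kernel; the profiled kernel is not shown in this post.
__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 2.0f * d[i] + 1.0f;
}

int main() {
    float *h, *d[NUM_GPUS];
    // One pinned host vector shared by all devices.
    cudaMallocHost((void **)&h, (size_t)NUM_GPUS * N * sizeof(float));
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMalloc((void **)&d[g], N * sizeof(float));
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(d[g], h + (size_t)g * N, N * sizeof(float), cudaMemcpyHostToDevice, 0);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        dummyKernel<<<(N + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>>(d[g], N);
    }
    for (int g = 0; g < NUM_GPUS; g++) {
        cudaSetDevice(g);
        cudaMemcpyAsync(h + (size_t)g * N, d[g], N * sizeof(float), cudaMemcpyDeviceToHost, 0);
    }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaDeviceSynchronize(); }
    for (int g = 0; g < NUM_GPUS; g++) { cudaSetDevice(g); cudaFree(d[g]); }
    cudaFreeHost(h);
    return 0;
}
```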
- Profiler timeline -
Concurrency is achieved, as in the corresponding "depth-first" approach.
Conclusion

Using asynchronous copies guarantees concurrent execution, whether with purposely created streams or with the default stream.
Note

In all the above examples, I have taken care to provide the GPUs with enough work to do, both in terms of copies and of computing tasks. Failing to provide enough work to the cluster may prevent observing concurrent execution.