I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.
At each iteration, I have to work with two large m × n matrices (of doubles), A and B, where m = 1600000 and n = 500. Due to the size of the matrices and the memory capacity of each GPU (~8 GB each), I decompose the problem by partitioning the matrices row-wise into K chunks of smaller matrices that have the same number of columns, n, but fewer rows (m/K rows each).
In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.
To achieve this, I tried using a parfor loop in Matlab (with the Parallel Computing Toolbox), following best practices such as slicing and loading only the relevant data onto each worker. For reference, here is a complete code snippet; smaller, decomposed versions of the problem are provided further down in this post.
```matlab
M = 1600000; K = 4; m = M/K; n = 500;
A = randn(K,m,n); B = randn(K,m,n);
C = randn(n,2); D = zeros(K,m,2);
%delete(gcp('nocreate'));
%p = parpool('local',K);
tic
toc_load = zeros(K,1); toc_compute = zeros(K,1); toc_unload = zeros(K,1);
parfor j = 1:K
    tic
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    tic
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    tic
    B(j,:,:) = gather(B_blk);
    toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n', mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);
```
Using the tic-toc time checker (I run the code only after the parpool has already started, so pool startup does not distort the timings), I found that each worker takes on average:
- 6.33 seconds to load the data onto the GPU
- 0.18 seconds to run the computations on the GPU
- 4.91 seconds to unload the data from the GPU.
However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.
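One thing I am not sure about is whether my tic/toc numbers are even trustworthy, since gpuArray operations execute asynchronously: without explicit synchronization, the cost of the compute step can silently show up in the gather timing instead. Here is a sketch of how I would time one chunk more carefully, using wait(gpuDevice) to force synchronization between phases (R_blk here is just a stand-in computation):

```matlab
% Sketch: timing GPU phases with explicit synchronization.
% gpuArray kernels launch asynchronously, so without wait(dev)
% compute time can be misattributed to the gather step.
dev = gpuDevice;
t = tic;
A_blk = gpuArray(reshape(A(1,:,:), [m, n]));   % host -> device copy
wait(dev);
t_load = toc(t);

t = tic;
R_blk = A_blk + A_blk;                         % stand-in computation
wait(dev);                                     % block until kernels finish
t_compute = toc(t);

t = tic;
R = gather(R_blk);                             % device -> host copy
t_unload = toc(t);
fprintf('load %.3f s, compute %.3f s, unload %.3f s\n', ...
        t_load, t_compute, t_unload);
```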
I then tried a simple for loop without parallelization, see snippet below.
```matlab
%% for loop
tic
for j = 1:K
    A_blk = gpuArray(reshape(A(j,:,:), [m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j) = toc;
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);
```
This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.
From my experiments above, the actual computational cost of the linear algebra on the GPU appears low. The key bottleneck appears to be the time spent transferring the data in parallel from the CPU to the multiple GPUs and gathering it back, though it is possible that some other factor is in play.
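To sanity-check whether raw transfer bandwidth alone can account for the time, I would measure it directly on one GPU with something like the following sketch (each chunk of doubles is (m/K)*n*8 bytes, i.e. 400000 * 500 * 8 = 1.6 GB per matrix chunk with my dimensions):

```matlab
% Sketch: rough host<->device bandwidth probe for a single GPU,
% using one m/K-by-n chunk of doubles (~1.6 GB).
dev = gpuDevice;
X   = randn(400000, 500);          % one chunk
t   = tic;
Xg  = gpuArray(X);                 % host -> device
wait(dev);
t_h2d = toc(t);
t   = tic;
Xh  = gather(Xg);                  % device -> host
t_d2h = toc(t);
bytes = numel(X) * 8;              % 8 bytes per double
fprintf('H2D %.2f GB/s, D2H %.2f GB/s\n', ...
        bytes/t_h2d/1e9, bytes/t_d2h/1e9);
```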
In light of this, I have the following questions:

What exactly underlies the slowness of parfor here? Why is the communication overhead (or whatever the underlying cause) so expensive?
How can I speed up the parallel loading and unloading of data between the CPU and multiple GPUs in Matlab? Are there tricks involving spmd (or other approaches such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup)?
If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?
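For concreteness, the spmd variant I have in mind (but have not benchmarked) would look roughly like the sketch below, with one worker pinned to each GPU. Note that spmdIndex requires a recent Matlab release (labindex in older ones), and that A, B, C, D would be broadcast in full to every worker, which may itself be costly:

```matlab
% Sketch (untested): one spmd worker per GPU, each selecting its own
% device before creating any gpuArray data.
spmd (K)
    gpuDevice(spmdIndex);                        % pin this worker to its own GPU
    j     = spmdIndex;
    A_blk = gpuArray(reshape(A(j,:,:), [m, n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m, n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m, 2]));
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    B_out = gather(B_blk);                       % becomes a Composite on the client
end
for j = 1:K
    B(j,:,:) = B_out{j};                         % collect per-worker results
end
```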
Thank you for any assistance!