I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.

At each iteration, I have to work with two large m by n matrices (of doubles), A and B, where m = 1600000, and n = 500. Due to the size of the matrices and the memory capacity of each GPU (~8 GB memory each), I decompose the problem by partitioning the matrices row-wise into K chunks of smaller matrices who has the same number of n columns but fewer rows (M /K).

In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.

To achieve this, I tried using the parfor loop in Matlab (with the parallel computing toolbox), utilizing best practices such as slicing, loading only relevant data for each worker. For posterity, here is a complete code snippet. I have provided small, decomposed problems deeper down in this post.

M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);

%delete(gcp('nocreate'));
%p = parpool('local',K);

tic
toc_load = zeros(K,1);
toc_compute = zeros(K,1);
toc_unload = zeros(K,1);
parfor j = 1:K
    tic
    A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j)  = toc;
    tic
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    tic
    B(j,:,:) = gather(B_blk);
    toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);

Using the tic-toc time checker (I run the code only after starting the parpool to ensure that time-tracker is accurate), I found that each worker takes on average:

  • 6.33 seconds to load the data onto the GPU
  • 0.18 seconds to run the computations on the GPU
  • 4.91 seconds to unload the data from the GPU.

However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.

I then tried a simple for loop without parallelization, see snippet below.

%% for loop
tic
for j = 1:K
    A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j)  = toc;
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);

This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.

From my experiments above, I have observed that the actual computational cost of the GPU working on the linear algebra tasks appears low. The key bottleneck appears to be the time taken in loading the data in parallel from CPU onto the multiple GPUs, and gathering the data from the multiple GPUs back to CPU, though it is also possible that there is some other factor in play.

In lieu of this, I have the following questions:

  1. What exactly is underlying the slowness of parfor? Why is the communication overhead (or whatever the underlying reason) so expensive?

  2. How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and then back in Matlab? Are there tricks involving parfor, spmd (or other things such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup) ?

  3. If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?

Thank you for any assistance!

1 Answers

0
Edric On

Sending data to/from AWS instances to use with parfor is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there's additional overhead because all communication with AWS workers use secure communication.

You can use ticBytes and tocBytes to see how much data is being transferred.

To improve the performance, I would suggest doing everything possible to avoid transferring large amounts of data between your client and the workers. It can often be more efficient to build data directly on the workers, even if this means building arrays redundantly multiple times.

Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using rand - which is easy to run on the cluster, but presumably not really representative.

Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with parallel.pool.Constant, or just do everything inside a single spmd block or something. (Your parfor loop as written could equally use spmd since you're arranging things to have one iteration per worker).