I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.

At each iteration, I have to work with two large `m` by `n` matrices (of doubles), A and B, where `m = 1600000`, and `n = 500`. Due to the size of the matrices and the memory capacity of each GPU (~8 GB memory each), I decompose the problem by partitioning the matrices row-wise into `K` chunks of smaller matrices who has the same number of `n` columns but fewer rows (`M /K`).

In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.

To achieve this, I tried using the `parfor` loop in Matlab (with the parallel computing toolbox), utilizing best practices such as slicing, loading only relevant data for each worker. For posterity, here is a complete code snippet. I have provided small, decomposed problems deeper down in this post.

``````M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);

%delete(gcp('nocreate'));
%p = parpool('local',K);

tic
toc_compute = zeros(K,1);
parfor j = 1:K
tic
A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
C_blk = gpuArray(C);
D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
tic
B_blk = D_blk * C_blk' + A_blk + B_blk;
toc_compute(j) = toc;
tic
B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('the entire process took %f seconds \n', toc_all);
``````

Using the tic-toc time checker (I run the code only after starting the parpool to ensure that time-tracker is accurate), I found that each worker takes on average:

• 6.33 seconds to load the data onto the GPU
• 0.18 seconds to run the computations on the GPU
• 4.91 seconds to unload the data from the GPU.

However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.

I then tried a simple for loop without parallelization, see snippet below.

``````%% for loop
tic
for j = 1:K
A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
C_blk = gpuArray(C);
D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
B_blk = D_blk * C_blk' + A_blk + B_blk;
toc_compute(j) = toc;
B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);
``````

This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.

From my experiments above, I have observed that the actual computational cost of the GPU working on the linear algebra tasks appears low. The key bottleneck appears to be the time taken in loading the data in parallel from CPU onto the multiple GPUs, and gathering the data from the multiple GPUs back to CPU, though it is also possible that there is some other factor in play.

In lieu of this, I have the following questions:

1. What exactly is underlying the slowness of `parfor`? Why is the communication overhead (or whatever the underlying reason) so expensive?

2. How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and then back in Matlab? Are there tricks involving `parfor`, `spmd` (or other things such as `parfeval`, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup) ?

3. If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?

Thank you for any assistance!

Sending data to/from AWS instances to use with `parfor` is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there's additional overhead because all communication with AWS workers use secure communication.
You can use `ticBytes` and `tocBytes` to see how much data is being transferred.
Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using `rand` - which is easy to run on the cluster, but presumably not really representative.
Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with `parallel.pool.Constant`, or just do everything inside a single `spmd` block or something. (Your `parfor` loop as written could equally use `spmd` since you're arranging things to have one iteration per worker).