Why is bsxfun on a gpuArray so slow?


I have MATLAB 2016a running on a Win 10 machine with stock i5 2500K and 2 GTX 970. I'm a novice at GPU computing and I'm exploring how to speed up my computations with my GPU(s).

So I run the following simple code:

clear;

A = randn(1000,1);
B = randn(100,1);
n = 10000;

gA = gpuArray(A);
gB = gpuArray(B);

myfunc = @(a,b)(a.*b);

tic;
for i = 1:n
    C = bsxfun(myfunc,A,B');
end
disp(toc);

tic;
for i = 1:n
    C = gather(bsxfun(myfunc,gA,gB'));
end
disp(toc);

I get 8.2 seconds and 321.3864 seconds, respectively.

clear;

A = randn(1000,1);
B = randn(100,1);
n = 10000;

gA = gpuArray(A);
gB = gpuArray(B);

myfunc = @(a,b)(a.*b);

tic;
parfor i = 1:n
    C = bsxfun(myfunc,A,B');
end
disp(toc);

tic;
parfor i = 1:n
    C = gather(bsxfun(myfunc,gA,gB'));
end
disp(toc);

(The only difference: for --> parfor.) This time I get around 2.7 seconds and 6.3 seconds.

Why is the GPU approach slower in both cases? In my work, myfunc is a lot more complicated. I have defined it so that it works well with the non-GPU bsxfun, but when I GPU-ize it as I have done above, I encounter the error Use of functional workspace is not supported. (In my work, myfunc is defined inside and at the start of the parfor loop.) Could you please also explain what this error indicates?


There are 2 answers

Answer by Dev-iL (accepted):

Let me start by saying that GPUs are not magical objects that can somehow increase the rate of any computation. They are a tool that is good for certain jobs, and they have limitations that need to be considered. The rule of thumb for GPUs is that mathematical operations are "cheaper" than memory access, so for example, code written for the GPU might run better if you recompute an array every time it is needed instead of saving it to a temporary variable and reading it back. The bottom line: GPU coding requires somewhat different thinking, and these considerations are outside the scope of the present answer.


Here's a list of things that can be improved:

1. Random number generation:

Generating the random numbers directly on the GPU is more efficient, not to mention that it saves you the costly host-to-device transfer. MATLAB provides several convenience functions for creating arrays directly on the GPU. In other words,

A = randn(1000,1);
gA = gpuArray(A);

can be replaced with:

gA = gpuArray.randn(1000,1);

2. Re-definition of existing functions for bsxfun:

There's no need to do it. Take a look at the list of built-in functions supported by bsxfun: .* (times) is already one of them! Thus, you can replace:

myfunc = @(a,b)(a.*b);
...
bsxfun(myfunc,A,B');

with:

bsxfun(@times,A,B.');

(or in MATLAB releases >= R2016b: A.*B.').

Also, it's better for performance to define your custom function as a local function in the same function file and call it using @myFunc, i.e.:

function main
...
bsxfun(@myFunc,A,B')

% later in the same file, or in a completely different one:
function out = myFunc(a,b)
out = ...

3. Using ctranspose instead of transpose:

This is explained very well here. Long story short: you should get into the habit of using .' for transpose, and ' for complex-conjugate transpose.
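
For example (illustrative only; the two operators only differ for complex data):

z = [1+2i; 3-4i];
z.'    % transpose:           1+2i   3-4i
z'     % conjugate transpose: 1-2i   3+4i

For the real-valued arrays in your code the results are identical, but .' states your intent and avoids surprises if the data ever becomes complex.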

4. Timing function executions:

Long story short: tic & toc usually don't give a reliable measurement; use timeit instead (and gputimeit for code that runs on the GPU).
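
As a rough sketch (sizes as in your question), the comparison could be re-timed like this; gputimeit also makes sure the device has finished before the timer stops:

A  = randn(1000,1);   B  = randn(100,1);
gA = gpuArray(A);     gB = gpuArray(B);

tCPU = timeit(@() bsxfun(@times, A, B.'));       % CPU version
tGPU = gputimeit(@() bsxfun(@times, gA, gB.'));  % GPU version, no gather
fprintf('CPU: %g s, GPU: %g s\n', tCPU, tGPU);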

5. Creating a parallel pool implicitly:

This is a fairly minor comment: in the 2nd code snippet you use parfor without calling parpool first. This means that if the pool has not been created at that stage, its creation time (several seconds) will be added to the timing reported by tic/toc. To avoid this, follow the programming principle of "explicit is better than implicit" and just call parpool beforehand, as sketched below.
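
A minimal sketch of what I mean (the loop body mirrors your code):

n = 10000;  A = randn(1000,1);  B = randn(100,1);
if isempty(gcp('nocreate'))
    parpool;   % create the pool up front, outside the timed region
end
tic;
parfor i = 1:n
    C = bsxfun(@times, A, B.');
end
disp(toc);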

6. Compare apples to apples:

These two lines of code do not do the same amount of work:

C = bsxfun(myfunc,A,B');
C = gather(bsxfun(myfunc,gA,gB'));

Why's that? Because the 2nd version also has to transfer the result of bsxfun from GPU memory to RAM - which is not free (in terms of runtime). In the present example, this means that you're adding a transfer of ~800KB of data to every iteration. I'm assuming that your actual problem has larger matrices, so you understand this overhead becomes serious quite fast.
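
If you do need the result on the CPU eventually, a fairer sketch keeps the intermediate result on the device and transfers it only once, after the loop:

n = 10000;
gA = gpuArray.randn(1000,1);
gB = gpuArray.randn(100,1);
for i = 1:n
    gC = bsxfun(@times, gA, gB.');   % result stays in GPU memory
end
C = gather(gC);                      % one transfer instead of n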

7. Keeping variables you don't need:

Another minor comment: instead of doing:

parfor i = 1:n % or "for"
    C = bsxfun(myfunc,A,B');
end

You can do:

parfor i = 1:n % or "for"
    [~] = bsxfun(myfunc,A,B');
end

As for the error, I cannot reproduce it on my R2016b, but it sounds like some problem that is related to the incompatibility of capturing (i.e. the mechanism whereby a snapshot of the variables used in an anonymous function is taken when it is created) with the slicing needed for parfor. I don't know what exactly you're doing wrong, but it sounds to me that you shouldn't define a function inside a parfor iteration. Perhaps these posts could help: 1, 2.

Answer by Joss Knight:

I ran your code on my desktop with a Tesla K20 and a Quadro K620, and on my laptop with a GTX 9XXM chip. In neither case did I get anything remotely as bad as your first set of numbers, and such a big discrepancy is not indicated by the specs of the two GeForce cards (well, your card and my chip). Perhaps your first set of timings was skewed by some overhead? Certainly your second set of timings would suggest so, because the improvement there is too large to explain otherwise.

Still, I can answer the general point. Why are you calling gather? Most of the cost is in the data transfer, cf:

>> gputimeit(@()gA.*gB')
ans = 
    2.6918e-04

>> gputimeit(@()gather(gA.*gB'))
ans =
    0.0011

So with the data transfer of 100 thousand elements (0.8MB) included, the time rises to 1.1 milliseconds, compared to the 0.3ms cost of the actual computation alone. (Note here that I'm not using bsxfun because MATLAB R2016b does automatic dimension expansion for element-wise operations, which eliminates the need for it.)

Unless you need to display it or write it to disk (or run an operation on it that doesn't support GPU computation), you never need to gather a gpuArray, so don't, leave it on the device. The CPU isn't hobbled by this data transfer issue, so of course it looks good in comparison.
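
As a sketch of what "leave it on the device" means in practice, chain the work on the gpuArray and gather only the final, much smaller result; the sum here is just a placeholder for whatever reduction your real code ends with:

gC = bsxfun(@times, gA, gB.');   % or gA.*gB.' in R2016b and later
s  = gather(sum(gC(:)));         % transfer one number, not 0.8MB per iteration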

Another point is that GeForce cards have terrible performance for double precision: your GTX 970, for instance, has a reported 3494 GFLOPS for single precision but a weedy 109 GFLOPS for double. They're optimised for graphics, after all. However, switching to single precision won't make much difference in this case, since the operation is data-transfer bound first and memory-bandwidth bound second; compute time doesn't really come into it.
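
If you did want to try single precision anyway, the data can be generated in that precision directly on the device, for example:

gA = gpuArray.randn(1000, 1, 'single');
gB = gpuArray.randn(100,  1, 'single');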

As far as parallelizing is concerned, I question your numbers, because the improvement seems too good for just two GPUs. Still, you do have two GPUs, so they can perform data transfer (and computation) in parallel, and you get some improvement from that; just not enough to offset the overhead in your case, it seems.
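
A rough sketch of how the two cards could be driven explicitly, one worker per device (sizes as in the question):

parpool(2);                            % one worker per GPU
spmd
    gpuDevice(labindex);               % worker k selects GPU k
    gAk = gpuArray.randn(1000,1);
    gBk = gpuArray.randn(100,1);
    gCk = bsxfun(@times, gAk, gBk.');  % runs independently on each card
end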