I have MATLAB 2016a running on a Win 10 machine with a stock i5-2500K and two GTX 970s. I'm a novice at GPU computing and I'm exploring how to speed up my computations with my GPU(s).
So I run the following simple code:
clear;
A = randn(1000,1);
B = randn(100,1);
n = 10000;
gA = gpuArray(A);
gB = gpuArray(B);
myfunc = @(a,b)(a.*b);
tic;
for i = 1:n
    C = bsxfun(myfunc,A,B');
end
disp(toc);
tic;
for i = 1:n
    C = gather(bsxfun(myfunc,gA,gB'));
end
disp(toc);
I get 8.2 (seconds) and 321.3864 (seconds) respectively.
clear;
A = randn(1000,1);
B = randn(100,1);
n = 10000;
gA = gpuArray(A);
gB = gpuArray(B);
myfunc = @(a,b)(a.*b);
tic;
parfor i = 1:n
    C = bsxfun(myfunc,A,B');
end
disp(toc);
tic;
parfor i = 1:n
    C = gather(bsxfun(myfunc,gA,gB'));
end
disp(toc);
(The only difference: for --> parfor.) I get around 2.7 (seconds) and 6.3 (seconds).
Why is the GPU approach slower in both cases please? In my work, `myfunc` is a lot more complicated. I have defined it so that it works well with the non-GPU `bsxfun`, but when I GPU-ize it like I have done above, I encounter the error "Use of functional workspace is not supported." (In my work, `myfunc` is defined inside and at the start of the `parfor` loop.) Could you please also explain what this error is indicating?
Let me start by saying that GPUs are not some magical objects that can somehow increase the rate of any computation. They are a tool that is good for certain jobs, and they have limitations that need to be considered. The rule of thumb for GPUs is that mathematical operations are "cheaper" than memory access, so, for example, code written for the GPU might run better if you recompute some array every time it's needed instead of saving it to a temporary variable and accessing it. The bottom line: GPU coding requires a somewhat different way of thinking, and the details are outside the scope of the present answer.
Here's a list of things that can be improved:
1. Random number generation:
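Generating random numbers is far more efficient on a GPU, not to mention that it saves you the costly overhead of copying the data to the device. MATLAB provides several convenience functions for creating arrays directly on a GPU. In other words, generating `A` and `B` on the host with `randn` and then copying them over with `gpuArray` can be replaced with generating them directly on the device.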
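As a minimal sketch of the on-device generation described in point 1: it assumes Parallel Computing Toolbox and a supported GPU, and it uses the `'gpuArray'` option of `randn`. The variable names follow the question.

```matlab
% Sketch only: requires Parallel Computing Toolbox and a supported GPU.
% Instead of creating the data on the host and copying it over:
%   A  = randn(1000,1);  gA = gpuArray(A);
%   B  = randn(100,1);   gB = gpuArray(B);
% the arrays are created directly in GPU memory:
gA = randn(1000, 1, 'gpuArray');
gB = randn(100, 1, 'gpuArray');
```

Note that this only helps when the host copies `A` and `B` are not needed; for a CPU-versus-GPU comparison on identical data you would still create them on the host and transfer them once, outside the timed region.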
2. Re-definition of existing functions for `bsxfun`:
There's no need to do it. Take a look at the list of built-in functions supported by `bsxfun`: `.*` (or `times`) is already one of them! Thus, you can drop the anonymous `myfunc` and call `bsxfun(@times,A,B.')` directly (or, in MATLAB releases >= R2016b, simply `A.*B.'`). Also, it's better for performance to define your custom function as a named function (for example, a local function in the same file) rather than as an anonymous function, and to pass it using `@myFunc`.
3. Using ctranspose instead of transpose:
This is explained very well here. Long story short: you should get into the habit of using `.'` for transpose, and `'` for complex-conjugate transpose.
4. Timing function executions:
Long story short: `tic` and `toc` are usually not a good indication of performance; use `timeit` instead.
5. Creating a parallel pool implicitly:
This is a fairly minor comment: in the 2nd code snippet you use `parfor` without calling `parpool` first. This means that if the pool has not been created at that stage, the creation time (several seconds) will be added to the timing reported by `tic`/`toc`. To avoid this, follow the programming principle of "Explicit is better than implicit" and just call `parpool` beforehand.
6. Compare apples to apples:
These two lines of code do not do the same amount of work: `C = bsxfun(myfunc,A,B');` and `C = gather(bsxfun(myfunc,gA,gB'));`. Why's that? Because the 2nd version also has to transfer the result of `bsxfun` from GPU memory to RAM, which is not free in terms of runtime. In the present example, this means that you're adding a transfer of ~800KB of data (1000 x 100 doubles) to every iteration. I'm assuming that your actual problem has larger matrices, so you can see how this overhead becomes serious quite fast.
7. Keeping variables you don't need:
Another minor comment: instead of doing:
You can do:
As for the error, I cannot reproduce it on my R2016b, but it sounds like some problem that is related to the incompatibility of capturing (i.e. the mechanism whereby a snapshot of the variables used in an anonymous function is taken when it is created) with the slicing needed for `parfor`. I don't know what exactly you're doing wrong, but it sounds to me that you shouldn't define a function inside a `parfor` iteration. Perhaps these posts could help: 1, 2.
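To tie several of the points above together, here is a hedged sketch of how the comparison from the question could be re-run: built-in `@times` instead of an anonymous wrapper, `.'` for the transpose, and `timeit` instead of `tic`/`toc`. It assumes Parallel Computing Toolbox and a supported GPU. Using `gputimeit` for the GPU cases is an addition of this sketch rather than something stated above; it is the toolbox's GPU-aware counterpart to `timeit`. The handle names (`cpuFcn`, `gpuFcn`, `gpuGatherFcn`) are purely illustrative.

```matlab
% Sketch only: requires Parallel Computing Toolbox and a supported GPU.
A  = randn(1000, 1);      % host copies for the CPU baseline
B  = randn(100, 1);
gA = gpuArray(A);         % same data on the device, transferred once,
gB = gpuArray(B);         % outside the timed region

% Built-in @times instead of an anonymous function; .' is the plain transpose.
cpuFcn       = @() bsxfun(@times, A,  B.');
gpuFcn       = @() bsxfun(@times, gA, gB.');          % result stays on the GPU
gpuGatherFcn = @() gather(bsxfun(@times, gA, gB.'));  % adds the device-to-host copy

tCpu       = timeit(cpuFcn);           % CPU timing
tGpu       = gputimeit(gpuFcn);        % GPU computation only
tGpuGather = gputimeit(gpuGatherFcn);  % GPU computation plus transfer back to RAM

fprintf('CPU: %.3g s, GPU: %.3g s, GPU + gather: %.3g s\n', ...
        tCpu, tGpu, tGpuGather);
```

Timing the GPU case with and without `gather` separates the cost of the computation from the cost of the transfer, which is the distinction point 6 is about. For arrays this small, there is no guarantee the GPU will come out ahead in either case.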