I have a complex matrix A, and would like to modify it Nt times according to A = exp( -1i*(A + abs(A).^2) ). The size of A is typically 1000x1000, and the number of times to run would be around 10000.
I am looking to reduce the time taken to carry out these operations. For 1000 iterations on the CPU, I measure around 6.4 seconds. Following the Matlab documentation, I was able to move this to the GPU, which reduced the time taken to 0.07 seconds (an incredible x91 improvement!). So far so good.
However, I have also now read this link in the docs, which describes how we can sometimes find even further improvement for element-wise calculations by using arrayfun() as well. If I try to follow that tutorial, the time taken is actually worse, clocking in at 0.47 seconds. My tests are shown below:
Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;
A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix
gpu_A = gpuArray(A); % Transfer matrix to a GPU array
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;
%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;
%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
    gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;
%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end
and the results are:
Time taken using only CPU: 6.38785
Time taken using gpuArray directly: 0.0680587
Time taken using GPU + arrayfun(): 0.474612
My questions are:
- Am I using arrayfun() correctly in this situation, and is it expected that arrayfun() should be worse here?
- If so, and it really is just expected to be slower than the direct gpuArray method, is there any easy (i.e. non-MEX) way to speed up such a calculation? (I see the docs also mention using pagefun, for example.)
Thanks in advance for any advice.
(The graphics card is Nvidia Quadro M4000, and I am running Matlab R2017a)
Edit
After reading @Edric's answer, I think it is important to show a little more of the wider code. One thing I didn't mention in the OP is that in my actual main code, inside the k=1:Nt loop, there is an additional operation: a matrix multiplication with the transpose of a sparse, tridiagonal matrix. Here is a more fleshed-out MWE of what is really going on:
Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
cpu_data_out = A;
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
    cpu_data_out = cpu_data_out*M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end
I'm very sorry for not including that in the OP; I did not realise at the time that it might be relevant to the solution. Does this change things? Are there still gains to be made with arrayfun() on the GPU, or is this now not suitable for converting to arrayfun()?
A few points here. Firstly (and most importantly), to time code on the GPU, you need to use either gputimeit, or you need to inject a call to wait(gpuDevice) before calling toc. That's because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the gpuArray method, and 0.18 seconds for the arrayfun version.
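For example, the GPU timing loop from the question could be adjusted along these lines (a sketch reusing the variable names from the question's code):
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function( gpu_data_out );
end
wait(gpuDevice); % Block until all queued GPU work has actually finished
tgpu = toc;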
Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the arrayfun function body so that the loop runs directly on the GPU. Like this:
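% Sketch of test_function modified to take Nt as a second argument and run
% the whole iteration loop inside the function body, so that the loop
% executes on the GPU when called through arrayfun.
function x = test_function(x, Nt)
    for k = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
end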
You'll need to invoke it like A = arrayfun(@test_function, A, Nt). On my GPU, this brings the arrayfun time down to 0.05 seconds, so about twice as fast as the plain gpuArray version.
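For reference, a sketch of how that pushed-loop version could be timed with gputimeit (which synchronises with the GPU for you, so no explicit wait(gpuDevice) is needed):
% gpu_A and Nt are captured from the workspace by the anonymous function.
tgpu_arrayfun = gputimeit( @() arrayfun(@test_function, gpu_A, Nt) );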