I have a complex matrix A, and would like to modify it Nt times according to A = exp( -1i*(A + abs(A).^2) ). The size of A is typically 1000x1000, and the number of times to run would be around 10000.
I am looking to reduce the time taken to carry out these operations. For 1000 iterations on the CPU, I measure around 6.4 seconds. Following the Matlab documentation, I was able to move this to the GPU, which reduced the time taken to 0.07 seconds (an incredible x91 improvement!). So far so good.
However, I have also now read this link in the docs, which describes how we can sometimes find even further improvement for element-wise calculations by using arrayfun() as well. If I try to follow that tutorial, the time taken is actually worse, clocking in at 0.47 seconds. My tests are shown below:
Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;
A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix
gpu_A = gpuArray(A); % Transfer matrix to a GPU array
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;
%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;
%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
    gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;
%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end
and the results are:
Time taken using only CPU: 6.38785
Time taken using gpuArray directly: 0.0680587
Time taken using GPU + arrayfun(): 0.474612
My questions are:
- Am I using arrayfun() correctly in this situation, and is it expected that arrayfun() should be worse here?
- If so, and it really is just expected to be slower than the direct gpuArray method, is there any easy (i.e. non-MEX) way to speed up such a calculation? (I see the docs also mention using pagefun, for example.)
Thanks in advance for any advice.
(The graphics card is Nvidia Quadro M4000, and I am running Matlab R2017a)
Edit
After reading @Edric's answer, I think it is important to show a little more of the wider code. One thing I didn't mention in the OP is that in my actual main code, inside the k=1:Nt loop, there is an additional operation: a matrix multiplication with the transpose of a sparse, tridiagonal matrix. Here is a more fleshed-out MWE of what is really going on:
Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
cpu_data_out = A;
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
    cpu_data_out = cpu_data_out*M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end
I'm very sorry for not including that in the OP; I did not realise at the time that it might be relevant to the solution. Does this change things? Are there still gains to be made with arrayfun() on the GPU, or is this now not suitable for converting to arrayfun()?
A few points here. Firstly (and most importantly), to time code on the GPU, you need to use either gputimeit, or you need to inject a call to wait(gpuDevice) before calling toc. That's because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the gpuArray method, and 0.18 seconds for the arrayfun version.
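For example, the GPU timing loop from the question could be adjusted along these lines (a sketch reusing the variable names from the question's code):
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function( gpu_data_out );
end
wait(gpuDevice); % Block until all queued GPU work has actually finished
tgpu = toc;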
Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the arrayfun function body so that the loop runs directly on the GPU. Like this:
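% Sketch of test_function modified to take Nt as a second argument and run
% the whole iteration loop inside the function body, so that the loop
% executes on the GPU when called through arrayfun.
function x = test_function(x, Nt)
    for k = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
end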
You'll need to invoke it like A = arrayfun(@test_function, A, Nt). On my GPU, this brings the arrayfun time down to 0.05 seconds, so about twice as fast as the plain gpuArray version.
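For reference, a sketch of how that pushed-loop version could be timed with gputimeit (which synchronises with the GPU for you, so no explicit wait(gpuDevice) is needed):
% gpu_A and Nt are captured from the workspace by the anonymous function.
tgpu_arrayfun = gputimeit( @() arrayfun(@test_function, gpu_A, Nt) );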