I have written a piece of code in MATLAB (R2018a) that is a hybrid between standard MATLAB code and CUDA code, linked by compiling with mexcuda. The core loop in my code interpolates a matrix, say from size [n x m] to [N x M]. I have sped up this part using the GPU.
Since this interpolation sits inside a loop, and since the sizes of the matrices I interpolate (both before and after) are the same in every iteration, I want to speed up the application by preallocating an array of the output size on the GPU. So I want to do something like: zeros(N,M,'gpuArray')
once at the start, provide it as input to the mexFunction, and write the interpolated matrix into this array. This would save quite a bit of allocation time (roughly (N_iterations - 1) * allocation_time).
My problem now is: I cannot figure out whether this is possible. Using a mexFunction() as the entry point, the only way I know of to retrieve input arrays is something along the lines of:
mxGPUArray const *in = mxGPUCreateFromMxArray(prhs[0]);
float const *dev_in = (float const *)(mxGPUGetDataReadOnly(in));
but, as the name suggests, this gives read-only access. I cannot use mxGPUGetData(in), because the mxGPUArray is const: one cannot initialise a non-const pointer from it. Does anyone know of a way around this issue that does not involve allocating the array inside the mexFunction?
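For reference, the only writable retrieval I am aware of makes a device-side copy, which defeats the purpose of preallocating. A minimal sketch, assuming the preallocated array were passed as the second input (prhs[1]):

    mxGPUArray *out = mxGPUCopyFromMxArray(prhs[1]); /* writable, but copies the data */
    float *dev_out = (float *)(mxGPUGetData(out));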
EDIT:
The code below shows two C code samples, where the first is an analogy for my current code, and the second is what I am striving for:
Current:
#include "stdio.h"
int main(const int argc, const char *argv[]) {
// Allocate input matrix and fill from input arguments
FILE *fPtr; fPtr = fopen(argv[1],"rb");
double *mat_in = malloc(n*m*sizeof(*mat_in));
mat_in = fread(mat_in, sizeof(*mat_in), n*m, fPtr);
fclose(fPtr);
double *mat_out;
for (int it = 0, it < 1000, it++) {
// Allocate output array and fill it;
mat_out = malloc(N*M*sizeof(*mat_out));
interpolation_function(mat_in, mat_out);
// Do stuff with mat_out
free(mat_out);
}
// Free mat_in, do more stuff and/or exit program
Idea:
#include "stdio.h"
int main(const int argc, const char *argv[]) {
// Allocate input matrix and fill from input arguments
FILE *fPtr; fPtr = fopen(argv[1],"rb");
double *mat_in = malloc(n*m*sizeof(*mat_in));
mat_in = fread(mat_in, sizeof(*mat_in), n*m, fPtr);
fclose(fPtr);
// Allocate output array once at the start:
double *mat_out = malloc(N*M*sizeof(*mat_out));
for (int it = 0, it < 1000, it++) {
interpolation_function(mat_in, mat_out); // Fills mat_out
// Do stuff with mat_out here;
}
free(mat_out);
// Free mat_in, do more stuff and/or exit program
The above two are (at least in my mind) an analogy for the following MATLAB-CUDA hybrid code:
Current (MATLAB): the mexcuda function needs to allocate memory for the interpolation of input(:,:,indx):
accumresult = zeros(N,M);
input = randn(100,100,1000);
for indx = 1:1000
input_slice = mexcuda_interpolation( input(:,:,indx) );
accumresult = accumresult + foo( input_slice, other_parameters);
end
Idea: the memory allocation is moved out of the mexcuda function (and thus out of the core loop), and the mexcuda function only needs to retrieve a pointer to this (writable) array:
accumresult = zeros(N,M,'gpuArray');
placeholder = zeros(N,M,'gpuArray'); % Memory allocated on GPU once here
input = randn(100,100,1000);
for indx = 1:1000
accumresult = accumresult + foo( mexcuda_interpolation(input(:,:,indx), placeholder), other_parameters);
% mexcuda_interpolation() somehow gets a pointer to the preallocated array, which it can write to
end
Note that there is indeed a possibility to parallelise this further; as stated, I am at an intermediate step of parallelising the entire thing.
For your mex code, use mxGPUCreateGPUArray instead of mxGPUCreateFromMxArray; it allocates memory on the device without initializing it.
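A minimal sketch of that pattern (interp_kernel, grid, block, and the output size N x M are placeholders here; single-precision real data assumed):

    #include "mex.h"
    #include "gpu/mxGPUArray.h"

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray const *prhs[]) {
        mxInitGPU();
        mxGPUArray const *in = mxGPUCreateFromMxArray(prhs[0]);
        float const *dev_in = (float const *)(mxGPUGetDataReadOnly(in));

        // Allocate the [N x M] output on the device without initialising it:
        mwSize dims[2] = {N, M};  // N, M assumed known
        mxGPUArray *out = mxGPUCreateGPUArray(2, dims, mxSINGLE_CLASS, mxREAL,
                                              MX_GPU_DO_NOT_INITIALIZE);
        float *dev_out = (float *)(mxGPUGetData(out));  // writable device pointer

        interp_kernel<<<grid, block>>>(dev_in, dev_out);  // placeholder kernel launch

        plhs[0] = mxGPUCreateMxArrayOnGPU(out);
        mxGPUDestroyGPUArray(in);
        mxGPUDestroyGPUArray(out);
    }

Note that the allocation still happens on every call; it is simply left uninitialised, so you skip the cost of zero-filling.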
About your MATLAB code: why are you preallocating in the first place? Make sure you understand the principle behind it, because it changes when GPUs are involved.

In MATLAB, if you don't preallocate, then every time you append new data, what MATLAB does under the hood is create a new array of the new size and copy the data from the smaller old array into it. Of course this is discouraged, as you are doing unnecessary copies all the time.
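For instance, a growing-array loop like this one triggers that reallocate-and-copy on every pass (a deliberately bad sketch):

    x = [];
    for k = 1:1000
        x(end+1) = k;  % allocates a larger array and copies the old contents
    end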
In CUDA, this is not possible: dynamic arrays do not exist. Especially because whatever you are doing does not happen serially in a for loop; it happens "at the same time". Therefore it is fundamental that you know the size of the output before you perform an operation.
So when you have GPU arrays A and B and you apply f() to them, f needs to know the output size. If you write C = f(A,B) in MATLAB, you do not need to preallocate C (in fact, in this example you would not need to even without GPU computing); MATLAB is smart enough to do it for you. So either you need to see why preallocating C in the first snippet below is a waste of time, or your code looks like the second snippet.
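Something along these lines, with f standing in for whatever operation you apply (a minimal sketch; f, A, and B are placeholders):

    A = gpuArray.rand(100);
    B = gpuArray.rand(100);
    C = zeros(size(A), 'gpuArray');  % wasted work: the next line creates C anyway
    C = f(A, B);

versus

    for ii = 1:numel(A)
        C(ii) = f(A(ii), B(ii));  % elementwise calls in a serial loop
    end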
The second pattern fundamentally means that you are not getting any benefit from GPU computing, as you are separating the parallelizable elements.