Copy portion of global array to local memory


I'm using PyOpenCL to let my GPU do some regression on a large data set. Right now the GPU is slower than the CPU, probably because there is a loop that requires access to the global memory during each increment (I think...). The data set is too large to store into the local memory, but each loop does not require the entire data set, so I want to copy a portion of this array to the local memory. My question is: how do I do this? In Python one can easily slice a portion, but I don't think that's possible in OpenCL.

Here's the OpenCL code I'm using. If you spot any more potential optimisations, please shout:

__kernel void gpu_slope(__global double * data, __global double * time, __global int * win_results, const unsigned int N, const unsigned int Nmax, const double e, __global double * result) {
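    /* Per-work-item scalars must be private: __local variables are shared by
       the whole work-group, so every work-item would overwrite the others. */
    unsigned int n, length, leftlim, rightlim, i;
    double sumx, sumy, x, y, xx, xy, invlen, a, b;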

    n = get_global_id(0);

    leftlim = win_results[n*2];
    rightlim = win_results[n*2+1];

    sumx = 0;
    sumy = 0;
    xy = 0;
    xx = 0;
    length = rightlim - leftlim + 1;   /* the loop below includes both limits */

    for(i = leftlim; i <= rightlim; i++) {
        x = time[i];   /* I think this is fetched from global memory */
        y = data[i];
        sumx += x;
        sumy += y;
        xy += x*y;
        xx += x*x;
    }

    invlen = 1.0/length;
    a = xy-(sumx*sumy)*invlen;
    b = xx-(sumx*sumx)*invlen;
    result[n] = a/b;
}

I'm new to OpenCL, so please bear with me. Thanks!

There are 3 answers

einpoklum On BEST ANSWER

The main(ish) point in GPU computing is trying to utilize hardware parallelism as much as possible. Instead of using the loop, launch a kernel with a different thread for every one of the coordinates. Then, either use atomic operations (the quick-to-code, but slow-performance option), or parallel reduction, for the various sums.

AMD has a tutorial on this subject. (NVIDIA does too, but theirs would be CUDA-based...)
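To make the parallel-reduction idea concrete, here is a minimal CPU-side sketch in plain Python that emulates what the work-items of one work-group would do. The names `tree_reduce` and `slope` are made up for illustration; on a real device each pass of the inner loop would be one work-item, with a barrier between steps.

```python
def tree_reduce(values):
    """Sum values pairwise, halving the active range each step,
    the way a work-group would reduce a __local array."""
    buf = list(values)
    # Pad to a power of two so every step pairs elements cleanly.
    size = 1
    while size < len(buf):
        size *= 2
    buf += [0.0] * (size - len(buf))

    stride = size // 2
    while stride > 0:
        # On a GPU, each `lid` is a separate work-item running in parallel,
        # and a barrier separates one stride from the next.
        for lid in range(stride):
            buf[lid] += buf[lid + stride]
        stride //= 2
    return buf[0]


def slope(time, data):
    """Least-squares slope built from four reductions over one window."""
    n = len(time)
    sumx = tree_reduce(time)
    sumy = tree_reduce(data)
    xy = tree_reduce([x * y for x, y in zip(time, data)])
    xx = tree_reduce([x * x for x in time])
    return (xy - sumx * sumy / n) / (xx - sumx * sumx / n)
```

The reduction needs about log2(n) steps instead of n, which is where the speed-up over the serial loop would come from, assuming the windows are long enough to keep a work-group busy.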

benshope On

You will find examples of copying to local memory in PyOpenCL's examples folder: https://github.com/inducer/pyopencl/tree/master/examples. I recommend you read, run, and customize several of these examples to learn.

I also recommend the Udacity parallel programming course: https://www.udacity.com/course/cs344. This course will help solidify your grasp of fundamental OpenCL concepts.

Nick Hockings On

What you would need to do is (1) set a __local space argument for the kernel, (2) use the kernel's work-items to load the required patch of data into the __local array (efficiently, without repetition between threads), and (3) access the __local array for the computation you wish to perform.

For more information see section 4.6.1 'local arguments' and 8.3.3 'local space arguments' in 'OpenCL in Action' by Matthew Scarpino, available online with example code at https://www.manning.com/books/opencl-in-action. (I have no connection to this publication, other than studying it myself.)
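The three steps can be sketched with PyOpenCL roughly as follows. This is an illustration under assumptions, not tested code from the question: the kernel name `slope_tiled`, the helper names, one work-group per window, and a device with double-precision support are all assumptions. Work-item 0 does the summation serially to keep the sketch short, so it shows the mechanics of staging data in __local memory rather than a tuned kernel.

```python
KERNEL_SRC = r"""
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// One work-group per window. tile_t and tile_d are __local scratch arrays
// that the host sizes to hold one double per work-item (step 1).
__kernel void slope_tiled(__global const double *data,
                          __global const double *time,
                          __global const int *win,
                          __global double *result,
                          __local double *tile_t,
                          __local double *tile_d)
{
    const uint w = get_group_id(0);
    const uint lid = get_local_id(0);
    const uint lsz = get_local_size(0);
    const uint left = win[2 * w];
    const uint right = win[2 * w + 1];          // inclusive limit
    double sumx = 0, sumy = 0, xy = 0, xx = 0;  // used by work-item 0 only

    for (uint base = left; base <= right; base += lsz) {
        // Step 2: each work-item copies at most one element of the tile.
        if (base + lid <= right) {
            tile_t[lid] = time[base + lid];
            tile_d[lid] = data[base + lid];
        }
        barrier(CLK_LOCAL_MEM_FENCE);  // tile is now visible to the group

        // Step 3: compute from __local memory (serially here, for brevity).
        if (lid == 0) {
            const uint m = min(lsz, right - base + 1);
            for (uint k = 0; k < m; k++) {
                const double x = tile_t[k], y = tile_d[k];
                sumx += x; sumy += y; xy += x * y; xx += x * x;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);  // do not overwrite a tile still in use
    }
    if (lid == 0) {
        const double inv = 1.0 / (right - left + 1);
        result[w] = (xy - sumx * sumy * inv) / (xx - sumx * sumx * inv);
    }
}
"""


def slope_reference(time, data, left, right):
    """Pure-Python check: slope over the inclusive window [left, right]."""
    xs, ys = time[left:right + 1], data[left:right + 1]
    n = len(xs)
    sumx, sumy = sum(xs), sum(ys)
    xy = sum(x * y for x, y in zip(xs, ys))
    xx = sum(x * x for x in xs)
    return (xy - sumx * sumy / n) / (xx - sumx * sumx / n)


def slopes_on_device(time, data, windows, group_size=64):
    """Run the kernel; needs pyopencl, numpy and an fp64-capable device."""
    import numpy as np
    import pyopencl as cl

    time = np.ascontiguousarray(time, dtype=np.float64)
    data = np.ascontiguousarray(data, dtype=np.float64)
    win = np.ascontiguousarray(windows, dtype=np.int32)  # shape (n_windows, 2)
    result = np.empty(len(win), dtype=np.float64)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    d_data = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=data)
    d_time = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=time)
    d_win = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=win)
    d_result = cl.Buffer(ctx, mf.WRITE_ONLY, result.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    # Step 1: cl.LocalMemory only reserves bytes; nothing is copied from the host.
    tile_bytes = group_size * np.dtype(np.float64).itemsize
    prg.slope_tiled(queue, (len(win) * group_size,), (group_size,),
                    d_data, d_time, d_win, d_result,
                    cl.LocalMemory(tile_bytes), cl.LocalMemory(tile_bytes))
    cl.enqueue_copy(queue, result, d_result)
    return result
```

Because each element is read from global memory only once here, the local copy by itself is unlikely to be the big win. It would pay off mainly when combined with a parallel reduction over the tile, or when several work-items reuse the same staged data.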