I'm using PyOpenCL to let my GPU do some regression on a large data set. Right now the GPU is slower than the CPU, probably because there is a loop that accesses global memory on every iteration (I think...). The data set is too large to fit in local memory, but each loop only needs a portion of it, so I want to copy that portion of the array into local memory. My question is: how do I do this? In Python one can easily slice out a portion, but I don't think that's possible in OpenCL.
Here's the OpenCL code I'm using; if you spot any other potential optimisations, please shout:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void gpu_slope(__global double * data, __global double * time, __global int * win_results, const unsigned int N, const unsigned int Nmax, const double e, __global double * result) {
    /* These are per-work-item values, so they must live in private
       memory (the default), not __local: __local variables are shared
       by the whole work-group, so every work-item would race on them. */
    unsigned int n, length, leftlim, rightlim, i;
    double sumx, sumy, x, y, xx, xy, invlen, a, b;

    n = get_global_id(0);
    if (n >= N) return; /* assuming N is the number of windows */
    leftlim = win_results[n*2];
    rightlim = win_results[n*2+1];
    sumx = 0;
    sumy = 0;
    xy = 0;
    xx = 0;
    length = rightlim - leftlim + 1; /* the loop below is inclusive of rightlim */
    for(i = leftlim; i <= rightlim; i++) {
        x = time[i]; /* I think this is fetched from global memory */
        y = data[i];
        sumx += x;
        sumy += y;
        xy += x*y;
        xx += x*x;
    }
    invlen = 1.0/length;
    a = xy-(sumx*sumy)*invlen;
    b = xx-(sumx*sumx)*invlen;
    result[n] = a/b;
}
I'm new to OpenCL, so please bear with me. Thanks!
The main(ish) point of GPU computing is to exploit hardware parallelism as much as possible. Instead of looping over the whole window inside a single thread, launch the kernel with a separate thread for each coordinate. Then compute the various sums either with atomic operations (the quick-to-code, but slow-performance option) or with a parallel reduction, as sketched below.
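Here is a minimal sketch of the reduction variant, assuming one work-group per window, a fixed work-group size WG, and double support; the kernel name gpu_slope_reduce and the local buffer names are made up for illustration:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

#define WG 256 /* must match the local work size used at launch */

__kernel void gpu_slope_reduce(__global const double * data, __global const double * time, __global const int * win_results, __global double * result) {
    __local double lsumx[WG], lsumy[WG], lxy[WG], lxx[WG];

    const unsigned int win = get_group_id(0); /* one work-group per window */
    const unsigned int lid = get_local_id(0);
    const unsigned int leftlim = win_results[win*2];
    const unsigned int rightlim = win_results[win*2+1];

    /* Each work-item accumulates a strided slice of the window. */
    double sumx = 0.0, sumy = 0.0, xy = 0.0, xx = 0.0;
    for (unsigned int i = leftlim + lid; i <= rightlim; i += WG) {
        const double x = time[i];
        const double y = data[i];
        sumx += x; sumy += y; xy += x*y; xx += x*x;
    }
    lsumx[lid] = sumx; lsumy[lid] = sumy; lxy[lid] = xy; lxx[lid] = xx;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction in local memory: halve the active work-items each step. */
    for (unsigned int s = WG/2; s > 0; s >>= 1) {
        if (lid < s) {
            lsumx[lid] += lsumx[lid + s];
            lsumy[lid] += lsumy[lid + s];
            lxy[lid]   += lxy[lid + s];
            lxx[lid]   += lxx[lid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        const double invlen = 1.0 / (double)(rightlim - leftlim + 1);
        const double a = lxy[0] - lsumx[0]*lsumy[0]*invlen;
        const double b = lxx[0] - lsumx[0]*lsumx[0]*invlen;
        result[win] = a/b;
    }
}

Launch it with a global size of num_windows * WG and a local size of WG; each group then produces one entry of result.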
AMD has a tutorial on this subject. (NVIDIA does too, but theirs would be CUDA-based...)
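As for the local-memory "slice" the question asks about: there is no slicing syntax, but a work-group can cooperatively copy a tile of a global array into __local memory, either by hand (strided loads plus a barrier) or with the built-in async_work_group_copy. A sketch of the async variant, where TILE, stage_tiles and the buffer names are illustrative assumptions:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

#define TILE 256 /* elements staged per pass; sized to fit in local memory */

__kernel void stage_tiles(__global const double * time, __global const double * data, const unsigned int leftlim, const unsigned int rightlim) {
    __local double ltime[TILE];
    __local double ldata[TILE];

    for (unsigned int base = leftlim; base <= rightlim; base += TILE) {
        const unsigned int len = min((unsigned int)TILE, rightlim - base + 1);

        /* Every work-item in the group must reach these calls with the
           same arguments; chaining the event lets one wait cover both copies. */
        event_t ev = async_work_group_copy(ltime, time + base, len, 0);
        ev = async_work_group_copy(ldata, data + base, len, ev);
        wait_group_events(1, &ev);

        /* ... work on ltime[0..len-1] / ldata[0..len-1] here; they now
           live in fast local memory instead of global memory ... */

        barrier(CLK_LOCAL_MEM_FENCE); /* finish with the tile before it is overwritten */
    }
}

Bear in mind that staging only pays off when the work-items of a group actually reuse the same elements (e.g. heavily overlapping windows); if every element is read exactly once, coalesced reads straight from global memory are often just as fast.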