OpenCL (JOCL) - 2D calculus over two arrays in Kernel

954 views Asked by At

I'm asking this here because I thought I've understood how OpenCL works but... I think there are several things I don't get.

What I want to do is to get the difference between all the values of two arrays, then calculate the hypot and finally get the maximum hypot value, so If I have:

double[] arrA = new double[]{1,2,3}
double[] arrB = new double[]{6,7,8}

Calculate
dx1 = 1 - 1; dx2 = 2 - 1; dx3 = 3 - 1, dx4= 1 - 2;... dxLast = 3 - 3
dy1 = 6 - 6; dy2 = 7 - 6; dy3 = 8 - 6, dy4= 6 - 7;... dyLast = 8 - 8

(Extreme dx and dy will get 0, but i don't care about ignoring those cases by now)

Then calculate each hypot based on hypot(dx(i), dy(i)) And once all these values where obtained, get the maximum hypot value

So, I have the next defined Kernel:

String programSource =
    "#ifdef cl_khr_fp64 \n"
+ "   #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ "   #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else "
+ "   #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ "             __global const double *bufferY,"
+ "             __local double* scratch,"
+ "             __global double* result,"
+ "             __const int lengthX,"
+ "             __const int lengthY){"
+ "    const int index_a = get_global_id(0);"//Get the global indexes for 2D reference
+ "    const int index_b = get_global_id(1);"
+ "    const int local_index = get_local_id(0);"//Current thread id -> Should be the same as index_a * index_b + index_b;
+ "    if (local_index < (lengthX * lengthY)) {"// Load data into local memory
+ "       if(index_a < lengthX && index_b < lengthY)"
+ "       {"
+ "           double dx = (bufferX[index_b] - bufferX[index_a]);"
+ "           double dy = (bufferY[index_b] - bufferY[index_a]);"
+ "           scratch[local_index] = hypot(dx, dy);"
+ "       }"
+ "    } "
+ "    else {"
+ "       scratch[local_index] = 0;"// Infinity is the identity element for the min operation
+ "    }"
//Make a Barrier to make sure all values were set into the local array
+ "    barrier(CLK_LOCAL_MEM_FENCE);"
//If someone can explain to me the offset thing I'll really apreciate that...
//I just know there is alway a division by 2
+ "    for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {"
+ "       if (local_index < offset) {"
+ "          float other = scratch[local_index + offset];"
+ "          float mine = scratch[local_index];"
+ "          scratch[local_index] = (mine > other) ? mine : other;"
+ "       }"
+ "       barrier(CLK_LOCAL_MEM_FENCE);"
//A barrier to make sure that all values where checked
+ "    }"
+ "    if (local_index == 0) {"
+ "       result[get_group_id(0)] = scratch[0];"
+ "    }"
+ "}";

For this case, the defined GWG size is (100, 100, 0) and a LWI size of (10, 10, 0).

So, for this example, both arrays have size 10 and the GWG and LWI are obtained as follows:

//clGetKernelWorkGroupInfo(kernel, device, CL.CL_KERNEL_WORK_GROUP_SIZE, Sizeof.size_t, Pointer.to(buffer), null);
long kernel_work_group_size = OpenClUtil.getKernelWorkGroupSize(kernel, device.getCl_device_id(), 3);
//clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, Sizeof.size_t * numValues, Pointer.to(buffer), null);
long[] maxSize = device.getMaximumSizes();

maxSize[0] = ( kernel_work_group_size > maxSize[0] ? maxSize[0] : kernel_work_group_size);
maxSize[1] = ( kernel_work_group_size > maxSize[1] ? maxSize[1] : kernel_work_group_size);
maxSize[2] = ( kernel_work_group_size > maxSize[2] ? maxSize[2] : kernel_work_group_size);
//    maxSize[2] = 

long xMaxSize = (x > maxSize[0] ? maxSize[0] : x);
long yMaxSize = (y > maxSize[1] ? maxSize[1] : y);
long zMaxSize = (z > maxSize[2] ? maxSize[2] : z);

long local_work_size[] = new long[] { xMaxSize, yMaxSize, zMaxSize };

int numWorkGroupsX = 0;
int numWorkGroupsY = 0;
int numWorkGroupsZ = 0;

if(local_work_size[0] != 0)
  numWorkGroupsX = (int) ((total + local_work_size[0] - 1) / local_work_size[0]);

if(local_work_size[1] != 0)
  numWorkGroupsY = (int) ((total + local_work_size[1] - 1) / local_work_size[1]);

if(local_work_size[2] != 0)
  numWorkGroupsZ = (int) ((total + local_work_size[2] - 1) / local_work_size[2]);

long global_work_size[] = new long[] { numWorkGroupsX * local_work_size[0],
    numWorkGroupsY * local_work_size[1], numWorkGroupsZ *  local_work_size[2]};

The thing is I'm not getting the espected values so I decided to make some tests based on a smaller kernel and changing the [VARIABLE TO TEST VALUES] object returned in a result array:

/**
* The source code of the OpenCL program to execute
*/
private static String programSourceA =
    "#ifdef cl_khr_fp64 \n"
+ "   #pragma OPENCL EXTENSION cl_khr_fp64 : enable \n"
+ "#elif defined(cl_amd_fp64) \n"
+ "   #pragma OPENCL EXTENSION cl_amd_fp64 : enable \n"
+ "#else "
+ "   #error Double precision floating point not supported by OpenCL implementation.\n"
+ "#endif \n"
+ "__kernel void "
+ "sampleKernel(__global const double *bufferX,"
+ "             __global const double *bufferY,"
+ "             __local double* scratch,"
+ "             __global double* result,"
+ "             __const int lengthX,"
+ "             __const int lengthY){"
//Get the global indexes for 2D reference
+ "    const int index_a = get_global_id(0);"
+ "    const int index_b = get_global_id(1);"
//Current thread id -> Should be the same as index_a * index_b + index_b;
+ "    const int local_index = get_local_id(0);"
// Load data into local memory
//Only print values if index_a < ArrayA length
//Only print values if index_b < ArrayB length
//Only print values if local_index < (lengthX * lengthY)
//Only print values if this is the first work group.
+ "    if (local_index < (lengthX * lengthY)) {"
+ "       if(index_a < lengthX && index_b < lengthY)"
+ "       {"
+ "           double dx = (bufferX[index_b] - bufferX[index_a]);"
+ "           double dy = (bufferY[index_b] - bufferY[index_a]);"
+ "           result[local_index] = hypot(dx, dy);"
+ "       }"
+ "    } "
+ "    else {"
// Infinity is the identity element for the min operation
+ "       result[local_index] = 0;"
+ "    }"

The returned values are far of being the espected but, if the [VARIABLE TO TEST VALUES] is (index_a * index_b) + index_a, almost each value of the returned array has the correct (index_a * index_b) + index_a value, i mean:

result[0] -> 0
result[1] -> 1
result[2] -> 2
....
result[97] -> 97
result[98] -> 98
result[99] -> 99

but some values are: -3.350700319577517E-308....

What I'm not doing correctly???

I hope this is well explained and not that big to make you angry with me....

Thank you so much!!!!!

TomRacer

2

There are 2 answers

2
TomRacer On

thank you for your answer, first of all this kernel code is based on the commutative reduction code explained here: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/. So I'm using that code but I added some things like the 2D operations.

Regarding to the point you mentioned before:

1.1- Actually the global work group size is (100, 100, 0)... That 100 is a result of multiplying 10 x 10 where 10 is the current array size, so my global work group size is based on this rule... then the local work item size is (10, 10, 0). Global work group size must be a multiple of local work item size, I have read this in many examples and I think this is ok.

1.2- In my test code I'm using the same arrays, in fact if I change the array size GWG size and LWI size will change dinamically.

2.1- There are not so many "if" there, there are just 3 "if", the first one checks when I must compute the hypot() based on the array objects or fill that object with zero. The second and third "if"s are just part of the reduction algorithm that seems to be fine.

2.2- Regarding to the lengthX and lengthY yeah you are right but I haven't got that yet, how should I use that??

3.1- Yeah, I know that, but I realized that I'm not using the Y axis id so maybe there is another problem here.

3.2- The reduction algorithm iterates over each pair of elements stored in the scratch variable and checking for the maximum value between them, so for each "for" that it does it is reducing the elements to be computed to the half of the previous quantity.

Also I'm going to post a some changes on the main kernel code and in the test kernel code because there where some errors.

Greetings...!!!

0
DarkZeros On

You have many problems in your code, and some of them are concept related. I think you should read the standard or OpenCL guide completely before starting to code. Because some of the system calls you are using have a different behaviour that what you expect.

  1. Work-groups and work-items are NOT like CUDA. If you want 100x100 work-items, separated into 10x10 work-groups you use as global-size (100x100) and local-size(10x10). Unlike CUDA, where the global work item is multiplied by the local size internally.

    1.1. In your test code, if you are using 10x10 with 10x10. Then you are not filling the whole space, the non filled area will still have garbage like -X.xxxxxE-308.

  2. You should not use lengthX and lengthY and put a lot of ifs in your code. OpenCL has a method to call the kernels with offsets and with specific number of items, so you can control this from the host side. BTW doing this is a performance loss and is never a good practice since the code is less readable.

  3. get_local_size(0) gives you the local size of axis 0 (10 in your case). What is what you do not understand in this call? Why do you divide it by 2 always?

I hope this can help you in your debugging process. Cheers