OpenCL: duplicate a memory object on the device


Background: I have a kernel called "buildlookuptable" which does some calculation and stores its result in an int array called "dense_id".

Creating the cl_mem object:

cl_mem dense_id = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);

Setting the kernel argument:

errWrapper("setKernel", clSetKernelArg(kernel_buildLookupTable, 5, sizeof(cl_mem), &dense_id));

dense_id is used in other kernels afterwards. Due to poor memory alignment I see a huge drop in performance.

The following kernel accesses dense_id like this:

result_tuples += (dense_id[bucket+1] - dense_id[bucket]);

Execution time: 66 ms, no compiler-based vectorization.

However, if I change the line to:

result_tuples += (dense_id[bucket] - dense_id[bucket]);

Execution time: 2 ms, vectorized (4) by the compiler.

Both kernels ran on a GeForce 660 Ti.

So if I remove the overlapping memory access, the speed increases greatly: thread N accesses only element N, with no overlap.

To get correct results, I would like to duplicate the cl_mem object dense_id, so that the line in the following kernel becomes:

result_tuples += (dense_id1[bucket+1] - dense_id2[bucket]);

where dense_id1 and dense_id2 are identical. Another idea would be to shift the contents of dense_id1 by one element, so that the kernel line becomes:

result_tuples += (dense_id1[bucket] - dense_id2[bucket]);

As dense_id is a small memory object, I am sure I could improve my execution time, at the cost of memory, by copying it.

Question: After the kernel "buildlookuptable" has executed, I would like to duplicate the result array dense_id on the device side. The straightforward way would be to call clEnqueueReadBuffer on the host to fetch dense_id, create a new cl_mem object and push it back to the device. Is there a way to duplicate dense_id after "buildlookuptable" has finished, without copying it to the host first?

If requested, I can add more code here. I tried to include only the required parts, as I don't want to drown you in irrelevant code.
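The three access patterns discussed above can be compared with a small host-side reference in plain C. This is only a sketch for checking that the duplicated and the shifted layout return the same values as the original line; it is not kernel code, and the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Original access pattern: one array, neighbouring elements. */
int diff_single(const int *dense_id, size_t bucket) {
    return dense_id[bucket + 1] - dense_id[bucket];
}

/* Two identical copies: same indices as above, but two separate arrays. */
int diff_two_copies(const int *dense_id1, const int *dense_id2, size_t bucket) {
    return dense_id1[bucket + 1] - dense_id2[bucket];
}

/* Shifted copy: shifted[i] == dense_id[i + 1], so both reads use `bucket`. */
int diff_shifted(const int *shifted, const int *dense_id, size_t bucket) {
    return shifted[bucket] - dense_id[bucket];
}

/* Build the shifted copy on the host; dst must hold n - 1 elements. */
void make_shifted_copy(int *dst, const int *src, size_t n) {
    memcpy(dst, src + 1, (n - 1) * sizeof(int));
}
```

For any valid bucket index, all three functions should agree, which is what the rewritten kernel line relies on.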

Answer by Käptn Freiversuch (best answer)

I tried the solution with the clEnqueueCopyBuffer command, which works as desired. The solution to my problem is:

clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3, 0, 0, (inCount1 + 1) * sizeof(int), 0, NULL, NULL);

This makes it possible to duplicate a memory object on the device side only, without using another kernel.

To do so, you must first create another cl_mem object on the host side:

cl_mem count_buffer3 = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1 + 1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);

As I had to wait for the copy to finish, I used

clFinish(command_queue);

to make the program wait for its completion.
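The offset and size arguments of clEnqueueCopyBuffer are byte counts, not element indices, and the call fails with CL_INVALID_VALUE if either region would run past its buffer. The following plain C sketch models only that bookkeeping on host memory; copy_region is a hypothetical helper, not part of OpenCL, and it does not touch a device.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model of a buffer-to-buffer copy with byte offsets.
   Returns 0 on success and -1 if either region would run past its buffer
   (the situation in which the OpenCL call reports CL_INVALID_VALUE). */
int copy_region(const void *src, size_t src_bytes,
                void *dst, size_t dst_bytes,
                size_t src_offset, size_t dst_offset, size_t size)
{
    if (src_offset > src_bytes || size > src_bytes - src_offset) return -1;
    if (dst_offset > dst_bytes || size > dst_bytes - dst_offset) return -1;
    memcpy((char *)dst + dst_offset, (const char *)src + src_offset, size);
    return 0;
}
```

Under this model, a plain duplicate uses offsets 0 and 0 with the full size in bytes, while a copy shifted by one element would use a source offset of sizeof(int) and a size that is one element smaller.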

As hinted by DarkZeros, the performance gain from the duplicate was zero, because the compiler had optimized the line

result_tuples += (dense_id[bucket] - dense_id[bucket]);

to 0, so the 2 ms measurement did not come from better memory access.

Thank you for your insights so far!
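Why that line can be folded away is easy to see in plain C. The sketch below is a host-side illustration with hypothetical function names, not the original kernel: the second function subtracts a value from itself, so its result does not depend on the array contents at all.

```c
#include <stddef.h>

/* Intended computation: difference of neighbouring entries. */
int bucket_size(const int *dense_id, size_t bucket) {
    return dense_id[bucket + 1] - dense_id[bucket];
}

/* "Fast" variant from the question: x - x is 0 for every input,
   so a compiler is free to drop both loads entirely. */
int bucket_size_degenerate(const int *dense_id, size_t bucket) {
    (void)dense_id; /* the result never depends on the data */
    (void)bucket;
    return 0;
}
```

A timing taken with the degenerate variant therefore measures a kernel that no longer reads dense_id, which is not a fair baseline for the real access pattern.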