Background: I have a kernel called "buildlookuptable" that does some calculation and stores its result in an int array called "dense_id".
Creating the cl_mem object:
cl_mem dense_id = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);
Setting the kernel argument:
errWrapper("setKernel", clSetKernelArg(kernel_buildLookupTable, 5, sizeof(cl_mem), &dense_id));
dense_id is used in other kernels afterwards. Due to poor memory alignment, I see a huge drop in performance.
The following kernel accesses dense_id like this:
result_tuples += (dense_id[bucket+1] - dense_id[bucket]);
Execution time: 66 ms, no compiler-based vectorization.
However, if I change the line to:
result_tuples += (dense_id[bucket] - dense_id[bucket]);
Execution time: 2 ms, vectorized (4) by the compiler.
Both kernels ran on a GeForce 660 Ti.
So if I remove the overlapping memory access, the speed increases greatly: thread N accesses element N only, with no overlap.
To get correct results, I would like to duplicate the cl_mem object dense_id. The line in the following kernel would then be:
result_tuples += (dense_id1[bucket+1] - dense_id2[bucket]);
Here dense_id1 and dense_id2 are identical. Another idea would be to shift the contents of dense_id1 by one element, so that the kernel line would be:
result_tuples += (dense_id1[bucket] - dense_id2[bucket]);
As dense_id is a small memory object, I am sure I could improve my execution time, at the cost of memory, by copying it.
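To make the shifted-copy idea concrete, here is a small host-side sketch in plain C (not OpenCL). The function names sum_adjacent, make_shifted and sum_shifted are made up for illustration, and the sketch assumes dense_id holds one more element than there are buckets. It only shows that reading a shifted copy at the same index gives the same result as reading dense_id at bucket+1; it says nothing about GPU performance.

```c
#include <stddef.h>

/* Original access pattern: each bucket reads two neighbouring elements.
 * Assumes dense_id holds at least buckets + 1 elements. */
long sum_adjacent(const int *dense_id, size_t buckets)
{
    long result_tuples = 0;
    for (size_t bucket = 0; bucket < buckets; ++bucket)
        result_tuples += dense_id[bucket + 1] - dense_id[bucket];
    return result_tuples;
}

/* Build the shifted copy: shifted[i] = dense_id[i + 1]. */
void make_shifted(const int *dense_id, int *shifted, size_t buckets)
{
    for (size_t i = 0; i < buckets; ++i)
        shifted[i] = dense_id[i + 1];
}

/* Shifted variant: both arrays are read at the same index. */
long sum_shifted(const int *dense_id, const int *shifted, size_t buckets)
{
    long result_tuples = 0;
    for (size_t bucket = 0; bucket < buckets; ++bucket)
        result_tuples += shifted[bucket] - dense_id[bucket];
    return result_tuples;
}
```

Both functions return the same sum for the same input, so the shifted layout keeps the results correct while each work item touches index bucket only.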
Question:
After the kernel execution of "buildlookuptable" I would like to duplicate the result array dense_id on the device side.
The straightforward way would be to use clEnqueueReadBuffer on the host side to fetch dense_id, create a new cl_mem object and push it back to the device.
Is there a way to duplicate dense_id after "buildlookuptable" finished, without copying it to the host again?
If requested, I can add more code here. I tried to include only the required parts, as I don't want to drown you in irrelevant code.
I tried the solution with the clEnqueueCopyBuffer command, which works as desired. The solution to my problem is:
It is possible to duplicate a memory object on the device side only, without using another kernel.
To do so, you must first create another cl_mem object on the host side:
As I had to wait for the copy to finish, I used
to let the program wait for its termination.
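The steps described above could look roughly like the following sketch. It is not the exact code from my program: duplicate_buffer is a hypothetical helper name, and waiting via clWaitForEvents is only one option (clFinish on the queue would also block until the copy is done). It assumes the context, the command queue and the source buffer already exist.

```c
#include <stddef.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Hypothetical helper: returns a new device buffer holding a copy of src,
 * or NULL on failure (the error code is stored in *err).
 * The caller releases the result with clReleaseMemObject. */
cl_mem duplicate_buffer(cl_context context, cl_command_queue queue,
                        cl_mem src, size_t bytes, cl_int *err)
{
    /* Allocate the destination on the device; no host pointer is involved. */
    cl_mem dst = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, err);
    if (*err != CL_SUCCESS)
        return NULL;

    /* Device-to-device copy: source offset 0, destination offset 0. */
    cl_event copy_done;
    *err = clEnqueueCopyBuffer(queue, src, dst, 0, 0, bytes,
                               0, NULL, &copy_done);
    if (*err != CL_SUCCESS) {
        clReleaseMemObject(dst);
        return NULL;
    }

    /* Block the host until the copy has finished. */
    *err = clWaitForEvents(1, &copy_done);
    clReleaseEvent(copy_done);
    if (*err != CL_SUCCESS) {
        clReleaseMemObject(dst);
        return NULL;
    }
    return dst;
}
```

With the buffer from above, a call might read: cl_mem dense_id2 = duplicate_buffer(context, queue, dense_id, inCount1 * sizeof(int), &err); where queue is an assumed name for the command queue. For the shifted variant, passing sizeof(int) as the source offset and (inCount1 - 1) * sizeof(int) as the size to clEnqueueCopyBuffer should copy everything but the first element. On an in-order queue, kernels enqueued after the copy run after it anyway, so the explicit wait is mainly useful for timing or when the host needs to know the copy is complete.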
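As hinted by DarkZeros, the performance gain was 0, because the compiler optimized the line
result_tuples += (dense_id[bucket] - dense_id[bucket]);
to 0, so the fast version never read dense_id at all.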
Thank you for your insights so far!