I'm currently trying to transpose a matrix in OpenCL with memory coalescing.
I've already transposed the matrix in a "simple" way, which worked perfectly fine. When I tried to do the same thing with memory coalescing, I was hoping to see a small improvement in execution time, but my implementation is actually slower than the simple one (the implementation is correct, it's just not efficient). I think I haven't quite understood how to ensure that horizontally neighboring work-items write to horizontally neighboring addresses.
Here is the kernel for my coalesced implementation:
__kernel void MatrixTranspose(__global const float* Matrix,
                              __global float* MatrixTransposed,
                              uint Width, uint Height,
                              __local float* block)
{
    int2 GlobalID  = (int2)(get_global_id(0), get_global_id(1));
    int2 LocalID   = (int2)(get_local_id(0), get_local_id(1));
    int2 groupId   = (int2)(get_group_id(0), get_group_id(1));
    int2 localSize = (int2)(get_local_size(0), get_local_size(1));

    // Coalesced read: work-items adjacent in x read adjacent addresses.
    block[LocalID.y * localSize.x + LocalID.x] = Matrix[GlobalID.y * Width + GlobalID.x];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write back to global memory. Note that work-items adjacent in x
    // write addresses Height floats apart, so these writes are strided.
    MatrixTransposed[Height * (LocalID.x + groupId.x * localSize.x)
                     + Height - (LocalID.y + groupId.y * localSize.y) - 1]
        = block[LocalID.y * localSize.x + LocalID.x];
}
I hope someone can give me some advice, thank you :)
Unfortunately, you are going to be bound by the global read and write speed of the device. Normally you transpose a matrix in order to do some calculation with it, and that calculation helps hide the memory latency. In your example you are reading into local memory, waiting at the barrier, and writing back to global; this only adds an extra step and the complexity of using local memory.
You should do something with the data while it is in local memory if you want to hide the global memory latency.
If all you want to do is transpose the matrix, simply read from global memory and write directly to the target location in global memory. If you still want to try local memory, look into async_work_group_copy.
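For reference, a minimal sketch of that direct approach might look like the following (the kernel name and the 2D NDRange of (Width, Height) are my assumptions, not taken from the question):

// Hypothetical sketch: one work-item per element, no local memory,
// no barrier. Launch with a 2D NDRange of (Width, Height).
__kernel void MatrixTransposeDirect(__global const float* Matrix,
                                    __global float* MatrixTransposed,
                                    uint Width, uint Height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);

    // The read is coalesced along x; the write is strided by Height,
    // but there is no local-memory round trip or barrier to pay for.
    MatrixTransposed[x * Height + y] = Matrix[y * Width + x];
}

This does a plain transpose; your kernel additionally mirrors the output (the Height - y - 1 term), which you would keep by writing to x * Height + (Height - y - 1) instead.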
Now for my actual answer.
Try making each work-item responsible for more than a single float. If a work-item reads a 4x4 region, it can transpose it in private memory. This not only skips local memory and eliminates the need for a barrier, it also reduces the number of work-items you need by a factor of 16.
Steps:
1. Compute the top-left corner of this work-item's 4x4 tile from its global ID.
2. Read the 16 floats (four rows of four) from global memory into private memory.
3. Write the four transposed rows of four back to the corresponding tile of the output.
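A sketch of what such a kernel could look like, as described after the steps above (the vload4/vstore4 layout and the assumption that Width and Height are multiples of 4 are mine):

// Hypothetical sketch: each work-item transposes one 4x4 tile entirely
// in private memory. Assumes Width and Height are multiples of 4 and an
// NDRange of (Width/4, Height/4).
__kernel void MatrixTranspose4x4(__global const float* Matrix,
                                 __global float* MatrixTransposed,
                                 uint Width, uint Height)
{
    // Top-left corner of this work-item's 4x4 tile in the source matrix.
    const int x = get_global_id(0) * 4;
    const int y = get_global_id(1) * 4;

    // Read four rows of four consecutive floats each (coalesced-friendly).
    float4 row0 = vload4(0, Matrix + (y + 0) * Width + x);
    float4 row1 = vload4(0, Matrix + (y + 1) * Width + x);
    float4 row2 = vload4(0, Matrix + (y + 2) * Width + x);
    float4 row3 = vload4(0, Matrix + (y + 3) * Width + x);

    // Transpose the tile in private memory: column i of the input tile
    // becomes row i of the output tile.
    float4 col0 = (float4)(row0.x, row1.x, row2.x, row3.x);
    float4 col1 = (float4)(row0.y, row1.y, row2.y, row3.y);
    float4 col2 = (float4)(row0.z, row1.z, row2.z, row3.z);
    float4 col3 = (float4)(row0.w, row1.w, row2.w, row3.w);

    // The tile lands at row x, column y of the transposed matrix, whose
    // rows are Height floats wide. Each vstore4 writes 4 consecutive floats.
    vstore4(col0, 0, MatrixTransposed + (x + 0) * Height + y);
    vstore4(col1, 0, MatrixTransposed + (x + 1) * Height + y);
    vstore4(col2, 0, MatrixTransposed + (x + 2) * Height + y);
    vstore4(col3, 0, MatrixTransposed + (x + 3) * Height + y);
}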