I'm currently trying to transpose a matrix in OpenCL with memory coalescing.
I've already transposed the matrix in a "simple" way, which worked perfectly fine. When I tried to do the same thing with memory coalescing, I was hoping to see a small improvement in execution time, but my implementation is actually slower than the simple one (the implementation is correct, it's just not efficient). I think I haven't quite understood how to ensure that horizontally neighboring work-items write to horizontally neighboring addresses.
Here is the kernel for my coalesced implementation:
__kernel void MatrixTranspose(__global const float* Matrix,
                              __global float* MatrixTransposed,
                              uint Width, uint Height,
                              __local float* block)
{
    int2 GlobalID  = (int2)(get_global_id(0), get_global_id(1));
    int2 LocalID   = (int2)(get_local_id(0), get_local_id(1));
    int2 groupId   = (int2)(get_group_id(0), get_group_id(1));
    int2 localSize = (int2)(get_local_size(0), get_local_size(1));

    // Coalesced read: work-items adjacent in x read adjacent addresses.
    block[LocalID.y * localSize.x + LocalID.x] = Matrix[GlobalID.y * Width + GlobalID.x];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write back to global memory. Note that work-items adjacent in x
    // write addresses Height floats apart, so these writes are strided.
    MatrixTransposed[Height * (LocalID.x + groupId.x * localSize.x)
                     + Height - (LocalID.y + groupId.y * localSize.y) - 1]
        = block[LocalID.y * localSize.x + LocalID.x];
}
I hope someone can give me some advice, thank you :)
Unfortunately, you are going to be bound by the global read and write speed of the device. Normally you transpose a matrix in order to do some calculation with it, and that calculation helps hide the memory latency. In your example you are reading into local memory, waiting at the barrier, and writing back to global; this only adds an extra step and the complexity of using local memory.
You should do something with the data while it is in local memory if you want to hide the global memory latency.
If all you want to do is transpose the matrix, simply read from global memory and write directly to the target location in global memory. If you still want to try local memory, look into async_work_group_copy.
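For reference, a minimal sketch of that direct approach might look like the following (the kernel name and the 2D NDRange of (Width, Height) are my assumptions, not taken from the question):

// Hypothetical sketch: one work-item per element, no local memory,
// no barrier. Launch with a 2D NDRange of (Width, Height).
__kernel void MatrixTransposeDirect(__global const float* Matrix,
                                    __global float* MatrixTransposed,
                                    uint Width, uint Height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);

    // The read is coalesced along x; the write is strided by Height,
    // but there is no local-memory round trip or barrier to pay for.
    MatrixTransposed[x * Height + y] = Matrix[y * Width + x];
}

This does a plain transpose; your kernel additionally mirrors the output (the Height - y - 1 term), which you would keep by writing to x * Height + (Height - y - 1) instead.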
Now for my actual answer.
Try making each work-item responsible for more than a single float. If a work-item reads a 4x4 region, it can transpose it in private memory. This not only skips local memory and eliminates the need for a barrier, it also reduces the number of work-items you need by a factor of 16.
Steps:
1. Compute the top-left corner of this work-item's 4x4 tile from its global ID.
2. Read the 16 floats (four rows of four) from global memory into private memory.
3. Write the four transposed rows of four back to the corresponding tile of the output.
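A sketch of what such a kernel could look like, as described after the steps above (the vload4/vstore4 layout and the assumption that Width and Height are multiples of 4 are mine):

// Hypothetical sketch: each work-item transposes one 4x4 tile entirely
// in private memory. Assumes Width and Height are multiples of 4 and an
// NDRange of (Width/4, Height/4).
__kernel void MatrixTranspose4x4(__global const float* Matrix,
                                 __global float* MatrixTransposed,
                                 uint Width, uint Height)
{
    // Top-left corner of this work-item's 4x4 tile in the source matrix.
    const int x = get_global_id(0) * 4;
    const int y = get_global_id(1) * 4;

    // Read four rows of four consecutive floats each (coalesced-friendly).
    float4 row0 = vload4(0, Matrix + (y + 0) * Width + x);
    float4 row1 = vload4(0, Matrix + (y + 1) * Width + x);
    float4 row2 = vload4(0, Matrix + (y + 2) * Width + x);
    float4 row3 = vload4(0, Matrix + (y + 3) * Width + x);

    // Transpose the tile in private memory: column i of the input tile
    // becomes row i of the output tile.
    float4 col0 = (float4)(row0.x, row1.x, row2.x, row3.x);
    float4 col1 = (float4)(row0.y, row1.y, row2.y, row3.y);
    float4 col2 = (float4)(row0.z, row1.z, row2.z, row3.z);
    float4 col3 = (float4)(row0.w, row1.w, row2.w, row3.w);

    // The tile lands at row x, column y of the transposed matrix, whose
    // rows are Height floats wide. Each vstore4 writes 4 consecutive floats.
    vstore4(col0, 0, MatrixTransposed + (x + 0) * Height + y);
    vstore4(col1, 0, MatrixTransposed + (x + 1) * Height + y);
    vstore4(col2, 0, MatrixTransposed + (x + 2) * Height + y);
    vstore4(col3, 0, MatrixTransposed + (x + 3) * Height + y);
}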