I am working on a matrix-transpose in-place-implementation, with the following idea: Let the indices only take values that correspond to upper triangular entries (since diagonal ones stay constant during the transformation) and exchange values A[i,j] and A[j,i] within one thread via a temporary integer.
I am having trouble to assign the correct next index pair A[i_next,j_next] to the threat currently working on A[i,j] (this corresponds to the last two lines in the code snippet). How do I have to modify these two lines in order to access only the correct matrix entries for each threat?
Example:
for N=5; total-threads=3;
This is a 5x5 matrix, where the entries hold the thread_id of the threat they shall be accessed by. Those belonging to the first thread are italic as well (just to show which indexes I want row_idx, col_idx to assume for that specific thread.
| Col0 | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| / | 1 | 2 | 3 | 1 |
| / | / | 2 | 3 | 1 |
| / | / | / | 2 | 3 |
| / | / | / | / | 1 |
| / | / | / | / | / |
__global__ void in_place_transpose(double *A, int N)
{
int t_idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int total_threads = blockDim.x * gridDim.x;
int row_idx = t_idx / N;
int col_idx = t_idx % N;
// only access upper triangular entries
if (row_idx < col_idx)
{
while (row_idx < (N - 1) && col_idx < N) // A[N-1,N] is the last entry we change
{
int to = row_idx * N + col_idx;
int from = col_idx * N + row_idx;
if (from != to) // daigonal is constant
{
// printf("from: %d, to: %d \n", from, to); // show operation pair
// printf("th: %d, row: %d, col: %d \n", t_idx, row_idx, col_idx); // show indizes
int temp = A[to];
A[to] = A[from];
A[from] = temp;
}
// assign next matrix entry to thread (FAULTY)
row_idx = (int)((row_idx + total_threads) % (N - row_idx - 1));
col_idx = (int)(col_idx + (total_threads % (N - row_idx - 1)) % (N - row_idx - 1));
}
}
}
With that preamble, I believe what you are asking for is how to convert a linear index into a 2-dimensional triangular index.
That problem can be solved without using iteration, recursion, or floating-point arithmetic, or square-root or other advanced functions, just addition, subtraction, multiplication, and division.
Once we reduce the problem to one that can be solved using a linear index, then we can apply methodologies like grid-stride loop on a linear index, with conversion to 2D triangular "on the fly" to allow an arbitrary number of threads to work on and complete the problem.
Based on those ideas, here is one possible approach:
I don't see much point in going deeply into performance. A naive transpose such as what you have shown can have one "coalesced" half but will then have the other half (badly) "uncoalesced". As far as that goes, this algorithm isn't any different. The linear-to-triangular conversion will still tend to create indices such that adjacent threads will often be loading (or storing, depending on how you write it) adjacent elements, but of course the staggered pattern of the triangular data patch introduces discontinuities in the adjacency pattern. And if your loads are partly coalesced, your stores won't be.