I have the following problem: given two sorted arrays A and B, I have to produce a sorted array C with the elements of A and B.
I found some solutions to this problem using CUDA, for example Merge Path: http://www.cc.gatech.edu/~bader/papers/GPUMergePath-ICS2012.pdf
However, their problem is driven by the size of the original arrays, at least 10k elements each. My problem is different:
the arrays I have to merge are much smaller (1000 elements at most), and the complexity comes from the number of merges to be done, on the order of 10^10 (roughly 10^5 arrays of size ~1000, merged pairwise with each other).
Part of their problem is to split the original arrays into equally sized parts that are processed in parallel. The arrays I have to merge are small enough to fit entirely in GPU shared memory.
Thrust is not my first choice because the output of my procedure is not the sorted array itself, but a calculation with its elements: so I think that a specialized kernel should be faster than first sorting the element indices and then using them for the calculation.
A serial version of the algorithm looks like:
```python
def windowed_merge(A, B, T=4):
    """Merge sorted lists A and B; each output element is the sum of a
    trailing window of up to T+1 elements (ending at the element just
    selected) from the input array that element came from."""
    N, M = len(A), len(B)
    C = [0] * (N + M)
    i = j = k = 0
    while i < N and j < M:
        if A[i] < B[j]:
            start_i = max(0, i - T)
            C[k] = sum(A[start_i:i + 1])
            i += 1
        else:
            start_j = max(0, j - T)
            C[k] = sum(B[start_j:j + 1])
            j += 1
        k += 1
    while i < N:                     # drain the tail of A
        start_i = max(0, i - T)
        C[k] = sum(A[start_i:i + 1])
        i += 1
        k += 1
    while j < M:                     # drain the tail of B
        start_j = max(0, j - T)
        C[k] = sum(B[start_j:j + 1])
        j += 1
        k += 1
    return C
```
How can I exploit CUDA capabilities to solve this problem?
The two most important optimization goals for any CUDA program should be to:

1. expose (sufficient) parallelism
2. make efficient use of the memory subsystem(s)
There are certainly many other things that can be considered during optimization, but these are the two most important items to address first.
A merge operation (not quite the same as a merge-sort) is, at first glance, an inherently serial operation. We cannot make a proper decision about which item to choose, from either A or B input array, to place next in the output array C, until we have made all previous selections in C. In this respect, the merge algorithm (in this realization) makes it difficult to expose parallelism, and the paper linked in the question is almost entirely focused on that topic.
The goal of the algorithm described in the paper is to decompose the two input vectors A and B into multiple smaller pieces that can be worked on independently, so as to expose parallelism. In particular, the goal is to keep all the SMs in a GPU busy, and all the SPs in each SM busy. Once a sufficient decomposition of work is performed, each SP ultimately performs a sequential merge, as mentioned in the paper.
However, as you've pointed out, what you want to do is somewhat different. You already have many arrays, and you want to perform independent merge operations on those many arrays. Since your array count is ~100,000, this is enough independent pieces of work to consider mapping each to a GPU SP (i.e. thread). This means that we can then, just as in the paper, use a simple sequential merge on each core/SP/thread. So the problem of exposing parallelism is, in your case, already solved (perhaps to a sufficient degree).
At this point we could consider implementing this as-is. The code I show later offers this as a starting point for comparison. However, what we discover is that the performance is not very good, and this is due to the fact that a merge algorithm fundamentally has a data-dependent access sequence, so it is (more) difficult to arrange for coalesced access on the GPU. The authors of the paper propose to mitigate this problem by first reading the data (in a coalesced fashion) into shared memory, and then having the algorithm work on it out of shared memory, where the penalty for disorganized access is less. A simplified sketch of that staging idea follows.
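To make the staging idea concrete, here is a greatly simplified, illustrative sketch (not the paper's actual code): one block per (A,B) pair, both arrays read into shared memory with coalesced accesses, then merged out of shared memory. The paper additionally partitions the merge itself across the block's threads via merge-path; here a single thread does the merge, just to show the staging, and the kernel name is hypothetical:

```cuda
// One block per (A,B) pair; LEN elements per input array.
template <typename T, int LEN>
__global__ void merge_via_smem(const T *A, const T *B, T *C, int num_pairs)
{
    __shared__ T sA[LEN], sB[LEN];
    int pair = blockIdx.x;
    if (pair >= num_pairs) return;
    for (int e = threadIdx.x; e < LEN; e += blockDim.x) {
        sA[e] = A[pair * LEN + e];   // consecutive threads read consecutive
        sB[e] = B[pair * LEN + e];   // addresses: coalesced global loads
    }
    __syncthreads();
    if (threadIdx.x == 0) {          // serial merge out of shared memory,
        int i = 0, j = 0, k = 0;     // where irregular access is cheap
        while (i < LEN && j < LEN)
            C[pair * 2 * LEN + k++] = (sA[i] < sB[j]) ? sA[i++] : sB[j++];
        while (i < LEN) C[pair * 2 * LEN + k++] = sA[i++];
        while (j < LEN) C[pair * 2 * LEN + k++] = sB[j++];
    }
}
```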
I'll propose a different approach: store the many arrays in column-major order, interleaved with each other. Then element i of every array is contiguous in memory, adjacent threads (each merging its own pair of arrays) read adjacent locations, and global-memory accesses coalesce naturally.
Here's a worked example that implements the above idea, running a simple sequential merge in each thread, and each thread merging one of the A vectors with one of the B vectors:
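What follows is a condensed, illustrative sketch of that idea rather than the full timing-and-validation harness; the kernel name `merge_many`, the host setup, and the `stride` parameter (1 for row-major, the number of pairs for column-major) are labels chosen for this sketch:

```cuda
#include <cstdio>

const int NUM_PAIRS = 100000;   // ~10^5 independent merges
const int LEN       = 1000;     // elements per input array
                                // (reduce sizes if GPU memory is tight)

// Sequential merge of two strided arrays; stride == 1 is the row-major
// (contiguous) case, stride == NUM_PAIRS the column-major case.
template <typename T>
__host__ __device__ void sequential_merge(const T *a, const T *b, T *c,
                                          int la, int lb, int stride)
{
    int i = 0, j = 0, k = 0;
    while (i < la && j < lb)
        c[stride * k++] = (a[stride * i] < b[stride * j]) ? a[stride * i++]
                                                          : b[stride * j++];
    while (i < la) c[stride * k++] = a[stride * i++];
    while (j < lb) c[stride * k++] = b[stride * j++];
}

// One thread per (A,B) pair. In the column-major layout, pair idx starts
// at offset idx and its elements are n apart, so adjacent threads touch
// adjacent memory locations (coalesced access).
template <typename T>
__global__ void merge_many(const T *A, const T *B, T *C,
                           int la, int lb, int n, int stride)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int ao = (stride == 1) ? idx * la : idx;
    int bo = (stride == 1) ? idx * lb : idx;
    int co = (stride == 1) ? idx * (la + lb) : idx;
    sequential_merge(A + ao, B + bo, C + co, la, lb, stride);
}

int main()
{
    size_t in_sz = (size_t)NUM_PAIRS * LEN * sizeof(int);
    int *A, *B, *C;                      // managed memory for brevity
    cudaMallocManaged(&A, in_sz);
    cudaMallocManaged(&B, in_sz);
    cudaMallocManaged(&C, 2 * in_sz);
    for (int p = 0; p < NUM_PAIRS; p++)  // column-major fill, rows sorted
        for (int e = 0; e < LEN; e++) {
            A[e * NUM_PAIRS + p] = 2 * e;
            B[e * NUM_PAIRS + p] = 2 * e + 1;
        }
    merge_many<<<(NUM_PAIRS + 255) / 256, 256>>>(A, B, C, LEN, LEN,
                                                 NUM_PAIRS, NUM_PAIRS);
    cudaDeviceSynchronize();
    printf("pair 0, C[0..3]: %d %d %d %d\n",   // expect: 0 1 2 3
           C[0], C[NUM_PAIRS], C[2 * NUM_PAIRS], C[3 * NUM_PAIRS]);
    return 0;
}
```

Something like `nvcc -O3 merge_many.cu` should build it. The windowed sum from the question can be folded in by having `sequential_merge` write a running window sum instead of the bare selected element.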
Notes:
- The GPU is actually slower than the naive single-threaded CPU code when the memory organization is row-major. But for the column-major organization (which tends to improve opportunities for coalesced access), the GPU code is about 10x faster than the CPU code for my test case. This ~10x factor is roughly in the range (~10-20x) of the speedup factors shown in the paper for GPU MergePath on 32-bit integers vs. x86 serial merge.
- Using `int` vs. `float` datatypes makes a significant difference in the CPU timing. `int` seems to be faster (on the CPU), so I'm showing that version here. (This disparity is mentioned in the paper as well.)
- The `-DTIMING` switch added to the compile command pares down the first validation function so that it just does the CPU merge operation, for timing.
- The basic merge code is templated to handle different data types, and parameterized so that it can be used in either the column-major or row-major operation.
- I've dispensed with CUDA error checking for brevity of presentation. However, any time you're having trouble with a CUDA code, you should always use proper CUDA error checking.
What about using thrust (as I suggested in the comments)? It should be possible to use thrust::merge with a suitable device/sequential execution policy, to more or less mimic what I have done above. However, thrust expects vectors to be contiguous, and so, without additional complexity, it could only be used in the row-major case, which we've seen is severely penalized by bad memory access patterns. It should be possible to create a set of permutation iterators in thrust that would allow the column-major, strided access that improves the memory scenario, but I have not pursued that.
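For reference, here is a minimal sketch of that contiguous (row-major) thrust variant, calling thrust::merge with the thrust::seq execution policy from inside a kernel; the kernel name and the toy sizes are illustrative:

```cuda
#include <thrust/merge.h>
#include <thrust/execution_policy.h>
#include <cstdio>

// Each thread runs a sequential thrust::merge on its own (A,B) pair.
__global__ void merge_many_thrust(const int *A, const int *B, int *C,
                                  int la, int lb, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        thrust::merge(thrust::seq,
                      A + idx * la, A + (idx + 1) * la,  // this thread's A
                      B + idx * lb, B + (idx + 1) * lb,  // this thread's B
                      C + idx * (la + lb));              // its output slot
}

int main()
{
    const int n = 4, len = 8;
    int *A, *B, *C;
    cudaMallocManaged(&A, n * len * sizeof(int));
    cudaMallocManaged(&B, n * len * sizeof(int));
    cudaMallocManaged(&C, 2 * n * len * sizeof(int));
    for (int p = 0; p < n; p++)
        for (int e = 0; e < len; e++) {
            A[p * len + e] = 2 * e;       // each row already sorted
            B[p * len + e] = 2 * e + 1;
        }
    merge_many_thrust<<<1, n>>>(A, B, C, len, len, n);
    cudaDeviceSynchronize();
    for (int k = 0; k < 2 * len; k++) printf("%d ", C[k]);  // pair 0
    printf("\n");
    return 0;
}
```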