I am looking for an optimisation strategy for my CUDA program. At each iteration inside the for loop of my kernel, each thread produces a score. I am maintaining a shared priority queue of the scores to keep the top-k of them per block. Please see the pseudo-code below:
__global__ void gpuCompute(... arguments)
{
    __shared__ int myPriorityQueue[k];           // maintains the top-k scores (k < #threads in this block)
    __shared__ int scoresToInsertInQueue[#threadsInBlock];
    __shared__ int counter;

    if (threadIdx.x == 0) counter = 0;           // initialise the shared counter once
    __syncthreads();

    for (...)                                    // about 1-100 million iterations
    {
        int score = calculate_score(...);
        if (score > /* minimum element of priority queue */ && ...)
        {
            int localCounter = atomicAdd(&counter, 1);   // ATOMIC: reserve a slot
            scoresToInsertInQueue[localCounter] = score;
        }
        __syncthreads();

        // Merge scores from scoresToInsertInQueue into myPriorityQueue
        while (counter > 0)
        {
            // Parallel insertion of scoresToInsertInQueue[counter-1] into
            // myPriorityQueue using k cooperating threads of this block
            if (threadIdx.x == 0) counter--;     // one designated thread decrements
            __syncthreads();
        }
        __syncthreads();
    }
}
Hoping that the above code makes sense to you guys. Now, I am looking for a way to remove the atomic-operation overhead, such that each thread saves a '1' or '0' depending on whether its value should go into the priority queue or not. I am wondering if there is any implementation of stream compaction within a kernel, so that I can reduce a '1000000000100000000' buffer to '11000000000000000000' (or learn the indices of the '1's) and finally insert the scores corresponding to the '1's into the queue.
Note that the '1's would be very sparse in this case.
If the ones are very sparse, the atomic method may be fastest. However, the method I describe here will have more predictable and bounded worst-case performance. For a good mix of ones and zeroes in your decision array, it may be faster to use a parallel scan or prefix sum to build an insertion-point index array out of the decision array:
Suppose I have a thresholding decision that chooses scores > 30 to go into the queue. My data might look like this:

score:       25  32  17  40  30  12  36   8
decision:     0   1   0   1   0   0   1   0
excl. scan:   0   0   1   1   2   2   2   3

Then each thread makes a storage choice as follows: a thread whose decision bit is 1 writes its score into scoresToInsertInQueue at the index given by the exclusive prefix sum of the decision array. The total number of insertions is the last scan value plus the last decision bit, so the shared atomic counter is no longer needed.
CUB has a fast parallel prefix scan; cub::BlockScan performs block-wide scans in shared memory, which fits this per-block use case.
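Inside the kernel, the scan step might be sketched with cub::BlockScan as below. This is only an outline under assumptions: BLOCK_THREADS is a compile-time block size, and score, queueMin, and scoresToInsertInQueue stand in for the corresponding values in your kernel.

```cuda
#include <cub/cub.cuh>

template <int BLOCK_THREADS>
__global__ void gpuComputeScan(... arguments)
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
    __shared__ typename BlockScan::TempStorage tempStorage;

    // ... per-iteration score computation as before ...

    int decision = (score > queueMin) ? 1 : 0;   // per-thread 0/1 flag
    int insertIdx;
    int totalAccepted;                           // replaces the atomic counter
    BlockScan(tempStorage).ExclusiveSum(decision, insertIdx, totalAccepted);

    if (decision)
        scoresToInsertInQueue[insertIdx] = score;
    __syncthreads();

    // ... merge totalAccepted scores into myPriorityQueue ...
}
```

ExclusiveSum's block-aggregate output gives every thread the total count in the same call, so the atomicAdd and the shared counter disappear entirely, and the cost per iteration is the fixed cost of one block-wide scan regardless of how many threads accept.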