I am looking for an optimisation strategy for my CUDA program. At each iteration inside the for loop of my kernel, each thread produces a score. I am maintaining a shared priority queue of the scores to keep the top-k of them per block. Please see the pseudo-code below:
__global__ void gpuCompute(... arguments)
{
    __shared__ int myPriorityQueue[k];           // maintains the top-k scores (k < #threads in this block)
    __shared__ int scoresToInsertInQueue[#threadsInBlock];
    __shared__ int counter;

    if (threadIdx.x == 0) counter = 0;           // initialise the shared counter once
    __syncthreads();

    for (...)                                    // about 1-100 million iterations
    {
        int score = calculate_score(...);
        if (score > /* minimum element of priority queue */ && ...)
        {
            int localCounter = atomicAdd(&counter, 1);   // ATOMIC: reserve a slot
            scoresToInsertInQueue[localCounter] = score;
        }
        __syncthreads();

        // Merge scores from scoresToInsertInQueue into myPriorityQueue
        while (counter > 0)
        {
            // Parallel insertion of scoresToInsertInQueue[counter-1] into
            // myPriorityQueue using k cooperating threads of this block
            if (threadIdx.x == 0) counter--;     // one designated thread decrements
            __syncthreads();
        }
        __syncthreads();
    }
}
Hoping that the above code makes sense to you guys. Now, I am looking for a way to remove the atomic-operation overhead, such that each thread saves a '1' or '0' depending on whether its value should go into the priority queue or not. I am wondering if there is any implementation of stream compaction within a kernel, so that I can reduce a '1000000000100000000' buffer to '11000000000000000000' (or learn the indices of the '1's) and finally insert the scores corresponding to the '1's into the queue.
Note that the '1's would be very sparse in this case.
If the ones are very sparse, the atomic method may be fastest. However, the method I describe here will have more predictable and bounded worst-case performance. For a good mix of ones and zeroes in your decision array, it may be faster to use a parallel scan or prefix sum to build an insertion-point index array out of the decision array:
Suppose I have a thresholding decision that chooses scores > 30 to go into the queue. My data might look like this:

score:       25  32  17  40  30  12  36   8
decision:     0   1   0   1   0   0   1   0
excl. scan:   0   0   1   1   2   2   2   3

Then each thread makes a storage choice as follows: a thread whose decision bit is 1 writes its score into scoresToInsertInQueue at the index given by the exclusive prefix sum of the decision array. The total number of insertions is the last scan value plus the last decision bit, so the shared atomic counter is no longer needed.
CUB has a fast parallel prefix scan; cub::BlockScan performs block-wide scans in shared memory, which fits this per-block use case.
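Inside the kernel, the scan step might be sketched with cub::BlockScan as below. This is only an outline under assumptions: BLOCK_THREADS is a compile-time block size, and score, queueMin, and scoresToInsertInQueue stand in for the corresponding values in your kernel.

```cuda
#include <cub/cub.cuh>

template <int BLOCK_THREADS>
__global__ void gpuComputeScan(... arguments)
{
    typedef cub::BlockScan<int, BLOCK_THREADS> BlockScan;
    __shared__ typename BlockScan::TempStorage tempStorage;

    // ... per-iteration score computation as before ...

    int decision = (score > queueMin) ? 1 : 0;   // per-thread 0/1 flag
    int insertIdx;
    int totalAccepted;                           // replaces the atomic counter
    BlockScan(tempStorage).ExclusiveSum(decision, insertIdx, totalAccepted);

    if (decision)
        scoresToInsertInQueue[insertIdx] = score;
    __syncthreads();

    // ... merge totalAccepted scores into myPriorityQueue ...
}
```

ExclusiveSum's block-aggregate output gives every thread the total count in the same call, so the atomicAdd and the shared counter disappear entirely, and the cost per iteration is the fixed cost of one block-wide scan regardless of how many threads accept.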