I am new to CUDA. I am working on basic parallel algorithms, such as reduction, in order to understand how thread execution works. I have the following code:
__global__ void
Reduction2_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();
    for ( int activeThreads = blockDim.x>>1;
          activeThreads > 32;
          activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( threadIdx.x < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this statement, any example please?
        wsSum[tid] += wsSum[tid + 16]; // how are these statements executed in parallel within a warp?
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];
        if ( tid == 0 ) {
            volatile int *wsSum = sPartials; // why is this statement needed?
            out[blockIdx.x] = wsSum[0];
        }
    }
}
Unfortunately it is not clear to me how the code works from the if ( threadIdx.x < 32 ) condition onward. Can somebody give an intuitive example with thread IDs and show how the statements are executed? I think it is important to understand these concepts, so any help would be appreciated!
Let's look at the code in blocks, and answer your questions along the way:
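The first block to look at is the grid-stride accumulation loop, quoted from the kernel above:

```cuda
int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = blockIdx.x*blockDim.x + tid;   // start at this thread's global index
      i < N;
      i += blockDim.x*gridDim.x ) {             // stride by the total thread count
    sum += in[i];
}
```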
The above code travels through a data set of size N. An assumption we can make for understanding purposes is that N > blockDim.x*gridDim.x, this last term simply being the total number of threads in the grid. Since N is larger than the total number of threads, each thread is summing multiple elements from the data set. From the standpoint of a given thread, it is summing elements that are spaced apart by the total number of threads in the grid (blockDim.x*gridDim.x). Each thread stores its running sum in a local (presumably register) variable named sum.
As each thread finishes (i.e., as its for-loop index exceeds N) it stores its intermediate sum in shared memory, and then waits for all other threads in the block to finish.

So far we haven't talked about the dimension of the block - it hasn't mattered. Let's assume each block has some integer multiple of 32 threads. The next step is to start gathering the various intermediate sums stored in shared memory into smaller and smaller groups of variables. The code starts out by selecting half of the threads in the threadblock (blockDim.x>>1) and uses each of those threads to combine two of the partial sums in shared memory. So if our threadblock started out at 128 threads, we just used 64 of those threads to reduce 128 partial sums into 64 partial sums. This process continues repeatedly in the for-loop, each time cutting the number of active threads in half and combining partial sums, two at a time per thread. The process continues as long as activeThreads > 32. So if activeThreads is 64, those 64 threads will combine 128 partial sums into 64 partial sums. But when activeThreads becomes 32, the for-loop terminates, without combining 64 partial sums into 32. So at the completion of this block of code, we have taken the (arbitrary multiple of 32 threads) threadblock and reduced however many partial sums we started out with down to 64. This process of combining, say, 256 partial sums to 128 partial sums to 64 partial sums must wait at each iteration for all threads (in multiple warps) to complete their work, so a __syncthreads(); statement is executed with each pass of the for-loop.

Keep in mind that at this point, we have reduced our threadblock to 64 partial sums.
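For reference, here is the loop being described, quoted from the kernel, with a trace for blockDim.x == 128 added in the comments:

```cuda
for ( int activeThreads = blockDim.x>>1;   // 128 threads -> activeThreads = 64
      activeThreads > 32;                  // stops before a 32-thread pass
      activeThreads >>= 1 ) {
    if ( tid < activeThreads ) {
        // pass 1: threads 0..63 fold sPartials[64..127] into sPartials[0..63]
        sPartials[tid] += sPartials[tid+activeThreads];
    }
    __syncthreads();                       // all warps must finish each pass
}
// For blockDim.x == 128 the loop body runs exactly once, leaving 64 partial sums.
```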
For the remainder of the kernel after this point, we will only be using the first 32 threads (i.e., the first warp). All other threads will remain idle. Note that there are no __syncthreads(); calls after this point either, as that would be a violation of the rule for using it (all threads in the block must participate in a __syncthreads();).

We are now creating a volatile pointer to shared memory. In theory, this tells the compiler that it should not perform various optimizations, such as caching a particular value in a register. Why didn't we need this before? Because __syncthreads(); also carries with it a memory-fencing function. A __syncthreads(); call, in addition to causing all threads to wait at the barrier for each other, also forces all thread updates back into shared or global memory. We can no longer depend on this feature, however, because from here on out we will not be using __syncthreads(); - we have restricted ourselves, for the remainder of the kernel, to a single warp.

The previous reduction block left us with 64 partial sums. But we have at this point restricted ourselves to 32 threads. So we must do one more combination to gather the 64 partial sums into 32 partial sums, before we can proceed with the remainder of the reduction.
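The statement being asked about in the question is this one:

```cuda
// Folds sPartials[32..63] into sPartials[0..31]. The guard matters because
// when blockDim.x == 32 the tree loop never ran, there are only 32 partial
// sums, and wsSum[tid + 32] would read past the end of the shared allocation.
if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
```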
Now we are finally getting into some warp-synchronous programming. This line of code (wsSum[tid] += wsSum[tid + 16];) depends on the fact that 32 threads are executing in lockstep. To understand why (and how it works at all), it is convenient to break it down into the sequence of operations needed to complete this line of code. It looks something like:
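(A reconstruction of the per-thread micro-sequence; the temporaries a and b are illustrative registers, not variables that appear in the code.)

```cuda
// Per-thread micro-sequence for: wsSum[tid] += wsSum[tid + 16];
int a = wsSum[tid];        // 1. read my own partial sum into a register
int b = wsSum[tid + 16];   // 2. read the partial sum 16 slots above mine
a = a + b;                 // 3. add the two partial sums in registers
wsSum[tid] = a;            // 4. store the result back into my slot
```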
All 32 threads will follow the above sequence in lockstep. All 32 threads begin by reading wsSum[tid] into a (thread-local) register: thread 0 reads wsSum[0], thread 1 reads wsSum[1], etc. After that, each thread reads another partial sum into a different register: thread 0 reads wsSum[16], thread 1 reads wsSum[17], etc. It's true that we don't care about the wsSum[32] (and higher) values; we've already collapsed those into the first 32 wsSum[] values. However, as we'll see, only the first 16 threads (at this step) contribute to the final result, so those 16 threads are combining the 32 partial sums into 16. The other 16 threads are acting as well, but they are just doing garbage work - it will be ignored.

The above step combined 32 partial sums into the first 16 locations in wsSum[]. The next line of code, wsSum[tid] += wsSum[tid + 8];, repeats this process with a granularity of 8. Again, all 32 threads are active, and the micro-sequence is something like this:
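(Reconstructed in the same style as the previous micro-sequence.)

```cuda
// Per-thread micro-sequence for: wsSum[tid] += wsSum[tid + 8];
int a = wsSum[tid];       // 1. read my own partial sum into a register
int b = wsSum[tid + 8];   // 2. read the partial sum 8 slots above mine
a = a + b;                // 3. add the two partial sums in registers
wsSum[tid] = a;           // 4. store the result back into my slot
```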
So the first 8 threads combine the first 16 partial sums (wsSum[0..15]) into 8 partial sums (contained in wsSum[0..7]). The next 8 threads are also combining wsSum[8..23] into wsSum[8..15], but the writes to locations 8..15 occur after those values were read by threads 0..7, so the valid data is not corrupted. It's just extra junk work going on. Likewise for the other groups of 8 threads within the warp. So at this point we have combined the partial sums of interest into 8 locations.

The remaining lines of code (wsSum[tid] += wsSum[tid + 4]; and so on) follow a similar pattern as the previous two, partitioning the warp into 8 groups of 4 threads (only the first 4-thread group contributing to the final result), then into 16 groups of 2 threads (only the first 2-thread group contributing), and finally into 32 groups of 1 thread each, each thread generating a partial sum, with only the first partial sum being of interest.
At last, in the previous step, we had reduced all partial sums down to a single value. It's now time to write that single value out to global memory. Are we done with the reduction? Perhaps, but probably not. If the above kernel were launched with only 1 threadblock, then we would be done -- our final "partial" sum is in fact the sum of all elements in the data set. But if we launched multiple blocks, then the final result from each block is still a "partial" sum, and the results from all blocks must be added together (somehow).
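One common way to finish the job when multiple blocks are launched is to run the same kernel a second time over the per-block results. This is a sketch, not part of the original code; d_in, d_partial, and d_out are assumed device allocations, and numBlocks/blockSize are an example launch configuration:

```cuda
int numBlocks = 64, blockSize = 256;
// First pass: one partial sum per block, written to d_partial.
Reduction2_kernel<<<numBlocks, blockSize, blockSize*sizeof(int)>>>(
    d_partial, d_in, N );
// Second pass: a single block reduces the numBlocks partial sums.
Reduction2_kernel<<<1, blockSize, blockSize*sizeof(int)>>>(
    d_out, d_partial, numBlocks );
// d_out[0] now holds the sum of all N elements.
```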
And to answer your final question: the second volatile int *wsSum = sPartials; declaration (inside the if ( tid == 0 ) block) is not needed, since the outer declaration is still in scope. My guess is that it was left over from a previous iteration of the reduction kernel, and the programmer forgot to delete it, or didn't notice that it wasn't needed. Perhaps someone else will know the answer to this one.
Finally, the CUDA reduction sample provides very good reference code for study, and the accompanying PDF document does a good job of describing the optimizations that can be made along the way.