I'm following a previously answered question here about how to implement an all-reduce in CUDA, which links to a slide deck from NVIDIA. What I have works the majority of the time (when the input size is a nice power of 2), but some input sizes result in answers which seem to be off by 2.
In the reduction, once there are <= 32 elements left to reduce, the last warp is simply unrolled. The final step sums partial_sum[0] + partial_sum[1]. In the example below, I print out the stored values before and after that final addition to see where the calculation deviates. Here is the output of the program:
threadIdx 0, partial_sum[0] 14710790.000000
threadIdx 1, partial_sum[1] 18872320.000000
threadIdx 0, partial_sum[0] 33583112.000000
threadIdx 1, partial_sum[1] 33059334.000000
expected 33583110, result 33583112
The first two lines are the stored partial sums. The correct result is 14710790 + 18872320 = 33583110, which should then be stored in partial_sum[0] and printed on line 3. The stored partial sums on the first two lines are correct, but the result of the final addition on line 3 is not.
Here is the full, self-contained example, compiled with nvcc test.cu. Any guidance would be appreciated!
#include <iostream>
#include <numeric>
#include <vector>
constexpr unsigned int THREADS_PER_DIM = 512;
__global__ void reduce(float *v, float *v_r, std::size_t N, bool print) {
// Allocate shared memory
volatile __shared__ float partial_sum[THREADS_PER_DIM * 4];
// Load elements AND do first add of reduction
// We halve the number of blocks by having each thread add its element and the corresponding element from what would have been the next block
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
// Store first partial result instead of just the elements
// Set to 0 first to be safe
partial_sum[threadIdx.x + blockDim.x] = 0;
partial_sum[threadIdx.x] = 0;
partial_sum[threadIdx.x] = (i < N ? v[i] : 0) + ((i + blockDim.x) < N ? v[i + blockDim.x] : 0);
__syncthreads();
// Start at 1/2 block stride and divide by two each iteration
// Stop early once only the last warp remains; the remaining steps are unrolled below (the slides suggest a device function for this)
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
// Each thread does work unless it is further than the stride
if (threadIdx.x < s) {
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
}
__syncthreads();
}
if (threadIdx.x < 32) {
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 32];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 16];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 8];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 4];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 2];
if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
}
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 1];
if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
}
}
// Let thread 0 of this block write its result to global memory
// Result is indexed by this block
if (threadIdx.x == 0) {
v_r[blockIdx.x] = partial_sum[0];
}
}
constexpr std::size_t N = (1 << 13) + 4;
int main() {
// Init and fill vec
std::vector<float> a(N);
for (std::size_t i = 0; i < N; ++i) {
a[i] = static_cast<float>(i);
}
// allocate and copy on device
std::size_t bytes = N * sizeof(float);
float *d_A;
float *d_A_reduction;
cudaMalloc((void **)&d_A, bytes);
cudaMalloc((void **)&d_A_reduction, bytes);
cudaMemset(d_A_reduction, 0, bytes);
cudaMemcpy(d_A, a.data(), bytes, cudaMemcpyHostToDevice);
// 8192 + 4 elements with 512 threads per block would be 16 + 1 blocks,
// but each block also consumes what would be the next block's elements on the first add in the kernel,
// so ceil(8196 / 1024) = 9 blocks suffice
auto num_blocks = 9;
reduce<<<num_blocks, THREADS_PER_DIM>>>(d_A, d_A_reduction, N, false);
cudaMemcpy(d_A, d_A_reduction, bytes, cudaMemcpyDeviceToDevice);
reduce<<<1, THREADS_PER_DIM>>>(d_A, d_A_reduction, num_blocks, true);
float result = 0;
cudaMemcpy(&result, d_A_reduction, sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_A_reduction);
auto correct_result = std::accumulate(a.begin(), a.end(), 0.0);
std::cout << "expected " << static_cast<int>(correct_result) << ", result " << static_cast<int>(result)
<< std::endl;
}
As noted in the comments, there were two issues. The first: CUDA 9 changed the guarantees about threads in a warp remaining in lockstep, so the warp-synchronous (volatile-only) unrolled section is no longer safe. This was previously answered here: Cuda min warp reduction produces race condition
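For reference, here is roughly what that fix can look like: a minimal sketch (not necessarily the exact code I ended up with, and with the print statements omitted) that replaces the unrolled if (threadIdx.x < 32) block with a warp-shuffle reduction, which keeps the running value in registers and synchronizes through the full-warp mask; adding __syncwarp() between the original shared-memory steps is the other common option.

// Hypothetical replacement for the `if (threadIdx.x < 32)` block in the kernel above.
if (threadIdx.x < 32) {
    // Fold the upper 32 shared-memory slots into a register; the earlier
    // __syncthreads() makes partial_sum[threadIdx.x + 32] visible here.
    float val = partial_sum[threadIdx.x] + partial_sum[threadIdx.x + 32];
    // Warp shuffles keep the value in registers; the 0xffffffff mask means
    // all 32 lanes participate, so no volatile access or __syncwarp() is needed.
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    if (threadIdx.x == 0) {
        partial_sum[0] = val; // picked up by the existing write to v_r below
    }
}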
The second issue, which remained after fixing the first, was floating-point imprecision: the exact sum 32774 + 33550336 = 33583110 is not representable as a single-precision float, so it rounds to the nearest representable value, 33583112, which matches the off-by-2 result.
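The rounding is easy to reproduce on the host with plain float arithmetic; a small standalone check:

#include <cstdio>

int main() {
    // Above 2^25, consecutive floats are 4 apart, so the exact sum 33583110
    // is not representable and rounds (ties-to-even) to 33583112.
    float a = 32774.0f;
    float b = 33550336.0f;
    printf("%.1f\n", a + b);        // prints 33583112.0
    printf("%.1f\n", 33583110.0f);  // the literal itself also rounds to 33583112.0
    return 0;
}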