Reduction in CUDA


I'm just starting to learn CUDA programming, and I have some confusion about reduction.

I know that global memory has much higher access latency than shared memory, but can I use global memory to (at least) simulate a behavior similar to shared memory?

For example, I want to sum the elements of a large array whose length is exactly BLOCK_SIZE * THREAD_SIZE (both the grid and block dimensions are powers of 2), and I've tried to use the code below:

    __global__ void parallelSum(unsigned int* array) {

        unsigned int totalThreadsNum = gridDim.x * blockDim.x;
        unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;

        int i = totalThreadsNum / 2;
        while (i != 0) {
            if (idx < i) {
                array[idx] += array[idx + i];
            }
            __syncthreads();
            i /= 2;
        }
    }

I compared the result of this code with the result computed serially on the host, and the weird thing is: sometimes the results are the same, but sometimes they are clearly different. Could the reason be related to using global memory here?


There are 2 answers

Tom (best answer)

Your best bet is to start with the reduction example in the CUDA Samples. The scan example is also good for learning the principles of parallel computing on a throughput architecture.

That said, if you actually just want to use a reduction operator in your code then you should look at Thrust (calling from host, cross-platform) and CUB (CUDA GPU specific).

To look at your specific questions:

  • There's no reason you can't use global memory for a reduction; the example code in the toolkit walks through various levels of optimisation, but in each case the data starts in global memory.
  • Your code is inefficient (see the example in the toolkit for more details on the work efficiency!).
  • Your code is attempting to communicate between threads in different blocks without proper synchronisation; __syncthreads() only synchronises threads within a specific block, not across different blocks (that would be impossible, at least generically, since you tend to oversubscribe the GPU, meaning not all blocks will be running at any given time).

The last point is the most important. If a thread in block X wants to read data that is written by block Y, then you need to break this across two kernel launches; that's why a typical parallel reduction takes a multi-phase approach: reduce batches within blocks, then reduce between the batches.
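To make that concrete, below is a minimal sketch of the multi-phase idea (an illustration only, not the toolkit sample; the kernel name blockSum and the launch parameters are assumptions): each block stages its slice of the input in shared memory, reduces it with a tree loop, and writes one partial sum back to global memory. Power-of-two block and grid sizes are assumed, as in the question.

    // Phase 1: each block reduces its blockDim.x-element slice of `in` to a
    // single value in out[blockIdx.x]. Launched a second time on the per-block
    // partial sums, the same kernel performs phase 2.
    __global__ void blockSum(const unsigned int* in, unsigned int* out, unsigned int n) {

        extern __shared__ unsigned int sdata[];

        unsigned int tid = threadIdx.x;
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Each thread stages one element of its block's slice in shared memory.
        sdata[tid] = (idx < n) ? in[idx] : 0;
        __syncthreads();

        // Tree reduction within the block; __syncthreads() is valid here
        // because all cooperating threads belong to the same block.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // One partial sum per block goes back to global memory.
        if (tid == 0) {
            out[blockIdx.x] = sdata[0];
        }
    }

On the host this would be launched twice, for example blockSum<<<numBlocks, blockSize, blockSize * sizeof(unsigned int)>>>(d_in, d_partial, n) followed by blockSum<<<1, numBlocks, numBlocks * sizeof(unsigned int)>>>(d_partial, d_sum, numBlocks), assuming numBlocks fits in a single block; the boundary between the two launches provides the grid-wide synchronisation that __syncthreads() cannot.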

Vitality

Tom has already answered this question. In his answer, he recommends using Thrust or CUB to perform reductions in CUDA.

Here, I'm providing a fully worked example of how to use both libraries to perform reductions.

#define CUB_STDERR

#include <stdio.h>
#include <stdlib.h>

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

#include <cub/device/device_reduce.cuh>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

using namespace cub;

/********/
/* MAIN */
/********/
int main() {

    const int N = 8388608;

    gpuErrchk(cudaFree(0));

    float *h_data       = (float *)malloc(N * sizeof(float));
    float h_result = 0.f;

    for (int i=0; i<N; i++) {
        h_data[i] = 3.f;
        h_result = h_result + h_data[i];
    }

    TimingGPU timerGPU;

    float *d_data;          gpuErrchk(cudaMalloc((void**)&d_data, N * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));

    /**********/
    /* THRUST */
    /**********/
    timerGPU.StartCounter();
    thrust::device_ptr<float> wrapped_ptr = thrust::device_pointer_cast(d_data);
    float h_result1 = thrust::reduce(wrapped_ptr, wrapped_ptr + N);
    printf("Timing for Thrust = %f\n", timerGPU.GetCounter());

    /*******/
    /* CUB */
    /*******/
    timerGPU.StartCounter();
    float           *h_result2 = (float *)malloc(sizeof(float));
    float           *d_result2; gpuErrchk(cudaMalloc((void**)&d_result2, sizeof(float)));
    void            *d_temp_storage = NULL;
    size_t          temp_storage_bytes = 0;

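    // The first call, with d_temp_storage == NULL, performs no reduction: it
    // only writes the required scratch size to temp_storage_bytes. After the
    // cudaMalloc below, the second call does the actual reduction.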
    DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_data, d_result2, N);
    gpuErrchk(cudaMalloc((void**)&d_temp_storage, temp_storage_bytes));
    DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_data, d_result2, N);

    gpuErrchk(cudaMemcpy(h_result2, d_result2, sizeof(float), cudaMemcpyDeviceToHost));

    printf("Timing for CUB = %f\n", timerGPU.GetCounter());

    printf("Results:\n");
    printf("Exact: %f\n", h_result);
    printf("Thrust: %f\n", h_result1);
    printf("CUB: %f\n", h_result2[0]);

}

Please note that CUB can be somewhat faster than Thrust because of a different underlying philosophy: CUB leaves performance-critical details, such as the exact choice of the algorithm and the degree of concurrency, unbound and in the hands of the user. In this way, these parameters can be tuned to maximize performance for a particular architecture and application.
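As a sketch of what this means in practice (hypothetical code, not part of the example above): with cub::BlockReduce, the block size and the number of items each thread processes are compile-time template parameters chosen by the user, so the kernel can be re-tuned per architecture simply by changing two template arguments.

    #include <cub/cub.cuh>

    // Hypothetical tuning sketch: BLOCK_THREADS and ITEMS_PER_THREAD are knobs
    // the user picks for the target GPU; CUB does not fix them. Assumes the
    // input length is a multiple of BLOCK_THREADS * ITEMS_PER_THREAD.
    template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
    __global__ void partialSums(const float *d_in, float *d_block_sums) {

        typedef cub::BlockReduce<float, BLOCK_THREADS> BlockReduceT;
        __shared__ typename BlockReduceT::TempStorage temp_storage;

        // Each thread loads ITEMS_PER_THREAD consecutive elements of its block's tile.
        float items[ITEMS_PER_THREAD];
        cub::LoadDirectBlocked(threadIdx.x, d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD, items);

        // Block-wide sum; the result is only valid in thread 0.
        float block_sum = BlockReduceT(temp_storage).Sum(items);
        if (threadIdx.x == 0) d_block_sums[blockIdx.x] = block_sum;
    }

A launch such as partialSums<256, 4><<<N / (256 * 4), 256>>>(d_data, d_block_sums), followed by a reduction of d_block_sums, gives the full sum; trying, say, 128 threads with 8 items per thread requires no change other than the template arguments.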

A comparison for calculating the Euclidean norm of an array is reported at CUB in Action – some simple examples using the CUB template library.