Heisenbug in CUDA kernel, global memory access

About two years ago, I wrote a kernel to work on several numerical grids simultaneously. Some very strange behaviour emerged, which resulted in wrong results. When I hunted the bug down with printf() statements inside the kernel, the bug vanished.

Due to deadline constraints, I kept it that way, though recently I decided that this was not an appropriate coding style. So I revisited my kernel and boiled it down to what you see below.

__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
        int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
    __syncthreads();
    int id_sm           = blockIdx.x /   numBlocksPerSM;                                    // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon           - (constant over lifetime of thread)
    int id_blockOnSM    = blockIdx.x % numBlocksPerSM;                                      // Block number on this specific SM                                                 - (constant over lifetime of thread)
    int id_r            = id_blockOnSM  * (blockDim.x - 2*radius) + threadIdx.x - radius;   // Grid point number this thread is to work upon                                    - (constant over lifetime of thread)
    int id_grid         = id_sm         * numGridsPerSM;                                    // Grid ID this thread is to work upon                                              - (not constant over lifetime of thread)

    while(id_grid < numGridsPerSM * (id_sm + 1))    // this loops over numGridsPerSM grids
    {
        __syncthreads();
        int id_numInArray       = id_grid * numNodesPerGrid + id_r;     // Entry in array this thread is responsible for (read and possibly write)  - (not constant over lifetime of thread)
        float uchange           = 0.0f;
        //uchange                   = 1.0f;                                 // if this line is uncommented, results will be computed correctly ("Solution 1")
        float du                = 0.0f;

        if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
        {
            if (id_r == 0)  // FO-forward difference
                du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
            else if (id_r == numNodesPerGrid - 1)  // FO-rearward difference
                du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
            else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
                du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
            else if(id_r > 1 && id_r < numNodesPerGrid - 2)
                du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
            else
                du = 0;
        }

        __syncthreads();
        if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
        {
            d_u[    id_numInArray] = d_u[id_numInArray] * uchange;          // if this line is commented out, results will be computed correctly ("Solution 2")
            d_du[   id_numInArray] = du;
        }

        __syncthreads();
        ++id_grid;
    }
}

This kernel computes the derivative of some value at all grid points for a number of numerical 1D-grids.
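
For reference, away from the grid boundaries the stencil is the blend of two central differences in the last branch of the if-cascade above, written out as a formula (with $u_i$ and $r_i$ denoting the entries of d_u and d_r at grid point $i$):

$$\left.\frac{du}{dr}\right|_i \approx c\,\frac{u_{i+1}-u_{i-1}}{r_{i+1}-r_{i-1}} + (1-c)\,\frac{u_{i+2}-u_{i-2}}{r_{i+2}-r_{i-2}}, \qquad c = \texttt{d\_fourpoint\_constant} = 0.2$$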

Things to consider: (see full code base at the bottom)

    • a grid consists of 1300 grid points
    • each grid has to be worked upon by two blocks (due to memory/register limitations)
    • each block successively works on 37 grids (or better: grid halves, the while-loop takes care of that)
    • each thread is responsible for the same grid point in each grid
    • for the derivative to be computed, a thread needs access to the values at the two neighbouring grid points on each side (four points in total)
    • in order to keep the blocks independent from each other, a small overlap on the grid is introduced (grid points 666, 667, 668 and 669 of each grid are read by threads from two different blocks, though only one thread writes to each of them); it is at this overlap that the problems occur, see the indexing sketch right after this list
    • due to the boiling-down process, the two threads on each side of each block do no computations; in the original kernel they are responsible for writing the corresponding grid values to shared memory
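
    To make the overlap concrete, here is a small host-side sketch (the constants are copied from the launch configuration in main() below, nothing else is assumed) that prints which id_r range each of a grid's two blocks writes and which range its four-point stencil reads:

        #include <cstdio>

        int main()
        {
            const int TPB = 672, radius = 2, numNodesPerGrid = 1300;
            const int interior = TPB - 2 * radius;   // 668 grid points written per block

            for (int id_blockOnSM = 0; id_blockOnSM < 2; ++id_blockOnSM)
            {
                // active threads satisfy radius <= threadIdx.x < TPB - radius,
                // so each block writes a contiguous range of grid points:
                int firstWrite = id_blockOnSM * interior;
                int lastWrite  = firstWrite + interior - 1;
                if (lastWrite > numNodesPerGrid - 1) lastWrite = numNodesPerGrid - 1;

                // the four-point stencil also reads up to two neighbours on each side:
                int firstRead = (firstWrite - 2 < 0) ? 0 : firstWrite - 2;
                int lastRead  = (lastWrite + 2 > numNodesPerGrid - 1) ? numNodesPerGrid - 1 : lastWrite + 2;

                printf("block %d: writes id_r %d..%d, reads id_r %d..%d\n",
                       id_blockOnSM, firstWrite, lastWrite, firstRead, lastRead);
            }
            return 0;
        }

    The output is "block 0: writes id_r 0..667, reads id_r 0..669" and "block 1: writes id_r 668..1299, reads id_r 666..1299": points 666 and 667 are written by block 0 but read by block 1, and points 668 and 669 are written by block 1 but read by block 0.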

    The values of the grids are stored in u_arr, du_arr and r_arr (and their corresponding device arrays d_u, d_du and d_r). Each grid occupies 1300 consecutive values in each of these arrays. The while-loop in the kernel iterates over 37 grids for each block.

    To evaluate the workings of the kernel, each grid is initialized with the exact same values, so a deterministic program will produce the same result for each grid. This does not happen with my code.

    The weirdness of the Heisenbug:

    I compared the computed values of grid 0 with each of the other grids, and there are differences at the overlap (grid points 666-669), though not consistently. Some grids have the right values, some do not. Two consecutive runs will mark different grids as erroneous. The first thing that came to mind was that two threads at this overlap try to write to the same memory location concurrently, though that does not seem to be the case (I checked... and re-checked).

    Commenting or un-commenting lines or using printf() for debugging purposes will alter the outcome of the program as well: when "asking" the threads responsible for the grid points in question, they tell me that everything is all right, and they are actually correct. As soon as I force a thread to print out its variables, the values will be computed (and, more importantly, stored) correctly. The same goes for debugging with Nsight Eclipse.

    Memcheck / Racecheck:

    cuda-memcheck (memcheck and racecheck) reports no memory or race-condition problems, though even using one of these tools can affect the correctness of the results. Valgrind gives some warnings, though I think they have something to do with the CUDA API, which I cannot influence and which seems unrelated to my problem.

    (Update) As pointed out, cuda-memcheck --tool racecheck only works for shared memory race conditions, whereas the problem at hand has a race condition on d_u, i.e., global memory.

    Testing environment:

    The original kernel has been tested on different CUDA devices and with different compute capabilities (2.0, 3.0 and 3.5) with the bug showing up in every configuration (in some form or another).

    My (main) testsystem is the following:

    • 2 x GTX 460, tested on both the GPU that ran the X-server as well as the other one
    • Driver Version: 340.46
    • Cuda Toolkit 6.5
    • Linux Kernel 3.11.0-12-generic (Linux Mint 16 - Xfce)

    State of solution:

    By now I am pretty sure that some memory access is the culprit, maybe some compiler optimization or the use of uninitialized values, and that I obviously do not understand some fundamental CUDA paradigm. The fact that printf() statements inside the kernel (which through some dark magic have to use device and host memory as well) and the memcheck tools (cuda-memcheck and valgrind) influence the behaviour points in the same direction.

    I am sorry for this somewhat complicated kernel, but I boiled the original kernel and invocation down as much as I could, and this is as far as I got. By now I have learned to admire this problem, and I am looking forward to learning what is going on here.

    Two "solutions", which force the kernel do work as intended, are marked in the code.

    (Update) As mentioned in the accepted answer below, the problem with my code is a race condition at the border of the thread blocks. Since there are two blocks working on each grid and there is no guarantee as to which block works first, the behaviour described in the answer results. This also explains the correct results when employing "Solution 1" as marked in the code, because the input/output value d_u is not altered when uchange = 1.0.

    The simple solution is to split this kernel into two kernels, one computing d_u, the other computing the derivative d_du. It would be more desirable to have just one kernel invocation instead of two, though I do not know how to accomplish this with -arch=sm_20. With -arch=sm_35 one could probably use dynamic parallelism to achieve that, though the overhead for the second kernel invocation is negligible.
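
    A minimal sketch of that split (the kernel names heisenkernel_du / heisenkernel_u are mine, and the stencil body is reduced to a plain central difference for brevity; the boundary branches of the original kernel would be copied over unchanged):

        // Pass 1: compute d_du from the unmodified d_u. No block writes to d_u in this
        // kernel, so the cross-block reads at the overlap are harmless.
        __launch_bounds__(672, 2)
        __global__ void heisenkernel_du(float *d_u, float *d_r, float *d_du, int radius,
                int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
        {
            int id_sm        = blockIdx.x / numBlocksPerSM;
            int id_blockOnSM = blockIdx.x % numBlocksPerSM;
            int id_r         = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius;

            for (int id_grid = id_sm * numGridsPerSM; id_grid < numGridsPerSM * (id_sm + 1); ++id_grid)
            {
                int idx = id_grid * numNodesPerGrid + id_r;
                if ((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius)
                        && (id_r > 0) && (id_r < numNodesPerGrid - 1) && (id_grid < numGrids))
                    // central difference as a placeholder for the full boundary-aware stencil
                    d_du[idx] = (d_u[idx+1] - d_u[idx-1]) / (d_r[idx+1] - d_r[idx-1]);
            }
        }

        // Pass 2: update d_u. Every element is written by exactly one thread, and no other
        // block reads it while this kernel runs.
        __launch_bounds__(672, 2)
        __global__ void heisenkernel_u(float *d_u, int radius,
                int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
        {
            int id_sm        = blockIdx.x / numBlocksPerSM;
            int id_blockOnSM = blockIdx.x % numBlocksPerSM;
            int id_r         = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius;

            for (int id_grid = id_sm * numGridsPerSM; id_grid < numGridsPerSM * (id_sm + 1); ++id_grid)
            {
                int idx = id_grid * numNodesPerGrid + id_r;
                float uchange = 0.0f;
                if ((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius)
                        && (id_r < numNodesPerGrid) && (id_grid < numGrids))
                    d_u[idx] = d_u[idx] * uchange;
            }
        }

        // Host side: both launches go into the default stream, so heisenkernel_u only starts
        // once heisenkernel_du has finished, and the race on d_u is gone.
        heisenkernel_du<<<numBlocksPerSM * numSM, TPB>>>(d_u, d_r, d_du, radius,
                numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);
        heisenkernel_u<<<numBlocksPerSM * numSM, TPB>>>(d_u, radius,
                numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);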

    heisenbug.cu:

    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    
    const float r_sol = 6.955E8f;
    __constant__ float d_fourpoint_constant = 0.2f;
    
    __launch_bounds__(672, 2)
    __global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
            int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
    {
        __syncthreads();
        int id_sm           = blockIdx.x / numBlocksPerSM;                                      // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon           - (constant over lifetime of thread)
        int id_blockOnSM    = blockIdx.x % numBlocksPerSM;                                      // Block number on this specific SM                                                 - (constant over lifetime of thread)
        int id_r            = id_blockOnSM  * (blockDim.x - 2*radius) + threadIdx.x - radius;   // Grid point number this thread is to work upon                                    - (constant over lifetime of thread)
        int id_grid         = id_sm         * numGridsPerSM;                                    // Grid ID this thread is to work upon                                              - (not constant over lifetime of thread)
    
        while(id_grid < numGridsPerSM * (id_sm + 1))    // this loops over numGridsPerSM grids
        {
            __syncthreads();
            int id_numInArray       = id_grid * numNodesPerGrid + id_r;     // Entry in array this thread is responsible for (read and possibly write)  - (not constant over lifetime of thread)
            float uchange           = 0.0f;
            //uchange                   = 1.0f;                                 // if this line is uncommented, results will be computed correctly ("Solution 1")
            float du                = 0.0f;
    
            if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
            {
                if (id_r == 0)  // FO-forward difference
                    du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
                else if (id_r == numNodesPerGrid - 1)  // FO-rearward difference
                    du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
                else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
                    du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
                else if(id_r > 1 && id_r < numNodesPerGrid - 2)
                    du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
                else
                    du = 0;
            }
    
            __syncthreads();
            if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
            {
                d_u[    id_numInArray] = d_u[id_numInArray] * uchange;          // if this line is commented out, results will be computed correctly ("Solution 2")
                d_du[   id_numInArray] = du;
            }
    
            __syncthreads();
            ++id_grid;
        }
    }
    
    bool gridValuesEqual(float *matarray, uint id0, uint id1, const char *label, int numNodesPerGrid){
    
        bool retval = true;
        for(uint i=0; i<numNodesPerGrid; ++i)
            if(matarray[id0 * numNodesPerGrid + i] != matarray[id1 * numNodesPerGrid + i])
            {
                printf("value %s at position %u of grid %u not equal that of grid %u: %E != %E, diff: %E\n",
                        label, i, id0, id1, matarray[id0 * numNodesPerGrid + i], matarray[id1 * numNodesPerGrid + i],
                        matarray[id0 * numNodesPerGrid + i] - matarray[id1 * numNodesPerGrid + i]);
                retval = false;
            }
        return retval;
    }
    
    int main(int argc, const char* argv[])
    {
        float *d_u;
        float *d_du;
        float *d_r;
    
        float *u_arr;
        float *du_arr;
        float *r_arr;
    
        int numNodesPerGrid = 1300;
        int numBlocksPerSM  = 2;
        int numGridsPerSM   = 37;
        int numSM           = 7;
        int TPB             = 672;
        int radius          = 2;
        int numGrids        = 259;
        int memsize_grid    = sizeof(float) * numNodesPerGrid;
    
        int numBlocksPerGrid    = numNodesPerGrid / (TPB - 2 * radius) + (numNodesPerGrid%(TPB - 2 * radius) == 0 ? 0 : 1);
    
        printf("---------------------------------------------------------------------------\n");
        printf("--- Heisenbug Extermination Tracker ---------------------------------------\n");
        printf("---------------------------------------------------------------------------\n\n");
    
        cudaSetDevice(0);
        cudaDeviceReset();
    
        cudaMalloc((void **) &d_u,      memsize_grid * numGrids);
        cudaMalloc((void **) &d_du,     memsize_grid * numGrids);
        cudaMalloc((void **) &d_r,      memsize_grid * numGrids);
    
        u_arr   = new float[numGrids * numNodesPerGrid];
        du_arr  = new float[numGrids * numNodesPerGrid];
        r_arr   = new float[numGrids * numNodesPerGrid];
    
        for(uint k=0; k<numGrids; ++k)
            for(uint i=0; i<numNodesPerGrid; ++i)
            {
                uint index  = k * numNodesPerGrid + i;
    
                if (i < 585)
                    r_arr[index] = i * (6000.0f);
                else
                {
                    if (i == 585)
                        r_arr[index] = r_arr[index - 1] + 8.576E-6f * r_sol;
                    else
                        r_arr[index] = r_arr[index - 1] + 1.02102f  * ( r_arr[index - 1] - r_arr[index - 2] );
                }
    
                u_arr[index]    = 1E-10f * (i+1);
                du_arr[index]   = 0.0f;
            }
    
        /*
        printf("\n\nbefore kernel start\n\n");
        for(uint k=0; k<numGrids; ++k)
            printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
    
        bool equal = true;
        for(int k=1; k<numGrids; ++k)
        {
            equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
            equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
            equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
        }
    
        if(!equal)
            printf("Input values are not identical for different grids!\n\n");
        else
            printf("All grids contain the same values at same grid points.!\n\n");
    
        cudaMemcpy(d_u, u_arr,      memsize_grid * numGrids, cudaMemcpyHostToDevice);
        cudaMemcpy(d_du, du_arr,    memsize_grid * numGrids, cudaMemcpyHostToDevice);
        cudaMemcpy(d_r, r_arr,      memsize_grid * numGrids, cudaMemcpyHostToDevice);
    
        printf("Configuration:\n\n");
        printf("numNodesPerGrid:\t%i\nnumBlocksPerSM:\t\t%i\nnumGridsPerSM:\t\t%i\n", numNodesPerGrid, numBlocksPerSM, numGridsPerSM);
        printf("numSM:\t\t\t\t%i\nTPB:\t\t\t\t%i\nradius:\t\t\t\t%i\nnumGrids:\t\t\t%i\nmemsize_grid:\t\t%i\n", numSM, TPB, radius, numGrids, memsize_grid);
        printf("numBlocksPerGrid:\t%i\n\n", numBlocksPerGrid);
        printf("Kernel launch parameters:\n\n");
        printf("moduleA2_3<<<%i, %i, %i>>>(...)\n\n", numBlocksPerSM * numSM, TPB, 0);
        printf("Launching Kernel...\n\n");
    
        heisenkernel<<<numBlocksPerSM * numSM, TPB, 0>>>(d_u, d_r, d_du, radius, numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);
        cudaDeviceSynchronize();
    
        cudaMemcpy(u_arr, d_u,      memsize_grid * numGrids, cudaMemcpyDeviceToHost);
        cudaMemcpy(du_arr, d_du,    memsize_grid * numGrids, cudaMemcpyDeviceToHost);
        cudaMemcpy(r_arr, d_r,      memsize_grid * numGrids, cudaMemcpyDeviceToHost);
    
        /*
        printf("\n\nafter kernel finished\n\n");
        for(uint k=0; k<numGrids; ++k)
            printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
    
        equal = true;
        for(int k=1; k<numGrids; ++k)
        {
            equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
            equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
            equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
        }
    
        if(!equal)
            printf("Results are wrong!!\n");
        else
            printf("All went well!\n");
    
        cudaFree(d_u);
        cudaFree(d_du);
        cudaFree(d_r);
    
        delete [] u_arr;
        delete [] du_arr;
        delete [] r_arr;
    
        return 0;
    }
    

    Makefile:

    CUDA            = 1
    DEFINES         = 
    
    ifeq ($(CUDA), 1)
        DEFINES     += -DCUDA
        CUDAPATH    = /usr/local/cuda-6.5
        CUDAINCPATH = -I$(CUDAPATH)/include
        CUDAARCH    = -arch=sm_20
    endif
    
    CXX             = g++
    CXXFLAGS        = -pipe -g -std=c++0x -fPIE -O0 $(DEFINES)
    VALGRIND        = valgrind
    VALGRIND_FLAGS  = -v --leak-check=yes --log-file=out.memcheck
    CUDAMEMCHECK    = cuda-memcheck
    CUDAMC_FLAGS    = --tool memcheck
    RACECHECK       = $(CUDAMEMCHECK)
    RACECHECK_FLAGS = --tool racecheck  
    INCPATH         = -I. $(CUDAINCPATH)
    LINK            = g++
    LFLAGS          = -O0
    LIBS            = 
    
    ifeq ($(CUDA), 1)
        NVCC        = $(CUDAPATH)/bin/nvcc
        LIBS        += -L$(CUDAPATH)/lib64/ 
        LIBS        += -lcuda -lcudart -lcudadevrt
        NVCCFLAGS   = -g -G -O0 --ptxas-options=-v
        NVCCFLAGS   += -lcuda -lcudart -lcudadevrt -lineinfo --machine 64 -x cu $(CUDAARCH) $(DEFINES)
    endif 
    
    all: 
        $(NVCC) $(NVCCFLAGS) $(INCPATH) -c -o $(DST_DIR)heisenbug.o $(SRC_DIR)heisenbug.cu
        $(LINK) $(LFLAGS) -o heisenbug heisenbug.o $(LIBS)
    
    clean:
        rm heisenbug.o
        rm heisenbug
    
    memrace: all
        ./heisenbug > out
        $(VALGRIND) $(VALGRIND_FLAGS) ./heisenbug > out.memcheck.log
        $(CUDAMEMCHECK) $(CUDAMC_FLAGS) ./heisenbug > out.cudamemcheck
        $(RACECHECK) $(RACECHECK_FLAGS) ./heisenbug > out.racecheck
    
    1 Answer

    Robert Crovella (accepted answer):

    Note that in the entirety of your writeup, I do not see a question being explicitly asked, therefore I am responding to:

    I am looking forward to learning what is going on here.

    You have a race condition on d_u.

    by your own statement:

    • in order to keep the blocks independent from each other, a small overlap on the grid is introduced (grid points 666, 667, 668 and 669 of each grid are read by threads from two different blocks, though only one thread writes to each of them); it is at this overlap that the problems occur

    Furthermore, if you comment out the write to d_u, according to your statement in the code, the problem disappears.

    CUDA threadblocks can execute in any order. You have at least 2 different blocks that are reading from grid points 666, 667, 668, 669. The results will be different depending on which case actually occurs:

    • both blocks read the value before any writes occur.
    • one block reads the value, then a write occurs, then the other block reads the value.

    The blocks are not independent of each other (contrary to your statement) if one block is reading a value that can be written to by another block. The order of block execution will determine the result in this case, and CUDA does not specify the order of block execution.
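
    To see that dependence on block scheduling in isolation, here is a toy sketch of my own (not derived from your code): two single-thread blocks share one word of global memory, and what the "reader" block observes depends entirely on which block runs first.

        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void orderDependent(int *data, int *seen)
        {
            if (blockIdx.x == 0)
                data[0] = 42;        // "writer" block
            else
                seen[0] = data[0];   // "reader" block: sees 0 or 42, depending on scheduling
        }

        int main()
        {
            int *d_data, *d_seen, h_seen = -1;
            cudaMalloc((void **) &d_data, sizeof(int));
            cudaMalloc((void **) &d_seen, sizeof(int));
            cudaMemset(d_data, 0, sizeof(int));
            cudaMemset(d_seen, 0, sizeof(int));

            orderDependent<<<2, 1>>>(d_data, d_seen);
            cudaMemcpy(&h_seen, d_seen, sizeof(int), cudaMemcpyDeviceToHost);
            printf("reader block saw %d\n", h_seen);   // may legitimately print 0 or 42

            cudaFree(d_data);
            cudaFree(d_seen);
            return 0;
        }

    Your kernel is in the same situation at grid points 666-669, just with two 672-thread blocks instead of two single-thread blocks.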

    Note that cuda-memcheck with the --tool racecheck option only captures race conditions related to __shared__ memory usage. Your kernel as posted uses no __shared__ memory, therefore I would not expect cuda-memcheck to report anything.

    cuda-memcheck, in order to gather its data, does influence the order of block execution, so it's not surprising that it affects the behavior.

    in-kernel printf represents a costly function call, writing to a global memory buffer, so it also affects execution behavior/patterns. And if you are printing out a large amount of data, exceeding the buffer, the effect is extremely costly (in terms of execution time) and lines of output are lost in the event of buffer overflow.
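
    If in-kernel printf is used for debugging anyway, the FIFO that backs it can be enlarged before the kernel launch so that lines are not silently dropped; a one-line sketch (the 16 MB figure is an arbitrary choice):

        // enlarge the global memory buffer that in-kernel printf writes into
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);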

    As an aside, Linux Mint is not a supported distro for CUDA, as far as I can see. However, I don't think this is relevant to your issue; I can reproduce the behavior on a supported config.