My code is showing
CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
 0# my_func(signed char const*, unsigned char const*, int*, int*, int, int) in libthing.so
How can I debug this?
 On
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                Here's a comprehensive guide to discovering what will probably be a stupid mistake. The first two steps are somewhat superfluous because we'll ultimately use a debugger, but they're very good ideas for helping isolate a variety of problems.
My output looked like this
CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
where Line 489 is the first time I checked a CUDA API call for errors. This has almost nothing to do with where the error happened.
Since we can also see CUDA errors from the GPU if we check with them, we need to make sure our code checks for them any time we interact with the GPU.
To do so, I use the following header file to define error-checking functions:
#pragma once
#include <boost/stacktrace.hpp>
#include <iostream>
#include <stdexcept>
#define STRINGIZE_DETAIL(x) #x
#define STRINGIZE(x) STRINGIZE_DETAIL(x)
#define CUDA_CHECK(call)                                                                                               \
  do {                                                                                                                 \
    if ((call) != cudaSuccess) {                                                                                       \
      const cudaError_t err = cudaGetLastError();                                                                      \
      std::cerr << "CUDA error calling \"" #call "\", code is " << err << " " << cudaGetErrorString(err) << " on "     \
                << __LINE__ << "\n"                                                                                    \
                << boost::stacktrace::stacktrace() << std::endl;                                                       \
      throw std::runtime_error("Problem.");                                                                            \
    }                                                                                                                  \
  } while (0)
#define CUDA_KERNEL_LAUNCH_CHECK()                                                                                     \
  do {                                                                                                                 \
    const auto cuda_err = cudaGetLastError();                                                                          \
    if (cuda_err != cudaSuccess) {                                                                                     \
      throw std::runtime_error(std::string("CUDA kernel launch failed! ") + cudaGetErrorString(cuda_err));             \
    }                                                                                                                  \
  } while (0)
If your code looked like this beforehand:
cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice);
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);
after adding error checking it will look like
CUDA_CHECK(cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice));
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);
CUDA_KERNEL_LAUNCH_CHECK();
Note that you want to have a CUDA_KERNEL_LAUNCH_CHECK() immediately following every kernel launch.
Now that your code is checking for errors you'll see problems closer to where they happened.
Run your program with
CUDA_LAUNCH_BLOCKING=1 ./my_program.exe
CUDA_LAUNCH_BLOCKING will cause each kernel to finish running before moving on to the next line. Since every kernel launch is followed by CUDA_KERNEL_LAUNCH_CHECK() now this will tell you exactly which kernel caused the problem.
Compile your code like
nvcc -g G my.cu source.cu files.cu
This ensures that you'll have access to the CUDA source code inside the debugger and be able to step through it.
Run your code like so.
cuda-gdb --silent --ex run --args ./my_program.exe
whereas running the program without the debugger gave an address in the C++ host source:
CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
running the program in cuda-gdb gives:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x10000320778 (cuda.cu:414)
Thread 1 "benchmark.exe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 15, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x000001000031f450 in my_kernel<<<(1,1,1),(128,1,1)>>> (rec=0x7fffc7c00000 '\377' <repeats 200 times>, arg1=0x7fffce800000, arg2=0x7fffc7de8a00, 
    arg3=0x555555a7a9a0, arg4=0x7fffc7cf4400 "", arg5=1000, arg6=1000) at my_source_file.cu:382
382       const auto ci = order[i];
The debugger has identified that Line 414 is the problem - and it is!
That lines looks like this
array[i] = val;
we suspect that the assignment is causing a problem because that involves an address. But why?
Notes that all of the arguments have pointers beginning with 0x7fffc except for argument 5 which is 0x555555a7a9a0. This isn't hexadecimal garbage - it's actually the source of the problem. CUDA hasn't allocated a single block of memory at 0x555555a7a9a0 far distant from all the other addresses. Instead, a host pointer has been passed to the device and that is causing the failure. This is why the values of the pointers are so different.
Note that although the debugger appears to have stopped at
382       const auto ci = order[i];
this line has nothing to do with the problem. I'm not sure why the compiler stopped here.
While using
cuda-gdbto debug kernels is generally a good skill to develop (See Richard's answer), there is an easier, less interactive way of debugging this kind of memory access error. Nvidia has been shipping their memory checking tool with the CUDA Toolkit for a long time, first ascuda-memcheck(not available in CUDA 12 anymore) and nowadays as part of thecompute-sanitizertool (available since CUDA 11).Both tools are used by passing them the name of the program to check which will then instantly be run without further user input required which makes them usable for e.g. automated CI jobs making sure that there are no correctness regressions. Different checkers can be applied, but
memcheckis the default checker and needs no further command line arguments to be chosen.If you compile a debug build (
-Gfor device code) or just compile with-lineinfo, these tools are not only able to tell you which thread in which kernel does an illegal access of which kind and size, but also pinpoint in which line of code that access is made.The ComputeSanitizer documentation also gives an example of how the output looks like:
So
memcheck_demo.cu:6points to line 6 of the demo source code.Further demonstrations can be found in the NVIDIA/compute-sanitizer-samples repository.