Couldn't allocate an array of size 2116800 in device code

I have a class Color with three double members and another class Image that holds an array of Color. The problem is that I can't allocate a Color array of size 1960*1080 in GPU code:

#include <iostream>

// limited version of checkCudaErrors from helper_cuda.h in CUDA examples
#define checkCudaErrors(val) check_cuda((val), #val, __FILE__, __LINE__)

void check_cuda(cudaError_t result, char const* const func, const char* const file,
                int const line) {
    if (result) {
        std::cerr << "CUDA error = " << static_cast<unsigned int>(result) << " at " << file << ":"
                << line << " '" << func << "' \n";
        // Make sure we call CUDA Device Reset before exiting
        cudaDeviceReset();
        exit(-1);
    }
}

class Color {
public:
    double r, g, b;

    __host__ __device__ Color() : r(0.0), g(0.0), b(0.0) {
    }
};

class Image {
public:
    int width = -1;
    int height = -1;

    Color* frame_buffer = nullptr;

    __device__ Image(int _width, int _height) : width(_width), height(_height) {
        frame_buffer = new Color[width * height];
    }

    __device__ ~Image() {
        delete[] frame_buffer;  // memory from new[] must be released with delete[]
    }
};

__global__ void init_gpu_image(Image* image, int width,
                               int height) {
    printf("block id:  (%d, %d, %d)\n", blockIdx.x, blockIdx.y, blockIdx.z);
    printf("thread id: (%d, %d, %d)\n", threadIdx.x, threadIdx.y, threadIdx.z);
    *image = Image(width, height);
}

int main() {
    int width = 1960;
    int height = 1080;

    printf("image dimension: %d\n", width * height);
    printf("image size: %zu\n", sizeof(Color) * width * height);

    /*
    // works fine when allocating with cudaMallocManaged()
    Color* frame_buffer;
    checkCudaErrors(cudaMallocManaged((void **)&frame_buffer, sizeof(Color) * width * height));
    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaDeviceSynchronize());
    */

    Image* gpu_image;
    checkCudaErrors(cudaMallocManaged((void **)&gpu_image, sizeof(Image)));
    init_gpu_image<<<1, 1>>>(gpu_image, width, height);

    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaDeviceSynchronize());

    return 0;
}

Full output:

image dimension: 2116800
image size: 50803200
block id:  (0, 0, 0)
thread id: (0, 0, 0)
CUDA error = 700 at /home/wentao/Desktop/cuda-test/main.cu:68 'cudaDeviceSynchronize()' 

CMakeLists.txt:

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)

if (NOT CMAKE_CUDA_COMPILER)
    set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
    # required by CLion
endif ()

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

project(cuda_test LANGUAGES CUDA CXX)

add_executable(cuda_test main.cu)

target_compile_options(cuda_test PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:
        --expt-relaxed-constexpr
        >)

1 Answer

Answered by Anis Ladram

The CUDA C++ new and delete operators in device code rely on device-side malloc (see this section of the CUDA C++ Programming Guide for more information). The allocation most likely fails here because the device heap is only 8 MB by default, far smaller than the roughly 50 MB you are trying to allocate.

To address the issue, you can enlarge the device heap from the host, before launching any kernel that allocates from it:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, sizeof(Color) * width * height + (1 << 24));

which is the size of the image, plus 16MiB to account for allocator metadata and alignment.
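
As a minimal sketch of how this call fits into a program, the example below sets the limit before the first kernel launch, reads it back with cudaDeviceGetLimit, and then checks inside a kernel whether a device-side malloc of the image size succeeds. The kernel name try_device_alloc and the ok flag are hypothetical names introduced for illustration, and Color is reduced to a plain struct.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Color {
    double r, g, b;
};

// Hypothetical helper kernel: tries one allocation on the device heap
// and records whether it succeeded.
__global__ void try_device_alloc(size_t bytes, int* ok) {
    void* p = malloc(bytes);  // device-side malloc returns NULL on failure
    *ok = (p != nullptr) ? 1 : 0;
    if (p != nullptr) {
        free(p);
    }
}

int main() {
    const size_t bytes = sizeof(Color) * 1960 * 1080;

    // Enlarge the heap before any kernel that uses device malloc/new runs.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes + (1 << 24));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to see what the runtime actually granted.
    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("device heap size: %zu bytes\n", heap);

    int* ok = nullptr;
    cudaMallocManaged(&ok, sizeof(int));
    *ok = 0;

    try_device_alloc<<<1, 1>>>(bytes, ok);
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("device allocation of %zu bytes %s\n", bytes, *ok ? "succeeded" : "failed");
    cudaFree(ok);
    return 0;
}
```

Commenting out the cudaDeviceSetLimit call should make the same allocation fail against the default heap, which is a quick way to confirm the diagnosis.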

I recommend against making such a large allocation on the device heap: the heap cannot be resized once it has been used, so you end up reserving memory you cannot reclaim. Instead, I would recommend looking into cudaMallocAsync, which allocates memory from the host in stream order. See this section of the CUDA runtime API documentation for more information on the stream-ordered memory allocation APIs.
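
One possible shape for the host-side approach is sketched below, assuming CUDA 11.2 or newer (where cudaMallocAsync is available). The frame buffer is allocated from the host and passed to the kernel through a small non-owning struct. ImageView, clear_image and the CHECK macro are hypothetical names chosen for this example, not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking macro for this sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

struct Color {
    double r, g, b;
};

// Non-owning view of the image: the host allocates and frees the buffer.
struct ImageView {
    int width;
    int height;
    Color* frame_buffer;
};

// One thread per pixel sets the pixel to black.
__global__ void clear_image(ImageView image) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < image.width * image.height) {
        image.frame_buffer[i] = Color{0.0, 0.0, 0.0};
    }
}

int main() {
    const int width = 1960;
    const int height = 1080;
    const int n = width * height;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Stream-ordered allocation: usable by work enqueued later on `stream`.
    Color* frame_buffer = nullptr;
    CHECK(cudaMallocAsync(reinterpret_cast<void**>(&frame_buffer),
                          sizeof(Color) * n, stream));

    ImageView image{width, height, frame_buffer};
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    clear_image<<<blocks, threads, 0, stream>>>(image);
    CHECK(cudaGetLastError());

    // The free is also stream-ordered, so it runs after the kernel finishes.
    CHECK(cudaFreeAsync(frame_buffer, stream));
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaStreamDestroy(stream));

    printf("initialized %d pixels\n", n);
    return 0;
}
```

Because the buffer never touches the device heap, no cudaLimitMallocHeapSize adjustment is needed, and the memory goes back to the pool when cudaFreeAsync runs.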

For future reference, you can use NVIDIA's compute-sanitizer to debug these types of issues. compute-sanitizer reports an invalid write to address 0x0 inside the Image constructor, which reveals the allocation failure (malloc(...) returned nullptr):

$ compute-sanitizer --show-backtrace device ./test
========= COMPUTE-SANITIZER
image dimension: 2116800
image size: 50803200
block id:  (0, 0, 0)
thread id: (0, 0, 0)
========= Invalid __global__ write of size 8 bytes
=========     at Image::Image(int, int)+0x3b0 in /tmp/tmp.QmgDY96jb8/test.cu:33
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     and is 8,675,917,824 bytes before the nearest allocation at 0x205200000 of size 8,388,864 bytes
=========     Device Frame:init_gpu_image(Image *, int, int)+0x290 in /tmp/tmp.QmgDY96jb8/test.cu:45
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========
CUDA error = 719 at test.cu:68 'cudaDeviceSynchronize()'
========= Target application returned an error
========= ERROR SUMMARY: 2 errors