cudaMalloc does not fall back to shared GPU memory on Ampere, unlike on Turing


I am using CUDA Toolkit 11.8 on Windows.
The program is compiled on a Windows laptop with an NVIDIA T1200.
I use CMake 3.15.7 with the flag:

CMAKE_CUDA_ARCH_BIN: 3.5 5.0 5.2 5.3 6.0 6.1 6.2 7.0 7.2 7.5 8.0 8.6 8.7 8.9 9.0 

Problem

When I run the code on my laptop (T1200), I can see that cudaMalloc first allocates in dedicated GPU memory and, when no more space is available there, automatically allocates in shared GPU memory.

When the same code is run on an RTX 3070, cudaMalloc also allocates in dedicated GPU memory first, but once that is full it no longer falls back to shared GPU memory and instead returns cudaErrorMemoryAllocation.

I don't understand why the behavior differs between the two GPUs.

I build the program on the laptop, then copy the executable to the machine with the RTX 3070 and run it there.

The code

Working Google Test Example

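#include <cuda_runtime.h>
#include <gtest/gtest.h>
#include <stdexcept>

// Minimal fixture, assumed here so that TEST_F(Cuda, ...) compiles on its own.
class Cuda : public ::testing::Test {};

class CudaTestMemory
{
public:
    unsigned char *cudaBuffer = nullptr;
};

TEST_F(Cuda, cudaAllocator)
{
    // Request up to 50 buffers of 1 GiB each. Nothing is freed on purpose,
    // so memory use keeps growing until cudaMalloc fails.
    CudaTestMemory *costBuffer[50];
    for (int i = 0; i < 50; i++)
    {
        costBuffer[i] = new CudaTestMemory();

        size_t allocatedMemory = 1ULL * 1024 * 1024 * 1024; // 1 GiB
        cudaError_t cudaStatus = cudaMalloc(&costBuffer[i]->cudaBuffer, allocatedMemory);

        if (cudaStatus != cudaSuccess)
        {
            // Allocation i failed: no more GPU memory could be obtained.
            costBuffer[i]->cudaBuffer = nullptr;
            throw std::runtime_error(cudaGetErrorString(cudaStatus));
        }
    }
}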

Result on NVIDIA T1200

The NVIDIA T1200 has 4 GB of dedicated GPU memory plus 32 GB of shared GPU memory (system RAM). The test fails there at iteration 34, once no more GPU memory (dedicated plus shared) is available.

Result on NVIDIA RTX 3070

The NVIDIA RTX 3070 has 8 GB of dedicated GPU memory plus 128 GB of shared GPU memory (system RAM). The test fails there at iteration 7, once the dedicated GPU memory alone is exhausted.
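
As a rough sanity check of the two results, the C++ sketch below computes an upper bound on how many 1 GiB buffers fit into each memory budget. The names GpuBudget, maxAllocations, t1200 and rtx3070 are made up for illustration, and the sizes are the nominal figures quoted in this question, so the real limits will be somewhat lower because the OS and driver reserve part of the memory.

```cpp
#include <cstddef>

// Nominal memory sizes in GiB, taken from the figures in the question.
// These are assumptions: the memory actually free for allocation is lower.
struct GpuBudget {
    std::size_t dedicatedGiB;
    std::size_t sharedGiB;
};

// Upper bound on the number of chunkGiB-sized buffers that fit in the budget.
// useShared models whether allocations may spill into shared GPU memory.
constexpr std::size_t maxAllocations(GpuBudget gpu, bool useShared,
                                     std::size_t chunkGiB = 1)
{
    return (gpu.dedicatedGiB + (useShared ? gpu.sharedGiB : 0)) / chunkGiB;
}

constexpr GpuBudget t1200{4, 32};     // laptop GPU from the question
constexpr GpuBudget rtx3070{8, 128};  // desktop GPU from the question
```

With spilling allowed, the bound for the T1200 is 36 buffers, which is in line with a failure at iteration 34. For the RTX 3070 the bound is 8 buffers without shared memory and 136 with it, so a failure at iteration 7 fits the dedicated-only case.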

1 answer

Answer by talonmies (best answer)

As described in the comments, CUDA behaves differently on Windows machines that use the WDDM driver model than on other platforms: there, the WDDM subsystem handles GPU compute memory allocation and scheduling, rather than the CUDA runtime as on other platforms.

The Windows GPU memory virtualization functionality is a property of Windows rather than CUDA, and there isn't anything that you can do in CUDA user code to change its behaviour. You can read about it here.

Unfortunately, I don't believe there is a fix for this.