CUDA - Unified memory (Pascal at least)


I would like some clarification about unified memory: how it really works and how to use it effectively.

As far as I know, we use cudaMallocManaged(&ptr, size); to allocate an array in unified memory. Since the Pascal architecture, it has been possible to set the size greater than the physical memory available on the GPU.
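For reference, here is the kind of allocation I mean (a minimal sketch; the kernel and sizes are just placeholders):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, size_t n, float factor) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const size_t n = 1 << 20;
        float *data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));   // note the address-of on the pointer
        for (size_t i = 0; i < n; ++i) data[i] = 1.0f; // touched on the host first
        scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n, 2.0f); // pages migrate to the GPU on demand
        cudaDeviceSynchronize();                       // required before the host reads again
        printf("data[0] = %f\n", data[0]);
        cudaFree(data);
        return 0;
    }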

Suppose now I have a GPU with 4 GB of RAM, a host with 32 GB of RAM, and a file of 1 TB. I want to address this 1 TB file; how do I proceed?

If I understand correctly, I can fit the file in unified memory, but how is the link between this unified array and the file established? Does it mean I have to memcpy the whole file into the pointer I allocated with cudaMallocManaged?

Finally, tell me if I'm right: if the GPU raises a page miss, the CPU sends the data it has stored in its RAM, and otherwise fetches it from disk. It's a bit simplified, but if it works like this, it means the data needs to be in the unified array.

Thank you for your help.


There are 2 answers

Robert Crovella (best answer)

My response assumes you are running CUDA 9.x or newer, on a Pascal or Volta GPU, on Linux.

You will be able to oversubscribe GPU memory up to approximately the size of host memory (i.e. whatever the host operating system allows you to allocate), less some reasonable margin typical of any memory allocation process (you should not expect to allocate every last byte of host memory, and likewise should not attempt to do so with managed memory allocations).

There is no link between unified memory and files or anything stored on disk.

Just as you probably cannot load that entire 1TB file into your 32GB of RAM, you cannot access it all at once using managed memory. Whatever amount the host operating system allows you to allocate/load is the size you will have available to the GPU.

Therefore, in order to process that 1TB file, you would probably need to come up with an algorithm that breaks it into pieces that fit in system RAM. This concept is completely independent of managed memory. Thereafter, if you want to use CUDA to access the piece of your file that is in system RAM, you can use managed memory, including oversubscription if you wish, to do so.

The exact process of breaking your file into pieces will depend on exactly what sort of processing you are doing, and has no particular dependency on CUDA.
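As an illustration, a minimal sketch of that chunked pattern might look like this (the file name, chunk size, and per-byte kernel are placeholders, and error checking is kept to a minimum):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel: real per-byte processing would go here.
    __global__ void process(char *buf, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) buf[i] ^= 0x5A;
    }

    int main() {
        const size_t chunkBytes = 1ull << 30;      // 1 GiB pieces, sized to fit in system RAM
        char *chunk = nullptr;
        cudaMallocManaged(&chunk, chunkBytes);

        FILE *f = fopen("bigfile.bin", "rb");      // hypothetical 1 TB input file
        if (!f) return 1;
        size_t got;
        while ((got = fread(chunk, 1, chunkBytes, f)) > 0) {
            process<<<(unsigned)((got + 255) / 256), 256>>>(chunk, got);
            cudaDeviceSynchronize();               // pages migrate back when the host touches them
            // ... write results out here, then reuse the same buffer for the next piece
        }
        fclose(f);
        cudaFree(chunk);
        return 0;
    }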

Florent DUGUET

It is possible to oversubscribe CPU memory on some systems. Using a Power 9 system with a V100 connected over NVLink, you can use the operating system's Address Translation Service (ATS), as expressed here.

Doing so, it is possible to use 1 TB of data from the GPU, even with a much smaller amount of RAM on the host system. The operations to be done are the following (a sketch follows the list):

  1. Create a file to back the 1 TB of data; all you need is a file descriptor that can be mmapped.
  2. Use mmap to map the entire file into the virtual address space (the limit is 49 bits on the system tested, i.e. 512 TB).
  3. Pass that pointer to your kernel call.
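A minimal sketch of those three steps might look like the following (the file name, sizes, and kernel are placeholders; this is only expected to work on an ATS-capable system such as the Power 9 + V100 configuration described above):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cuda_runtime.h>

    // Placeholder kernel: touches one byte per stride so the OS pages the file in on demand.
    __global__ void touch(char *data, size_t n, size_t stride) {
        size_t i = (blockIdx.x * (size_t)blockDim.x + threadIdx.x) * stride;
        if (i < n) data[i] += 1;
    }

    int main() {
        const size_t bytes = 1ull << 40;                       // 1 TB of backing data
        int fd = open("backing.dat", O_RDWR | O_CREAT, 0644);  // step 1: a mappable file descriptor
        if (fd < 0) return 1;
        if (ftruncate(fd, (off_t)bytes) != 0) return 1;        // sparse file, grows on demand
        char *data = (char *)mmap(nullptr, bytes,              // step 2: map the whole file
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) return 1;
        touch<<<1024, 256>>>(data, bytes, 1ull << 20);         // step 3: raw pointer straight to the kernel
        cudaDeviceSynchronize();
        munmap(data, bytes);
        close(fd);
        return 0;
    }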

What happens is that the operating system's paging mechanism pages chunks of your file in and out on demand, and the GPU relies on ATS for this operation.

There is no report, and no successful test, of such an exercise on x86_64, on previous GPU generations, or on PCI-Express-connected systems.