How to runtime detect when CUDA-aware MPI will transmit through RAM?


I wish to determine, at runtime, whether CUDA-aware MPI is about to sendrecv data directly between the VRAM of my GPUs, or whether it will silently fall back to routing the data through RAM. I wish to avoid the latter scenario, because I can perform that routing faster myself, and avoid out-of-memory issues. I must perform this check at runtime because it is possible (e.g. in heterogeneous settings) that some inter-MPI-process exchanges will be VRAM to VRAM, while others will fall back to RAM routing. I need to perform this check in a robust way for integration into a library.

This question uses a lot of abbreviations and sloppy phrasing to remain concise, which I summarise in a glossary at the bottom.

Context

My distributed C++ application involves swapping significant amounts of data (e.g. 64 GiB) between the VRAM of (very beefy) CUDA GPUs. A user can compile with "regular" MPI or with CUDA-aware MPI, and the communication logic resembles:

function swapGPUArrays():

    if MPI is CUDA-aware:
        exchange VRAM pointers directly
    else:
        cudaMemcpy VRAM to RAM
        exchange RAM pointers
        cudaMemcpy RAM to VRAM

function exchangeArrays():

    partition arrays into maximum-sized MPI messages (about 16 GiB)
    asynchronously send/recv each message
    wait for all asynchronous sends/recvs to finish

The code uses this check to determine whether the MPI compiler is CUDA-aware (and can ergo directly sendrecv CUDA device memory) and otherwise falls back to copying the device memory to permanent RAM arrays, which are then exchanged. Note that because the exchanged memory is too large for a single MPI message, it is divided into several messages; these are exchanged asynchronously (so their transmission can occur simultaneously), as per this work.
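
For reference, the check I currently perform resembles the sketch below. It assumes Open MPI's mpi-ext.h extension header, which provides the MPIX_CUDA_AWARE_SUPPORT macro and the MPIX_Query_cuda_support() runtime query; other implementations need their own guards, and the wrapper name isMpiCudaAware() is my own.

    // Sketch of the CUDA-awareness check, assuming Open MPI's extension header.
    // MPIX_CUDA_AWARE_SUPPORT is a compile-time flag; MPIX_Query_cuda_support()
    // confirms at runtime that the loaded library was built with CUDA support.
    #include <mpi.h>
    #if defined(OPEN_MPI)
    #include <mpi-ext.h>
    #endif

    bool isMpiCudaAware() {
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        return 1 == MPIX_Query_cuda_support();
    #else
        return false; // compiled without CUDA support, or cannot tell; assume not
    #endif
    }

Note that neither the macro nor the runtime query says anything about whether a particular transfer will actually avoid RAM, which is exactly the gap this question is about.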

This code works great in the following scenarios:

  • a non-CUDA-aware MPI compiler is used; the memory is exchanged through RAM
  • a UCX-enabled CUDA-aware MPI compiler is used; the VRAM pointers are passed directly to MPI, and behind the scenes the exchange uses optimised methods (e.g. peer-to-peer direct inter-GPU communication, when permitted by things like NVLink).

Problem

Consider the scenario where a user compiles this code with CUDA-aware MPI, but not all of their GPUs are directly connected to the network/interconnect. This means that at runtime, the calls to the CUDA-aware MPI's sendrecv will sometimes secretly route the messages through RAM, in a similar spirit to my code above. This runs correctly when there is sufficient free RAM (for the temporary RAM buffers), but even then it runs slower than my manual process:

  • In my manual copy, I cudaMemcpy the entirety of the data from VRAM to RAM in one call, and only then split the data into separate inter-RAM messages.
  • The CUDA-aware MPI instead performs a VRAM-to-RAM copy for each message, because execution has already reached function exchangeArrays() and divided the payload into messages by the time MPI opts to copy them to RAM.

This means that letting CUDA-aware MPI "take the wheel" results in slower staging between VRAM and RAM, by virtue of it being split into many smaller copies. In my testing, this is about 2x slower.
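
To make the comparison concrete, here is a minimal sketch of the manual staging path described above. The buffer names, the chunk size, and the pairwise-exchange structure are illustrative assumptions; the host buffers are persistent arrays allocated elsewhere.

    // Sketch of the manual fallback: one bulk VRAM-to-RAM copy, a chunked
    // asynchronous host-to-host exchange, then one bulk RAM-to-VRAM copy.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void swapViaHost(void* devBuf, char* hostSendBuf, char* hostRecvBuf,
                     size_t numBytes, int pairRank, MPI_Comm comm) {

        // single device-to-host copy of the whole payload
        cudaMemcpy(hostSendBuf, devBuf, numBytes, cudaMemcpyDeviceToHost);

        // partition into messages that fit MPI's int-typed counts (1 GiB here)
        const size_t maxMsgBytes = size_t(1) << 30;
        std::vector<MPI_Request> requests;
        for (size_t offset = 0; offset < numBytes; offset += maxMsgBytes) {
            int count = (int) std::min(maxMsgBytes, numBytes - offset);
            MPI_Request sendReq, recvReq;
            MPI_Isend(hostSendBuf + offset, count, MPI_BYTE, pairRank, 0, comm, &sendReq);
            MPI_Irecv(hostRecvBuf + offset, count, MPI_BYTE, pairRank, 0, comm, &recvReq);
            requests.push_back(sendReq);
            requests.push_back(recvReq);
        }
        MPI_Waitall((int) requests.size(), requests.data(), MPI_STATUSES_IGNORE);

        // single host-to-device copy of the received payload
        cudaMemcpy(devBuf, hostRecvBuf, numBytes, cudaMemcpyHostToDevice);
    }

In the CUDA-aware path, the same chunked loop simply passes the device pointers to MPI_Isend()/MPI_Irecv() directly, with no cudaMemcpy calls.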

However, when there is not sufficient RAM for the CUDA-aware MPI's temporary communication buffers, the application will crash. In contrast, my manual copying uses persistent RAM arrays (needed elsewhere in the application) and allocates no temporaries, so it avoids the crash. There does not appear to be a way to instruct CUDA-aware MPI to fall back to routing through RAM via an existing array.

Sought solution

If I knew in advance that the CUDA-aware MPI was going to route through RAM anyway, I could opt to do it myself through a single cudaMemcpy call. So I seek a function like

isMpiGoingToRouteThroughRAM()

which I could then query to change my original if statement to:

if (MPI is CUDA-aware) and (not isMpiGoingToRouteThroughRAM()):
    ...

Necessity of a runtime check

I must perform this check at runtime, rather than compile time, to support heterogeneous platforms like hybrid tightly+loosely-coupled multi-GPU machines.

For example, imagine a user is running my application across two Ethernet-connected multi-GPU machines. Within each machine there are 8 tightly-coupled GPUs, e.g. connected by NVLink, which can all make use of GPUDirect for optimised inter-GPU intra-machine communication. A CUDA-aware MPI with UCX will automatically do this for me, via my VRAM-to-VRAM sendrecv.

But consider the Ethernet connection between the two machines, to which the GPUs do not have direct access. Communication between GPUs in distinct machines must ergo transmit through RAM. UCX will automatically fall back to using (e.g.) TCP and route the messages through RAM, allocating temporary RAM buffers. This allocation may fail due to insufficient memory, and is otherwise slow. So in that instance, I must detect this scenario and do the VRAM-to-RAM copy myself using the persistent, existing RAM buffer.
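
One partial heuristic I have considered is to classify each pairwise exchange as intra-node or inter-node at runtime using standard MPI-3 facilities, as in the sketch below (hypothetical helper; all ranks must call it collectively). It is not the sought solution: sharing a node does not guarantee that MPI will actually use GPUDirect, and it says nothing about what the library will really do with the message.

    // Sketch: identify which ranks share a physical node via MPI_COMM_TYPE_SHARED,
    // so each pairwise exchange can at least be classified as intra- or inter-node.
    // This is only a heuristic for whether RAM routing is likely.
    #include <mpi.h>
    #include <vector>

    bool isSameNode(int pairRank, MPI_Comm comm) {
        // communicator containing only the ranks on this node
        MPI_Comm nodeComm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodeComm);

        // label each node by the lowest global rank it contains
        int myRank, nodeId;
        MPI_Comm_rank(comm, &myRank);
        MPI_Allreduce(&myRank, &nodeId, 1, MPI_INT, MPI_MIN, nodeComm);

        // share every rank's node label
        int numRanks;
        MPI_Comm_size(comm, &numRanks);
        std::vector<int> nodeIds(numRanks);
        MPI_Allgather(&nodeId, 1, MPI_INT, nodeIds.data(), 1, MPI_INT, comm);

        MPI_Comm_free(&nodeComm);
        return nodeIds[pairRank] == nodeIds[myRank];
    }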

Necessity of CUDA-aware MPI

A complicated but potentially viable solution is to not use a CUDA-aware MPI at all. Instead, I could manually detect when peer-to-peer GPU communication is possible with functions like cudaDeviceCanAccessPeer() (or when GPUs are directly connected to an interconnect), and otherwise manually fall back to the non-CUDA-aware MPI path (i.e. my VRAM-to-RAM routing).

Alas, this is not possible. My application is necessarily distributed with MPI, and the CUDA facilities for probing connectedness and exchanging memory between peers are not MPI-friendly (in my understanding). Even attempting to sidestep MPI for these exchanges would require that I explicitly manage the inter-GPU communication buffers, which is a significant burden in my setting.
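
For clarity, this is the kind of CUDA-side probe I mean. Both device indices must belong to the calling process, so it says nothing about GPUs driven by other MPI ranks, which is the crux of the problem.

    // Sketch of the peer-access probe; it only reports whether two devices visible
    // to *this* process can access each other's memory.
    #include <cuda_runtime.h>

    bool canAccessPeerDirectly(int device, int peerDevice) {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, device, peerDevice);
        return canAccess == 1;
    }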

Other painful considerations

  • I wish to support most (if not all) major CUDA-aware MPI implementations (e.g. Open MPI, MPICH, Intel MPI, IBM Spectrum MPI, ROCm-aware MPI), to keep my library's compiler compatibility wide. Requiring recent versions is fine. I am prepared to make a pre-processor nightmare to facilitate this.

Glossary

  • buffer: an array dedicated to receiving MPI messages, or data from other processes.
  • CUDA: both NVIDIA's GPU programming model, and that supported by other vendors like AMD (through ROCm/HIPCC)
  • CUDA-aware MPI: An MPI compiler which supports message passing directly between VRAM. For instance, MPI_Isend() gracefully accepts a pointer to an array in device memory. At runtime, the messages might be secretly passed via GPUDirect, or still routed through RAM, etc.
  • GPUDirect: a technology/method for interconnected GPUs to exchange memory directly, by writing to one another's buffers through a highspeed interconnect such as NVLink. Such a facility can be automatically leveraged (using MPI_ISend() etc) by a CUDA-aware MPI compiler with UCX.
  • loosely-coupled: a multi-GPU environment where GPUs do not have direct interconnects (i.e. does not use GPUDirect). Exchanging memory between GPUs requires routing through RAM.
  • MPI: The message-passing standard. I wish to use standard facilities implemented by as many major compilers as possible, such as Open MPI, MPICH, ROCm-aware MPI, Intel MPI, IBM Spectrum MPI, etc.
  • NVLink: a GPU interconnect which enables GPUDirect.
  • RAM: memory directly accessible by the CPU, which is not directly accessible by CUDA kernels. All MPI configurations can pass messages to/from RAM, but only some can exchange between VRAM.
  • routing: the copying of an array between RAM and VRAM so that it can be processed by a CPU or GPU.
  • tightly-coupled: a multi-GPU environment where GPUs are directly interconnected (and so can make use of GPUDirect). Passing messages between them with a CUDA-aware MPI can utilise GPUDirect and avoid routing through RAM.
  • UCX: a communication framework used by MPI implementations, which enables CUDA-aware MPI compilers to automatically translate message passes into GPUDirect exchanges (among other things).
  • VRAM: GPU device memory, directly accessible only by CUDA kernels.

Diagrams

Here are some hideous diagrams to clarify the configurations I am considering.

Loosely-connected multi-GPU:

GPUs do not share a highspeed interconnect. All communication between GPUs must be routed through RAM. This is typical in university HPC clusters.

[diagram]

Tightly-connected multi-GPU:

GPUs share a highspeed interconnect such as NVLink (pictured in red). They can ergo make use of GPUDirect to communicate, as may be automatically invoked by a CUDA-aware MPI with UCX when message passing.

[diagram]

Hybrid loosely+tightly-coupled multi-GPU:

Multiple machines are connected by a (relatively) slow interconnect. Within each machine, multiple GPUs are connected by a fast interconnect (e.g. NVLink). Inter-GPU communication within a machine can make use of GPUDirect, but between machines requires routing through RAM.

[diagram]
