Is it possible to execute multiple instances of a CUDA program on a multi-GPU machine?


Background:

I have written a CUDA program that performs processing on a sequence of symbols. The program processes all sequences of symbols in parallel, with the stipulation that all sequences are of the same length. I sort my data into groups, each consisting entirely of sequences of the same length, and the program processes one group at a time.

Question:

I am running my code on a Linux machine with 4 GPUs and would like to utilize all 4 of them by running 4 instances of my program (1 per GPU). Is it possible to have the program select a GPU that isn't already in use by another CUDA application? I don't want to hardcode anything that would cause problems down the road when the program is run on different hardware with more or fewer GPUs.


There are 2 answers

Robert Crovella (Best Answer)

The environment variable CUDA_VISIBLE_DEVICES is your friend.

I assume you have as many terminals open as you have GPUs. Let's say your application is called myexe.

Then in one terminal, you could do:

CUDA_VISIBLE_DEVICES="0" ./myexe

In the next terminal:

CUDA_VISIBLE_DEVICES="1" ./myexe

and so on.

Then the first instance will run on the first GPU enumerated by CUDA. The second instance will run on the second GPU (only), and so on.

Assuming bash, and for a given terminal session, you can make this "permanent" by exporting the variable:

export CUDA_VISIBLE_DEVICES="2"

Thereafter, all CUDA applications run in that session will observe only the third enumerated GPU (enumeration starts at 0), and they will observe that GPU as if it were device 0 in their session.

This means you don't have to make any changes to your application for this method, assuming your app uses the default GPU or GPU 0.

You can also extend this to make multiple GPUs available, for example:

export CUDA_VISIBLE_DEVICES="2,4"

This means the GPUs that would ordinarily enumerate as 2 and 4 are now the only GPUs "visible" in that session, and they enumerate as 0 and 1.
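To see the effect from inside the application, here is a minimal sketch (my illustration, not part of the original application) that simply reports what the process can see; launched with CUDA_VISIBLE_DEVICES="2,4" it reports 2 devices, and device 0 is the GPU that would normally enumerate as 2:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Visible devices: %d\n", count);      // 2 when CUDA_VISIBLE_DEVICES="2,4"
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s\n", d, prop.name); // d is the renumbered index
    }
    return 0;
}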

In my opinion the above approach is the easiest. Selecting a GPU that "isn't in use" is problematic because:

  1. We need a definition of "in use".
  2. A GPU that was in use at a particular instant may not be in use immediately afterwards.
  3. Most importantly, a GPU that is not "in use" could become "in use" asynchronously, meaning you are exposed to race conditions.

So the best advice (IMO) is to manage the GPUs explicitly. Otherwise you need some form of job scheduler (outside the scope of this question, IMO) to be able to query unused GPUs and "reserve" one before another app tries to do so, in an orderly fashion.

Flamefire

There is a better (more automatic) way, which we use in PIConGPU, which runs on huge (and varied) clusters. See the implementation here: https://github.com/ComputationalRadiationPhysics/picongpu/blob/909b55ee24a7dcfae8824a22b25c5aef6bd098de/src/libPMacc/include/Environment.hpp#L169

Basically: call cudaGetDeviceCount to get the number of GPUs, iterate over them, call cudaSetDevice to make each one the current device in turn, and check whether that worked. Because of a bug in CUDA where cudaSetDevice succeeded but all later calls failed since the device was actually in use, the check should involve test-creating a stream. Note: you may need to put the GPUs in exclusive compute mode so that each GPU can only be used by one process. If a single "batch" does not contain enough data, you may want the opposite: multiple processes submitting work to one GPU. So tune this according to your needs.
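A minimal sketch of that idea (my illustration, not the actual PIConGPU code, with error handling trimmed; the stream test is the check described above):

#include <cstdio>
#include <cuda_runtime.h>

// Try each device in turn and keep the first one that is actually usable.
// With the GPUs in exclusive-process compute mode, a device already owned
// by another process fails the stream test and we move on.
int pickFreeDevice()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;

    for (int d = 0; d < count; ++d) {
        if (cudaSetDevice(d) != cudaSuccess)
            continue;

        // cudaSetDevice may "succeed" even if the device is busy, so force
        // real context creation by test-creating (and destroying) a stream.
        cudaStream_t s;
        if (cudaStreamCreate(&s) == cudaSuccess) {
            cudaStreamDestroy(s);
            return d;                // this device works, use it
        }
        cudaGetLastError();          // clear the error before trying the next device
    }
    return -1;                       // no usable device found
}

int main()
{
    int dev = pickFreeDevice();
    if (dev < 0) {
        printf("No free GPU available\n");
        return 1;
    }
    printf("Using GPU %d\n", dev);
    // ... run the normal processing on this device ...
    return 0;
}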

Another idea: start an MPI application with as many processes per node as there are GPUs, and use the local rank as the device number. This would also help in applications like yours that have different datasets to distribute; for example, MPI rank 0 can process the length1 data and MPI rank 1 the length2 data, and so on.
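A rough sketch of the MPI variant (again my illustration; it assumes an MPI-3 implementation for the shared-memory communicator split used to obtain the local rank):

#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Determine the rank local to this node: all processes on the same
    // node end up in the same MPI_COMM_TYPE_SHARED communicator.
    MPI_Comm nodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodeComm);
    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);

    // One process per GPU: use the local rank as the device number.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    cudaSetDevice(localRank % deviceCount);

    printf("local rank %d -> GPU %d\n", localRank, localRank % deviceCount);
    // Each rank can now process its own group of sequences,
    // e.g. rank 0 the length1 data, rank 1 the length2 data, ...

    MPI_Comm_free(&nodeComm);
    MPI_Finalize();
    return 0;
}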