I am trying to allocate 2 GPUs and run 1 Python script across these 2 GPUs. The Python script requires two variables: $AMBERHOME, which is obtained by sourcing the amber.sh script, and $CUDA_VISIBLE_DEVICES, which should equal something like 0,1 for the two GPUs I have requested.
Currently, I have been experimenting with this basic script.
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=slurm_info
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --time=5:00:00
#SBATCH --partition=gpu-v100
## Prepare Run
source /usr/local/amber20/amber.sh
export CUDA_VISIBLE_DEVICES=0,1
## Perform Run
python calculations.py
When I run the script, I can see that 2 GPUs are requested.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
11111 GPU test jsmith CF 0:02 2 gpu-[1-2]
When I look at the output file ('slurm_info') I see:
cpu-bind=MASK - gpu-1, task 0 0 [10111]: mask 0x1 set
and of course information about the failed job.
Typically I run this script on my local workstation, which has 2 GPUs, and when I enter nvidia-smi at the command line I see...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:00:1E.0 Off | 0 |
| N/A 29C P0 24W / 300W | 0MiB / 16160MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:00:1E.0 Off | 0 |
| N/A 29C P0 24W / 300W | 0MiB / 16160MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
However, when I run nvidia-smi from the batch script above on the cluster, I see the following.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:00:1E.0 Off | 0 |
| N/A 29C P0 24W / 300W | 0MiB / 16160MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
This makes me think that when the Python script runs, it only sees one GPU.
You are requesting two nodes, not two GPUs. The correct syntax for requesting GPUs depends on the Slurm version and on how your cluster is set up, but you generally use
#SBATCH -G 2
to request two GPUs. Slurm usually also takes care of setting
CUDA_VISIBLE_DEVICES
for you, so there is no need to export it yourself. Try this:
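Something along these lines, as a minimal sketch: it assumes the calculation runs on a single node, and that your Slurm version accepts -G (on older installations, --gres=gpu:2 is the equivalent request).
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=slurm_info
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH -G 2
#SBATCH --time=5:00:00
#SBATCH --partition=gpu-v100
## Prepare Run
source /usr/local/amber20/amber.sh
## Sanity check: Slurm should have set CUDA_VISIBLE_DEVICES to the assigned GPUs
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi
## Perform Run
python calculations.py
With that request, the slurm_info output should show both GPUs in the nvidia-smi table, and CUDA_VISIBLE_DEVICES should list two devices (e.g. 0,1) without you exporting it yourself.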