We are running Slurm 20.02 with NVML autodetection, and on some 8-GPU nodes with NVLink, Slurm allocates 4-GPU jobs in a surprising way that appears sub-optimal with respect to the NVLink topology.
On a system with eight NVIDIA A40 GPUs, four NVLink bridges, and two AMD EPYC 7302 CPUs, we have the following topology:
$ nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity
GPU0  X     NV4   SYS   SYS   SYS   SYS   SYS   SYS   12-15,44-47   3
GPU1  NV4   X     SYS   SYS   SYS   SYS   SYS   SYS   8-11,40-43    2
GPU2  SYS   SYS   X     NV4   SYS   SYS   SYS   SYS   4-7,36-39     1
GPU3  SYS   SYS   NV4   X     SYS   SYS   SYS   SYS   0-3,32-35     0
GPU4  SYS   SYS   SYS   SYS   X     NV4   SYS   SYS   28-31,60-63   7
GPU5  SYS   SYS   SYS   SYS   NV4   X     SYS   SYS   24-27,56-59   6
GPU6  SYS   SYS   SYS   SYS   SYS   SYS   X     NV4   20-23,52-55   5
GPU7  SYS   SYS   SYS   SYS   SYS   SYS   NV4   X     16-19,48-51   4
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NV# = Connection traversing a bonded set of # NVLinks
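If it helps, this is how we would dump the GRES configuration Slurm derives from this topology via NVML autodetection (run as root on the compute node; with AutoDetect=nvml the output should also include the per-GPU Links that Slurm uses for co-scheduling decisions; output omitted here):

# On the compute node, print the detected GRES configuration (including Links) and exit:
$ slurmd -G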
We see Slurm allocate 4-GPU jobs in groups such as [0,1,2,4], [1,2,3,7], or [0,4,5,6] (using the nvidia-smi GPU numbering from the table above, not the minor numbers, which on these nodes correspond to the NUMA Affinity column), i.e., only a single NVLinked pair plus two unlinked GPUs.
We were expecting groups such as [0,1,2,3] or [0,1,4,5], i.e., two NVLinked pairs.
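For reference, this is roughly how we look at the assignment (a minimal sketch; which of these variables is set depends on the Slurm version and cgroup settings, and nvidia-smi inside the job still shows all eight GPUs unless device cgroups are enforced):

# Launch a 4-GPU job step and print the GPU indices Slurm assigned plus the NVLink matrix:
$ srun --gres=gpu:4 bash -c 'env | grep -E "SLURM_(JOB|STEP)_GPUS|CUDA_VISIBLE_DEVICES"; nvidia-smi topo -m'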
Some potentially relevant specs/settings:
# NVIDIA:
Driver Version: 460.32.03
CUDA Toolkit Version: 11.1
# slurm.conf:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageTRES=gres/gpu
JobAcctGatherType=jobacct_gather/linux
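The relevant part of gres.conf is just the NVML autodetection mentioned above; a minimal sketch of it (assuming a global AutoDetect line rather than per-node overrides) would be:

# gres.conf (sketch):
AutoDetect=nvml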
Questions:
- Is this behavior expected?
- Is there a way to force Slurm to allocate multiple pairs of NVLinked GPUs, e.g., by spelling out the NVLink topology in gres.conf (see the sketch below)?
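Regarding the second question: the mechanism we are aware of is the Links field in gres.conf, which is presumably what NVML autodetection populates for us. A hand-written sketch for a node like this follows; the node name, core lists, and link counts are our assumptions from the table above, only the first two devices are shown, and the device/Links ordering follows the minor numbers, which on these nodes do not match the nvidia-smi indices:

# gres.conf (hypothetical hand-written alternative to AutoDetect=nvml):
# Links is a comma-separated list; -1 marks the device itself and position i gives the
# number of NVLinks to device i (4 here, matching NV4 in the topology matrix).
NodeName=gpunode01 Name=gpu File=/dev/nvidia0 Cores=0-3 Links=-1,4,0,0,0,0,0,0
NodeName=gpunode01 Name=gpu File=/dev/nvidia1 Cores=4-7 Links=4,-1,0,0,0,0,0,0
# ...six more lines for /dev/nvidia2 through /dev/nvidia7, following the same pattern.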