PyTorch: distribute processes across nodes and GPUs


I have access to 2 nodes, each with 2 GPUs. I want 4 processes, each with its own GPU. I use NCCL (in case this is relevant).

Here is the Slurm script I tried; the version blocks below show the different #SBATCH combinations I experimented with. It occasionally works as intended, but most of the time it creates all 4 processes on 1 node and assigns 2 processes to the same GPU. This slows the program down, causes out-of-memory errors, and makes all_gather fail.

How can I distribute the processes correctly?

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH --cpus-per-task=10

# version 1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --gpu-bind=none

# version 2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

# version 3
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2

# version 4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --gpus-per-task=1

# version 5
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=1

module load miniconda3
eval "$(conda shell.bash hook)"
conda activate gpu-env

# first node in the allocation hosts the rendezvous endpoint
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO

export NCCL_P2P_LEVEL=NVL

srun torchrun --nnodes 2 --nproc_per_node 2 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29678 mypythonscript.py
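
To check where Slurm actually places the tasks and which GPUs each one sees, a quick probe like the one below can be added to the batch script (just a sketch; SLURM_PROCID, SLURM_LOCALID and CUDA_VISIBLE_DEVICES are standard Slurm/CUDA variables, the output format is arbitrary):

# Print, for every task that srun launches, its node, task IDs and visible GPUs.
srun bash -c 'echo "host=$(hostname) procid=$SLURM_PROCID localid=$SLURM_LOCALID gpus=${CUDA_VISIBLE_DEVICES:-unset}"'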

In the Python script:

# torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each worker it spawns;
# LOCAL_RANK is used to pin each process to one GPU on its node.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

Log:

[W socket.cpp:464] [c10d] The server socket has failed to listen on [::]:29678 (errno: 98 - Address already in use).
[2024-03-31 15:46:06,691] torch.distributed.elastic.agent.server.local_elastic_agent: [INFO] log directory set to: /tmp/torchelastic_f6ldgsym/4556_xxbhwnb4
[2024-03-31 15:46:06,691] torch.distributed.elastic.agent.server.api: [INFO] [default] starting workers for entrypoint: python
[2024-03-31 15:46:06,691] torch.distributed.elastic.agent.server.api: [INFO] [default] Rendezvous'ing worker group
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:29678 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.

I am not sure whether this is relevant, because I also see these messages in the successful runs.

UPDATE: I used to follow the PyTorch tutorial on torchrun. Following this tutorial makes it work.
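
As I understand it, the key difference is to launch torchrun only once per node, so that srun does not start two launchers (and hence four workers) on the same pair of GPUs. A sketch of that layout, with the port, job name and script name as placeholders rather than the exact tutorial script:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node
#SBATCH --gres=gpu:2          # both GPUs of the node go to that launcher
#SBATCH --cpus-per-task=10

nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# srun runs torchrun once per node; each torchrun spawns 2 workers,
# and each worker picks its GPU from LOCAL_RANK (0 or 1).
srun torchrun --nnodes 2 --nproc_per_node 2 \
    --rdzv_id $RANDOM --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29678 \
    mypythonscript.py

With a single launcher per node there is also only one c10d rendezvous server trying to bind port 29678 on the head node, which would explain the "Address already in use" messages above.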
