I can't use more than 2 processes with mpiexec


I'm trying to train the diffusion model below, which is based on openai/guided-diffusion.

The README says to run image_train.py from the terminal with this command:

mpiexec -np 8 python3 ./image_train.py --datadir ./data/view_folder --savedir ./output --batch_size_train 12 --is_train True --save_interval 50000 --lr_anneal_steps 50000 --random_flip True --deterministic_train False --img_size 256

The original setup used 8 GPUs, but I only have 4, so I changed -np 8 to -np 4. However, it failed with this error:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.5
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 1 and rank 3 both on CUDA device 22000

-np 3 fails as well. Only -np 1 or -np 2 works, and when np is smaller than the number of GPUs, only np of the GPUs are used.
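One possible explanation, which I have not verified in this fork: upstream guided-diffusion picks each process's GPU in dist_util.py from the MPI rank modulo a hard-coded GPUS_PER_NODE constant. If this fork sets that constant to 2 (a guess on my part that simply matches the symptom), ranks 1 and 3 would land on the same device. The sketch below only reproduces that arithmetic in plain Python; `device_for_rank` and `find_collisions` are hypothetical helper names, not functions from the repository.

```python
from collections import defaultdict
from typing import Dict, List


def device_for_rank(rank: int, gpus_per_node: int) -> int:
    """Device index a rank would get under an assumed `rank % gpus_per_node` scheme."""
    return rank % gpus_per_node


def find_collisions(world_size: int, gpus_per_node: int) -> Dict[int, List[int]]:
    """Return {device: [ranks]} for every device that more than one rank maps to."""
    by_device = defaultdict(list)
    for rank in range(world_size):
        by_device[device_for_rank(rank, gpus_per_node)].append(rank)
    # Keep only devices shared by two or more ranks.
    return {dev: ranks for dev, ranks in by_device.items() if len(ranks) > 1}


if __name__ == "__main__":
    # world_size corresponds to the -np value passed to mpiexec.
    for np_value in (2, 3, 4):
        print(np_value, find_collisions(np_value, gpus_per_node=2))
```

Under that assumption, `find_collisions(4, 2)` returns `{0: [0, 2], 1: [1, 3]}`, which would match the "rank 1 and rank 3" message, and `find_collisions(2, 2)` returns `{}`, which would match -np 2 working. If this is the cause, the constant in the fork's dist_util.py would need to equal the number of GPUs on the machine.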

Since my model uses about 60 GB of GPU memory at batch size 1, I need to use more GPUs to increase the batch size (and performance).

This is my first time using mpiexec, so I don't understand this situation well.

Can anyone help? Thanks.
