YoloV7 - Multi-GPU training constantly gives RuntimeError


I am using YoloV7 to run a training session for custom object detection. My environment is as follows:

OS: Ubuntu 22.04
Python : 3.10
Torch Version : '2.1.0+cu121'

I am using AWS EC2 - g5.2xlarge and g5.12xlarge instances for my training.

python3 train.py --batch 4 --data ~/yolo4iris/data.yaml --weights yolov7_training.pt

When I use a g5.2xlarge instance, which has 1 GPU, the training session runs without any issue and I am able to complete it. Since I have more than 30k images, I am trying to use a g5.12xlarge instance, which provides 4 GPUs.

python -m torch.distributed.run --nproc_per_node 4 train.py --batch 64 --data ~/yolo4iris/data.yaml --weights yolov7_training.pt

I am using torch.distributed.run as shown above, following the YoloV7 documentation. However, it gives me the following error.

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
YOLOR  v0.1-126-g84932d7 torch 2.1.0+cu121 CUDA:0 (NVIDIA A10G, 22546.9375MB)
                                            CUDA:1 (NVIDIA A10G, 22546.9375MB)
                                            CUDA:2 (NVIDIA A10G, 22546.9375MB)
                                            CUDA:3 (NVIDIA A10G, 22546.9375MB)

Namespace(weights='yolov7_training.pt', cfg='', data='/home/ubuntu/yolo4iris/data.yaml', hyp='data/hyp.scratch.p5.yaml', epochs=300, batch_size=64, img_size=[640, 640], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='', multi_scale=False, single_cls=False, adam=False, sync_bn=False, local_rank=-1, workers=8, project='runs/train', entity=None, name='exp', exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias='latest', freeze=[0], v5_metric=False, world_size=4, global_rank=0, save_dir='runs/train/exp25', total_batch_size=64)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Traceback (most recent call last):
  File "/home/ubuntu/yolov7/train.py", line 616, in <module>
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.2, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.0, paste_in=0.15, loss_ota=1
    train(hyp, opt, device, tb_writer)
  File "/home/ubuntu/yolov7/train.py", line 85, in train
    with torch_distributed_zero_first(rank):
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/ubuntu/yolov7/utils/torch_utils.py", line 33, in torch_distributed_zero_first
    torch.distributed.barrier()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier
    opts.device = _get_pg_default_device(group)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 593, in _get_pg_default_device
    group = group or _get_default_group()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/home/ubuntu/yolov7/train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "/home/ubuntu/yolov7/train.py", line 85, in train
    with torch_distributed_zero_first(rank):
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/ubuntu/yolov7/utils/torch_utils.py", line 33, in torch_distributed_zero_first
    torch.distributed.barrier()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier
    opts.device = _get_pg_default_device(group)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 593, in _get_pg_default_device
    group = group or _get_default_group()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Traceback (most recent call last):
  File "/home/ubuntu/yolov7/train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "/home/ubuntu/yolov7/train.py", line 85, in train
    with torch_distributed_zero_first(rank):
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/ubuntu/yolov7/utils/torch_utils.py", line 33, in torch_distributed_zero_first
    torch.distributed.barrier()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier
    opts.device = _get_pg_default_device(group)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 593, in _get_pg_default_device
    group = group or _get_default_group()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)
Traceback (most recent call last):
  File "/home/ubuntu/yolov7/train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "/home/ubuntu/yolov7/train.py", line 85, in train
    with torch_distributed_zero_first(rank):
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ubuntu/yolov7/utils/torch_utils.py", line 36, in torch_distributed_zero_first
    torch.distributed.barrier()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3685, in barrier
    opts.device = _get_pg_default_device(group)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 593, in _get_pg_default_device
    group = group or _get_default_group()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[2023-10-29 05:49:46,489] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4567 closing signal SIGTERM
[2023-10-29 05:49:46,903] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 4568) of binary: /home/ubuntu/yolo/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/run.py", line 810, in <module>
    main()
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/yolo/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-10-29_05:49:46
  host      : ip-172-31-1-246.ap-south-1.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 4569)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-10-29_05:49:46
  host      : ip-172-31-1-246.ap-south-1.compute.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 4570)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-29_05:49:46
  host      : ip-172-31-1-246.ap-south-1.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4568)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

While this may sound familiar, as there are other questions with similar errors, none of those solutions resolve the issue for me. I tried the following:

  1. Set the OMP_NUM_THREADS environment variable explicitly
  2. Changed local-rank to local_rank (see the sketch after this list)
  3. Changed my dataset
  4. Reinstalled YoloV7
  5. Reduced the batch size from 32 through various combinations down to 4
  6. Reduced the image size from 640 to 256 and lower
  7. Ran the training session with only 2000 images
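
For context, the local_rank change in item 2 amounts to something like the sketch below (an illustration, not the stock YoloV7 code). torch.distributed.launch historically passed --local_rank on the command line (newer releases spell it --local-rank), while torch.distributed.run only sets the LOCAL_RANK environment variable; the Namespace printout above shows local_rank=-1, and when it stays at -1, train.py skips init_process_group even though world_size is 4.

import argparse
import os

parser = argparse.ArgumentParser()
# Accept both spellings and fall back to the LOCAL_RANK environment variable,
# so the rank is picked up no matter which launcher started the process.
parser.add_argument('--local_rank', '--local-rank', type=int,
                    default=int(os.environ.get('LOCAL_RANK', -1)),
                    help='DDP parameter, do not modify manually')
opt = parser.parse_args()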

I tried many other iterations and variations, but nothing seems to work for me. How can I resolve these three problems:

1.  The warning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default
2.  RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. (see the environment check sketched after this list)
3.  torch.distributed.elastic.multiprocessing.errors.ChildFailedError
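
To help diagnose problem 2, a minimal check (assuming the standard environment variables that torch.distributed.run sets for every worker) is to print them near the top of train.py and confirm each rank receives a consistent MASTER_ADDR/MASTER_PORT and a non-negative LOCAL_RANK:

import os

# Sketch: show what torch.distributed.run hands to this process. If LOCAL_RANK
# is missing or ignored, train.py falls back to its single-GPU path and never
# calls init_process_group, so the later barrier() raises the RuntimeError above.
for key in ('RANK', 'LOCAL_RANK', 'WORLD_SIZE', 'MASTER_ADDR', 'MASTER_PORT'):
    print(f"{key}={os.environ.get(key, '<not set>')}")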

There are 2 answers

Answer by Stelios Koroneos

Your multi-GPU command looks like it is missing the master_port setting; that's why you are getting the error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

It should be similar to this:

python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py "<rest of config>"
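
For reference, YoloV7-style training scripts typically initialize the default process group only when they detect a valid local rank, roughly like the sketch below (an illustration, not the exact upstream code). The RuntimeError above is what appears when this block never runs before the first torch.distributed.barrier(); init_method='env://' is where MASTER_ADDR and MASTER_PORT (the --master_port in the command) are actually consumed.

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get('LOCAL_RANK', -1))
if local_rank != -1:
    # DDP mode: bind this process to its GPU and join the default process group.
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which the
    # launcher provides to each worker.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')
# Only after this does torch.distributed.barrier() have a group to wait on.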
Answer by Stelios Koroneos

Based on the CUDA version you are running (you can see it with nvidia-smi), choose the last 1.12 version of PyTorch that is available: https://pytorch.org/get-started/previous-versions/. I think YoloV7 does not work with recent (i.e. 2.x) versions of PyTorch, and 1.13.0 is also excluded if you look at requirements-gpu.txt.
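
For example, with CUDA 11.6 the matching wheels from that previous-versions page would be installed with something like the command below (the cu116 suffix is an assumption; substitute the CUDA version that nvidia-smi reports):

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116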