PyTorch torchrun command can not find rendezvous endpoint, RendezvousConnectionError

1.3k views Asked by GeSol At 23 September 2023 at 18:55

I'm practicing multi-node DDP with PyTorch in a Docker container, and my program runs properly when I run

torchrun \
        --nnodes=1 \
        --node_rank=0 \
        --nproc_per_node=gpu \
        --rdzv_id=123  \
        --rdzv-backend=c10d \
        --rdzv-endpoint=localhost:10000 \
        test_code.py

However, when I run

torchrun \
        --nnodes=1 \
        --node_rank=0 \
        --nproc_per_node=gpu \
        --rdzv_id=1024  \
        --rdzv-backend=c10d \
        --rdzv-endpoint=192.168.9.225:10000 \
        07-5-pytorch-ddp-multiple-nodes.py

it hangs and then fails with the errors below

master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[E socket.cpp:860] [c10d] The client socket has timed out after 60s while trying to connect to (192.168.9.225, 10000).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
    store = TCPStore(
TimeoutError: The client socket has timed out after 60s while trying to connect to (192.168.9.225, 10000).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 223, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 65, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py", line 257, in create_handler
    handler = creator(params)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 250, in create_backend
    store = _create_tcp_store(params)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 175, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

I created my Docker container with

docker run -it --gpus=all --ipc=host --network=host --cap-add=NET_ADMIN  --name=pytorch-2.0-examples -v=pytorch-2.0-examples:/pytorch-2.0-examples pytorch/pytorch /bin/bash

A ping test works fine, and the firewall is disabled for this test.

How can I run PyTorch torchrun with an IP address that is not 127.0.0.1?

My program runs well when --rdzv-endpoint is localhost or 127.0.0.1, but not when I use one of my machine's other IP addresses (the ones starting with 192 or 172).

There is 1 answer

**GeSol** · Answer 1 · 2023-09-23T19:17:08+00:00

It ran well after I added the peer server's IP address and hostname to the hosts file (/etc/hosts on Ubuntu).
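
A possible explanation for why the /etc/hosts entry helps (an assumption, not something the answer states): with the c10d backend, each node has to decide whether the endpoint host refers to itself, in which case it hosts the store, or to another machine, in which case it connects as a client. If the machine's own hostname resolves only to a loopback-style address such as 127.0.1.1, an endpoint like 192.168.9.225 may not be recognized as local, and the node waits as a client until the 60s timeout seen in the log. The sketch below approximates that kind of check with the standard library only. The function names `resolved_addresses` and `endpoint_is_local` are hypothetical, and this is not torchrun's own code.

```python
import ipaddress
import socket


def resolved_addresses(hostname):
    """Return the set of IP addresses that `hostname` resolves to (empty on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def endpoint_is_local(endpoint_host, local_hostname=None, resolver=resolved_addresses):
    """Approximate check (hypothetical helper): does endpoint_host refer to this machine?

    An endpoint counts as local if it is "localhost", a loopback address,
    the machine's own hostname, or an address the hostname resolves to.
    """
    if endpoint_host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(endpoint_host)
    except ValueError:
        addr = None  # endpoint_host is a name, not an IP literal
    if addr is not None and addr.is_loopback:
        return True
    if local_hostname is None:
        local_hostname = socket.gethostname()
    if endpoint_host == local_hostname:
        return True
    local_addrs = resolver(local_hostname)
    if addr is not None:
        return str(addr) in local_addrs
    # Endpoint is a name: compare what both names resolve to.
    return bool(resolver(endpoint_host) & local_addrs)


if __name__ == "__main__":
    host = "192.168.9.225"  # endpoint host from the question
    name = socket.gethostname()
    print("hostname:", name)
    print("hostname resolves to:", sorted(resolved_addresses(name)))
    print("endpoint treated as local:", endpoint_is_local(host))
```

If the last line prints False on the machine that is supposed to host the rendezvous, adding a line that maps the machine's LAN IP to its hostname in /etc/hosts, as described above, should change what the hostname resolves to.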
