Timeout error when trying to connect dask.distributed client on slurm-managed cluster


I have launched a dask.distributed cluster via Slurm (using dask-mpi) across a number of cores on a Slurm-managed cluster. All the processes appear to start up fine (the stdout in the Slurm logfile looks normal), but when I try to connect a client from within Python using client = Client(scheduler_file='/path/to/my/scheduler.json'), I get the following timeout error:

distributed.utils - ERROR - Timed out trying to connect to 'tcp://141.142.181.102:8786' after 5 s: connect() didn't finish in time
Traceback (most recent call last):
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/distributed/comm/core.py", line 185, in connect
    quiet_exceptions=EnvironmentError)
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.gen.TimeoutError: Timeout

These are the contents of scheduler.json after launch. I don't know whether the empty "workers" entry is normal, or whether it signals a problem with my setup:

{
  "type": "Scheduler",
  "id": "Scheduler-d0f65756-1b50-43a6-a044-93e4ef047ab7",
  "address": "tcp://141.142.181.102:8786",
  "services": {
    "bokeh": 8787
  },
  "workers": {}
}
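For reference, here is a minimal stdlib sketch that parses the scheduler file contents above (pasted inline as a string for illustration) and pulls out the address the client will try to reach, along with whether any workers have registered:

```python
import json

# The scheduler.json contents from above, embedded as a string for illustration;
# in practice you would json.load() the actual file.
scheduler_json = """
{
  "type": "Scheduler",
  "id": "Scheduler-d0f65756-1b50-43a6-a044-93e4ef047ab7",
  "address": "tcp://141.142.181.102:8786",
  "services": {
    "bokeh": 8787
  },
  "workers": {}
}
"""

info = json.loads(scheduler_json)
address = info["address"]          # the endpoint the client connects to
has_workers = bool(info["workers"])  # False: no workers listed in the file

print("scheduler address:", address)
print("workers registered:", has_workers)
```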

I have hit the same issue on two different Slurm-managed clusters. Does it look like I need to specify particular ports? If so, how do I go about figuring out which ports to use?
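To help narrow down whether this is a firewall/routing problem, one can check from the client node whether the scheduler's TCP port is reachable at all. A minimal stdlib sketch (the host and port here are taken from the traceback above):

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage, with the scheduler endpoint from the traceback:
# port_open("141.142.181.102", 8786)
```

If this returns False from the node where the client runs but True from the node where the scheduler runs, the problem is network reachability (firewall or interface binding) rather than dask itself.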
