I have launched a dask.distributed cluster with dask-mpi across a number of cores on a slurm-managed cluster. All of the processes appear to start up fine (the stdout in the slurm logfile looks normal), but when I try to connect a client from within Python using client = Client(scheduler_file='/path/to/my/scheduler.json'), I get the following timeout error:
distributed.utils - ERROR - Timed out trying to connect to 'tcp://141.142.181.102:8786' after 5 s: connect() didn't finish in time
Traceback (most recent call last):
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/distributed/comm/core.py", line 185, in connect
    quiet_exceptions=EnvironmentError)
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/tmorton/.conda/envs/my_py3/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
tornado.gen.TimeoutError: Timeout
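For completeness, the connection attempt itself is roughly the following (the scheduler-file path is illustrative, and the timeout argument is just the knob I assume controls the 5 s connect limit):

from dask.distributed import Client

# Point the client at the scheduler file that dask-mpi wrote out
client = Client(
    scheduler_file='/path/to/my/scheduler.json',
    timeout=60,  # assumed to raise the default 5 s connect timeout
)
print(client)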
These are the contents of scheduler.json after launch. I don't know whether the empty "workers" entry here is normal, or whether it signifies some problem with the setup:
{
  "type": "Scheduler",
  "id": "Scheduler-d0f65756-1b50-43a6-a044-93e4ef047ab7",
  "address": "tcp://141.142.181.102:8786",
  "services": {
    "bokeh": 8787
  },
  "workers": {}
}
I have run into the same issue on two different slurm-managed clusters. Does it look like I need to specify particular ports somewhere? If so, how do I go about figuring out which ports I need to use?
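In case it helps narrow things down, the only diagnostic I can think of is a bare TCP probe, from the node where the client runs, to the address reported in scheduler.json, along the lines of the sketch below. I'm not sure whether this is the right way to check for a blocked port on these clusters:

import socket

# Try to open a plain TCP connection to the scheduler address from scheduler.json
addr, port = '141.142.181.102', 8786
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect((addr, port))
    print('TCP connection to the scheduler port succeeded')
except OSError as exc:
    print('TCP connection failed:', exc)
finally:
    sock.close()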