My goal is to set up a Docker swarm on a group of three Linux (Ubuntu) physical workstations and run a Dask cluster on it.
$ docker --version
Docker version 17.06.0-ce, build 02c1d87
I am able to init the docker swarm and add all of the machines to the swarm.
cordoba$ docker node ls
ID                            HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
j8k3hm87w1vxizfv7f1bu3nfg     box1      Ready   Active
twg112y4m5tkeyi5s5vtlgrap     box2      Ready   Active
upkr459m75au0vnq64v5k5euh *   box3      Ready   Active        Leader
I then run docker stack deploy -c docker-compose.yml dask-cluster on the Leader box.
Here is docker-compose.yml:
version: "3"
services:
  dscheduler:
    image: richardbrks/dask-cluster
    ports:
      - "8786:8786"
      - "9786:9786"
      - "8787:8787"
    command: dask-scheduler
    networks:
      - distributed
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints: [node.role == manager]
  dworker:
    image: richardbrks/dask-cluster
    command: dask-worker dscheduler:8786
    environment:
      - "affinity:container!=dworker*"
    networks:
      - distributed
    depends_on:
      - dscheduler
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure
networks:
  distributed:
And here is the Dockerfile for richardbrks/dask-cluster:
# Official python base image
FROM python:2.7
# update apt-repository
RUN apt-get update
# install only the libraries needed to run dask on a cluster (with monitoring)
RUN pip install --no-cache-dir \
    psutil \
    dask[complete]==0.15.2 \
    bokeh
When I deploy the swarm, the dworker nodes that are not on the same machine as dscheduler do not know what dscheduler is. I ssh'd into one of these nodes and looked in env, and dscheduler was not there. I also tried to ping dscheduler, and got "ping: unknown host".
I thought Docker was supposed to provide an internal DNS for service discovery, so that calling dscheduler would take me to the address of the dscheduler node.
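A quick way to test the DNS assumption from inside a dworker container is to ask the resolver directly rather than relying on ping. The following is a minimal sketch, not part of the original setup: the helper name resolve_service is hypothetical, and it assumes the script runs inside a container attached to the distributed network. It avoids Python 3 only syntax because the image is based on python:2.7.

```python
import socket


def resolve_service(name):
    """Return the sorted IPv4 addresses `name` resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # Same condition that makes ping report "unknown host".
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


# Example calls, to be run inside a dworker container:
#   resolve_service("dscheduler")        # the service's virtual IP, if DNS works
#   resolve_service("tasks.dscheduler")  # the IPs of the individual service tasks
```

An empty list for dscheduler on a remote node, but not on the manager node, would suggest that the overlay network between hosts is the problem rather than the compose file.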
Is there some setup on my computers that I am missing, or are any of my files missing something?
All of this code is also located in https://github.com/MentalMasochist/dask-swarm
There was nothing wrong with dask or docker swarm. The problem was bad router firmware. After I went back to a prior version of the router firmware, the cluster worked fine.
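For anyone who suspects a similar network-level cause, one check is whether the swarm's TCP ports are reachable between the hosts. The sketch below is an illustration under assumptions, not a record of what was done here: the function names are hypothetical, the hostnames box1 to box3 are taken from the docker node ls output above, and it is not known whether the faulty firmware actually blocked these ports. Swarm mode uses TCP 2377 for cluster management (manager nodes only), TCP and UDP 7946 for node-to-node communication, and UDP 4789 for overlay traffic; a plain TCP probe covers only the TCP ports.

```python
import socket

# TCP ports used by swarm mode. UDP 7946 and UDP 4789 are also required
# for overlay networking, but a TCP connect cannot test them.
SWARM_TCP_PORTS = (2377, 7946)


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:
        # Covers refused connections, timeouts, and failed name lookups.
        return False
    conn.close()
    return True


def check_swarm_ports(hosts, ports=SWARM_TCP_PORTS):
    """Map each (host, port) pair to whether the TCP port accepted a connection."""
    return {(host, port): tcp_port_open(host, port) for host in hosts for port in ports}


# Example call, to be run from each workstation in turn:
#   check_swarm_ports(["box1", "box2", "box3"])
# Port 2377 is expected to be closed on worker-only nodes.
```

If the TCP ports are open but service names still fail to resolve across hosts, the UDP ports are the next thing to examine, for example with a packet capture on both ends.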