I have a Docker swarm of two nodes: a manager node (an AWS instance) and a worker node (a multi-GPU rig on a desk next to me), both running Ubuntu 18.04 and Docker.io 19.03.6, build 369ce74a3c. On the worker node I set up the nvidia-docker runtime and tested it (it works). On the manager node I set up an overlay network, and now I'm trying to start a service with GPU access and join it to my overlay network, but no luck: the service doesn't start, and the task is rejected with "assigned node no longer meets constraints".
How I start the service:
docker service create --name=hw --constraint=node.id==xyriecy63n8995enp2mro0nvx --network=d9gqsljvmpy7 --generic-resource "gpu=1" busybox:latest sh -c "while true; do echo Hello; sleep 2; done"
And what status it has:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
ur0uut7xq8qyjafejwt3xlbv4 hw.1 busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a node-4 Ready Rejected less than a second ago "assigned node no longer meets constraints"
w83690e7dzcc56ahysp8s5xi9 \_ hw.1 busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a node-4 Shutdown Rejected less than a second ago "node is missing network attachments, ip addresses may be exhausted"
Task details:
docker inspect ur0uut7xq8qyjafejwt3xlbv4
[
{
"ID": "ur0uut7xq8qyjafejwt3xlbv4",
"Version": {
"Index": 156466
},
"CreatedAt": "2020-10-13T06:53:54.822993602Z",
"UpdatedAt": "2020-10-13T06:54:00.063967596Z",
"Labels": {},
"Spec": {
"ContainerSpec": {
"Image": "busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a",
"Args": [
"sh",
"-c",
"while true; do echo Hello; sleep 2; done"
],
"Init": false,
"DNSConfig": {},
"Isolation": "default"
},
"Resources": {
"Limits": {},
"Reservations": {
"GenericResources": [
{
"DiscreteResourceSpec": {
"Kind": "gpu",
"Value": 1
}
}
]
}
},
"Placement": {
"Constraints": [
"node.id==xyriecy63n8995enp2mro0nvx"
],
"Platforms": [
{
"Architecture": "amd64",
"OS": "linux"
},
{
"OS": "linux"
},
{
"OS": "linux"
},
{
"OS": "linux"
},
{
"Architecture": "arm64",
"OS": "linux"
},
{
"Architecture": "386",
"OS": "linux"
},
{
"Architecture": "mips64le",
"OS": "linux"
},
{
"Architecture": "ppc64le",
"OS": "linux"
},
{
"Architecture": "s390x",
"OS": "linux"
}
]
},
"Networks": [
{
"Target": "d9gqsljvmpy7wjrxa5q09bgtb"
}
],
"ForceUpdate": 0
},
"ServiceID": "mef68axo6ztmu7ojkiwcxxj0a",
"Slot": 1,
"NodeID": "xyriecy63n8995enp2mro0nvx",
"Status": {
"Timestamp": "2020-10-13T06:53:59.979035656Z",
"State": "rejected",
"Message": "preparing",
"Err": "node is missing network attachments, ip addresses may be exhausted",
"ContainerStatus": {
"ContainerID": "",
"PID": 0,
"ExitCode": 0
},
"PortStatus": {}
},
"DesiredState": "shutdown",
"NetworksAttachments": [
{
"Network": {
"ID": "d9gqsljvmpy7wjrxa5q09bgtb",
"Version": {
"Index": 32157
},
"CreatedAt": "2020-10-12T13:39:55.061260869Z",
"UpdatedAt": "2020-10-12T13:39:55.062498427Z",
"Spec": {
"Name": "testnet",
"Labels": {},
"DriverConfiguration": {
"Name": "overlay"
},
"Attachable": true,
"IPAMOptions": {
"Driver": {
"Name": "default"
},
"Configs": [
{
"Subnet": "172.25.0.0/16",
"Gateway": "172.25.0.1"
}
]
},
"Scope": "swarm"
},
"DriverState": {
"Name": "overlay",
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4097"
}
},
"IPAMOptions": {
"Driver": {
"Name": "default"
},
"Configs": [
{
"Subnet": "172.25.0.0/16",
"Gateway": "172.25.0.1"
}
]
}
},
"Addresses": [
"172.25.96.221/16"
]
}
],
"GenericResources": [
{
"NamedResourceSpec": {
"Kind": "gpu",
"Value": "GPU-50fd60c4"
}
}
]
}
]
My overlay network:
docker inspect d9gqsljvmpy7
[
{
"Name": "testnet",
"Id": "d9gqsljvmpy7wjrxa5q09bgtb",
"Created": "2020-10-12T13:39:55.061260869Z",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.25.0.0/16",
"Gateway": "172.25.0.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": null,
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4097"
},
"Labels": null
}
]
The service starts normally without either --network or --generic-resource. Starting it without --network and attaching the network after start also doesn't work.
I enabled debug logs on both nodes but didn't see anything suspicious other than the same error message:
Oct 12 13:40:45 node-4 dockerd[1166]: time="2020-10-12T13:40:45.975574449Z" level=error msg="fatal task error" error="node is missing network attachments, ip addresses may be exhausted" module=node/agent/taskmanager node.id=xyriecy63n8995enp2mro0nvx service.id=mef68axo6ztmu7ojkiwcxxj0a task.id=twcbj9emeopm2qfq0i7lwftbe
I also tested for network exhaustion with docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker/ip-util-check and, unsurprisingly, it finds nothing:
Overlay IP Utilization Report
----
Network testnet/d9gqsljvmpy7 has an IP address capacity of 65533 and uses 0 addresses spanning over 0 nodes
Network OK: network will have 49149 available IPs before passing the 75% subnet use
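Since the IP report is clean, another check that may be worth running is to see which generic resources the worker actually advertises to the swarm (the rejected task above was assigned the named resource GPU-50fd60c4). This is a minimal sketch, assuming it is run on the manager node. The node ID is the one from the service constraint; DRY_RUN and resources_cmd are hypothetical helper names introduced only for illustration. By default the script only prints the command instead of calling docker.

```shell
#!/bin/sh
# Worker node ID, copied from the --constraint in the question.
NODE_ID="xyriecy63n8995enp2mro0nvx"
# Hypothetical switch: 1 prints the command, 0 actually runs it.
DRY_RUN="${DRY_RUN:-1}"

resources_cmd() {
  # Builds a "docker node inspect" call that prints the generic
  # resources (for example GPUs) advertised by node $1 as JSON.
  printf "docker node inspect %s --format '{{json .Description.Resources.GenericResources}}'\n" "$1"
}

if [ "$DRY_RUN" = "1" ]; then
  resources_cmd "$NODE_ID"
else
  eval "$(resources_cmd "$NODE_ID")"
fi
```

Run it with DRY_RUN=0 on the manager to execute the inspection; if the output is empty or null, the node is probably not advertising any gpu resources to the scheduler.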
So, how can one start a GPU-tied service and attach it to an overlay network?
Apparently, there is no need to specify --generic-resource in my case. Without it, the service has access to all GPUs listed to Docker via --node-generic-resource gpu=xxx. The downside is that you can't control the GPU count per service, but I can live with that.
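As a rough sketch of that workaround, the command from the question can be reissued with the GPU reservation flag simply left out. The service name, node ID, network ID, and image are copied from the question; DRY_RUN and service_cmd are hypothetical helper names, and by default the script only prints the command rather than running it.

```shell
#!/bin/sh
# Values copied from the original "docker service create" call.
NODE_ID="xyriecy63n8995enp2mro0nvx"
NETWORK="d9gqsljvmpy7"
# Hypothetical switch: 1 prints the command, 0 actually runs it.
DRY_RUN="${DRY_RUN:-1}"

service_cmd() {
  # Same service as in the question, but with no GPU reservation
  # flag, so the scheduler does not try to reserve a GPU for the task.
  echo docker service create --name=hw \
    --constraint="node.id==$NODE_ID" \
    --network="$NETWORK" \
    busybox:latest sh -c "'while true; do echo Hello; sleep 2; done'"
}

if [ "$DRY_RUN" = "1" ]; then
  service_cmd
else
  eval "$(service_cmd)"
fi
```

Whether the container then sees the GPUs still depends on how the NVIDIA runtime is set up on the worker, which this sketch does not change.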