I have a Docker swarm of two nodes: a manager node (an AWS instance) and a worker node (a multi-GPU rig on a desk next to me), both running Ubuntu 18.04 and Docker.io 19.03.6, build 369ce74a3c. On the worker node I set up the nvidia-docker runtime and tested it (it works). On the manager node I set up an overlay network, and now I'm trying to start a service with GPU access and attach it to that overlay network. No luck: the service fails to start, with the error "assigned node no longer meets constraints". This is how I start the service:

docker service create --name=hw --constraint=node.id==xyriecy63n8995enp2mro0nvx --network=d9gqsljvmpy7 --generic-resource "gpu=1" busybox:latest sh -c "while true; do echo Hello; sleep 2; done"

And this is the status it reports:

ID                          NAME                IMAGE                                                                                    NODE                DESIRED STATE       CURRENT STATE                     ERROR                                                                  PORTS
ur0uut7xq8qyjafejwt3xlbv4   hw.1         busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a   node-4             Ready               Rejected less than a second ago   "assigned node no longer meets constraints"
w83690e7dzcc56ahysp8s5xi9    \_ hw.1     busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a   node-4             Shutdown            Rejected less than a second ago   "node is missing network attachments, ip addresses may be exhausted"

Task details:

docker inspect ur0uut7xq8qyjafejwt3xlbv4
[
    {
        "ID": "ur0uut7xq8qyjafejwt3xlbv4",
        "Version": {
            "Index": 156466
        },
        "CreatedAt": "2020-10-13T06:53:54.822993602Z",
        "UpdatedAt": "2020-10-13T06:54:00.063967596Z",
        "Labels": {},
        "Spec": {
            "ContainerSpec": {
                "Image": "busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a",
                "Args": [
                    "sh",
                    "-c",
                    "while true; do echo Hello; sleep 2; done"
                ],
                "Init": false,
                "DNSConfig": {},
                "Isolation": "default"
            },
            "Resources": {
                "Limits": {},
                "Reservations": {
                    "GenericResources": [
                        {
                            "DiscreteResourceSpec": {
                                "Kind": "gpu",
                                "Value": 1
                            }
                        }
                    ]
                }
            },
            "Placement": {
                "Constraints": [
                    "node.id==xyriecy63n8995enp2mro0nvx"
                ],
                "Platforms": [
                    {
                        "Architecture": "amd64",
                        "OS": "linux"
                    },
                    {
                        "OS": "linux"
                    },
                    {
                        "OS": "linux"
                    },
                    {
                        "OS": "linux"
                    },
                    {
                        "Architecture": "arm64",
                        "OS": "linux"
                    },
                    {
                        "Architecture": "386",
                        "OS": "linux"
                    },
                    {
                        "Architecture": "mips64le",
                        "OS": "linux"
                    },
                    {
                        "Architecture": "ppc64le",
                        "OS": "linux"
                    },
                    {
                        "Architecture": "s390x",
                        "OS": "linux"
                    }
                ]
            },
            "Networks": [
                {
                    "Target": "d9gqsljvmpy7wjrxa5q09bgtb"
                }
            ],
            "ForceUpdate": 0
        },
        "ServiceID": "mef68axo6ztmu7ojkiwcxxj0a",
        "Slot": 1,
        "NodeID": "xyriecy63n8995enp2mro0nvx",
        "Status": {
            "Timestamp": "2020-10-13T06:53:59.979035656Z",
            "State": "rejected",
            "Message": "preparing",
            "Err": "node is missing network attachments, ip addresses may be exhausted",
            "ContainerStatus": {
                "ContainerID": "",
                "PID": 0,
                "ExitCode": 0
            },
            "PortStatus": {}
        },
        "DesiredState": "shutdown",
        "NetworksAttachments": [
            {
                "Network": {
                    "ID": "d9gqsljvmpy7wjrxa5q09bgtb",
                    "Version": {
                        "Index": 32157
                    },
                    "CreatedAt": "2020-10-12T13:39:55.061260869Z",
                    "UpdatedAt": "2020-10-12T13:39:55.062498427Z",
                    "Spec": {
                        "Name": "testnet",
                        "Labels": {},
                        "DriverConfiguration": {
                            "Name": "overlay"
                        },
                        "Attachable": true,
                        "IPAMOptions": {
                            "Driver": {
                                "Name": "default"
                            },
                            "Configs": [
                                {
                                    "Subnet": "172.25.0.0/16",
                                    "Gateway": "172.25.0.1"
                                }
                            ]
                        },
                        "Scope": "swarm"
                    },
                    "DriverState": {
                        "Name": "overlay",
                        "Options": {
                            "com.docker.network.driver.overlay.vxlanid_list": "4097"
                        }
                    },
                    "IPAMOptions": {
                        "Driver": {
                            "Name": "default"
                        },
                        "Configs": [
                            {
                                "Subnet": "172.25.0.0/16",
                                "Gateway": "172.25.0.1"
                            }
                        ]
                    }
                },
                "Addresses": [
                    "172.25.96.221/16"
                ]
            }
        ],
        "GenericResources": [
            {
                "NamedResourceSpec": {
                    "Kind": "gpu",
                    "Value": "GPU-50fd60c4"
                }
            }
        ]
    }
]

My overlay network:

docker inspect d9gqsljvmpy7
[
    {
        "Name": "testnet",
        "Id": "d9gqsljvmpy7wjrxa5q09bgtb",
        "Created": "2020-10-12T13:39:55.061260869Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.25.0.0/16",
                    "Gateway": "172.25.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4097"
        },
        "Labels": null
    }
]

The service starts normally if I omit either --network or --generic-resource. Starting it without --network and attaching the network afterwards doesn't work either.

I enabled debug logs on both nodes but didn't see anything suspicious other than the same error message:

Oct 12 13:40:45 node-4 dockerd[1166]: time="2020-10-12T13:40:45.975574449Z" level=error msg="fatal task error" error="node is missing network attachments, ip addresses may be exhausted" module=node/agent/taskmanager node.id=xyriecy63n8995enp2mro0nvx service.id=mef68axo6ztmu7ojkiwcxxj0a task.id=twcbj9emeopm2qfq0i7lwftbe

I also checked for IP address exhaustion with docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker/ip-util-check and, as expected, it finds nothing:

Overlay IP Utilization Report
----
Network testnet/d9gqsljvmpy7 has an IP address capacity of 65533 and uses 0 addresses spanning over 0 nodes
        Network OK: network will have 49149 available IPs before passing the 75% subnet use
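
The figures in this report can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the subtraction of three reserved addresses is an assumption inferred from the reported capacity of 65533 for a /16 subnet, not something confirmed from the tool itself.

```shell
#!/bin/sh
# Address capacity of an IPv4 subnet with the given prefix length.
# ASSUMPTION: three addresses are treated as reserved; this is inferred
# from the report above (65536 - 3 = 65533), not from the tool's source.
overlay_capacity() {
    prefix="$1"
    echo $(( (1 << (32 - prefix)) - 3 ))
}

# Number of addresses that fit below 75% utilization (integer division).
threshold_75() {
    capacity="$1"
    echo $(( capacity * 75 / 100 ))
}

cap=$(overlay_capacity 16)
echo "capacity: $cap"
echo "75% threshold: $(threshold_75 "$cap")"
```

With 0 addresses in use, both values match the report, which supports the conclusion that the subnet is nowhere near exhausted despite what the error message suggests.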

So, how can one start a GPU-bound service and attach it to an overlay network?

There is 1 answer

Kirill On

Apparently, there is no need to specify --generic-resource in my case. Without it, the service has access to all GPUs advertised to Docker via --node-generic-resource gpu=xxx. The downside is that you can't control the GPU count per service, but I can live with that.
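
As a rough sketch of that workaround, the command from the question can be reissued with the --generic-resource flag dropped. The node ID and network ID are the ones from the question. The hw_service_cmd helper is hypothetical and only prints the command so it can be reviewed before it is run on the manager node.

```shell
#!/bin/sh
# Values taken from the question; adjust them for your own swarm.
NODE_ID="xyriecy63n8995enp2mro0nvx"
NETWORK="d9gqsljvmpy7"

# Hypothetical helper: prints the service-create command without
# --generic-resource, so no GPU reservation is requested from the scheduler.
hw_service_cmd() {
    echo docker service create --name=hw \
        "--constraint=node.id==$NODE_ID" \
        "--network=$NETWORK" \
        busybox:latest sh -c "'while true; do echo Hello; sleep 2; done'"
}

# Print the command; paste it into a shell on the manager node to run it.
hw_service_cmd
```

Because nothing is reserved, the scheduler no longer tracks GPU usage for this service, which is the trade-off mentioned above.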