Docker Swarm DNS fails for service name, but service virtual IP address and task IPs are resolved correctly. How to debug this?

TL;DR

In summary, from within a service container defined on the same stack as the service nginx:

nslookup nginx          # gives the virtual IP of the nginx service
nslookup tasks.nginx    # gives the correct IP of an nginx container (10.0.17.20)
ping 10.0.17.20         # this works
ping nginx              # doesn't work
curl http://10.0.17.20  # this works
curl http://nginx       # doesn't work

However, curl http://tasks.nginx does resolve.

The question

Within a Docker Swarm, I am finding that DNS resolution by service name is failing from within containers. For example, with the following stack configuration:

version: "3.9"

networks:
  elk7:
    name: elk7
    driver: overlay
    attachable: true
    ipam:
      driver: default
      config:
        - subnet: "10.0.17.0/24"

services:
  setup:
    ...
    networks:
      - elk7

  es01: # placed on manager node 1
    ...
    networks:
      - elk7

  # ... es02/es03/etc

  nginx: # placed on manager node 2
    ...
    networks:
      - elk7

Manager Node 1

docker network inspect elk7 shows that on this node a container for the es01 service exists (I assume that I'm SUPPOSED to see only this node's containers?)

"9c1a019a5c83c466615819b5401bbb0e58c31f078a96f13ed4af3905c837d565": {
    "Name": "elk7_es01.1.meux9ctcmnfwdejiqmxftyeq8",
    "EndpointID": "e0b102827e93eb3d0439513778c308eeb1201cd1e8e252f1361692c2f9981cc5",
    "MacAddress": "02:42:0a:00:11:11",
    "IPv4Address": "10.0.17.17/24",
    "IPv6Address": ""
},

And the IPAM section also seems useful to put here:

"IPAM": {
    "Driver": "default",
    "Options": null,
    "Config": [
        {
            "Subnet": "10.0.17.0/24",
            "Gateway": "10.0.17.1"
        }
    ]
},

Manager Node 2

Logging into the Nginx container (docker container exec -it <container id> bash), I am not able to contact the es01 service via the service name, but I CAN contact it via the IP address:

root@2d6f42945a18:/# ping es01
PING es01 (10.0.17.16) 56(84) bytes of data.
From 2d6f42945a18 (10.0.17.20) icmp_seq=1 Destination Host Unreachable
From 2d6f42945a18 (10.0.17.20) icmp_seq=2 Destination Host Unreachable
From 2d6f42945a18 (10.0.17.20) icmp_seq=3 Destination Host Unreachable

--- es01 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4090ms

vs

root@2d6f42945a18:/# ping 10.0.17.17
PING 10.0.17.17 (10.0.17.17) 56(84) bytes of data.
64 bytes from 10.0.17.17: icmp_seq=1 ttl=64 time=0.220 ms
64 bytes from 10.0.17.17: icmp_seq=2 ttl=64 time=0.128 ms

--- 10.0.17.17 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1016ms
rtt min/avg/max/mdev = 0.128/0.174/0.220/0.046 ms

When I do an nslookup or dig I get an answer - so the service name es01 does seem to resolve:

root@2d6f42945a18:/# nslookup es01
Server:     127.0.0.11
Address:    127.0.0.11#53

Non-authoritative answer:
Name:   es01
Address: 10.0.17.16

============================================================

root@2d6f42945a18:/# dig es01

; <<>> DiG 9.18.24-1-Debian <<>> es01
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17322
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;es01.              IN  A

;; ANSWER SECTION:
es01.           600 IN  A   10.0.17.16

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11) (UDP)
;; WHEN: Tue Mar 26 07:10:16 UTC 2024
;; MSG SIZE  rcvd: 42

The 10.0.17.16 IP address is (I've just learned) the virtual IP address associated with the es01 service. docker service inspect elk7_es01 shows:

...
"Endpoint": {
    "Spec": {
        "Mode": "vip"
    },
    "VirtualIPs": [
        {
            "NetworkID": "rdtgyz97aahhsrlwz8u2mm8fy",
            "Addr": "10.0.17.16/24"
        }
    ]
}
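
To check whether an address returned by DNS is a service VIP rather than a task IP, the inspect output can be parsed programmatically. The following is only a minimal sketch: it assumes the usual shape of docker service inspect output (a JSON array of service objects), the helper names virtual_ips and is_vip are made up for illustration, and SAMPLE is trimmed down to the fields shown above plus the service name.

```python
import ipaddress
import json

# Trimmed sample in the shape printed by `docker service inspect elk7_es01`.
# Only the fields quoted in the question (plus Spec.Name) are kept.
SAMPLE = """
[
  {
    "Spec": {"Name": "elk7_es01"},
    "Endpoint": {
      "Spec": {"Mode": "vip"},
      "VirtualIPs": [
        {"NetworkID": "rdtgyz97aahhsrlwz8u2mm8fy", "Addr": "10.0.17.16/24"}
      ]
    }
  }
]
"""


def virtual_ips(inspect_output):
    """Map each service name to its VIPs, with the /prefix stripped."""
    result = {}
    for svc in json.loads(inspect_output):
        name = svc.get("Spec", {}).get("Name", svc.get("ID", "?"))
        vips = svc.get("Endpoint", {}).get("VirtualIPs") or []
        result[name] = [
            str(ipaddress.ip_interface(v["Addr"]).ip)
            for v in vips
            if v.get("Addr")
        ]
    return result


def is_vip(address, inspect_output):
    """True if `address` is one of the VIPs listed in the inspect output."""
    return any(address in ips for ips in virtual_ips(inspect_output).values())
```

With the sample above, 10.0.17.16 (what nslookup es01 returns) is reported as a VIP, while 10.0.17.17 (the container address from docker network inspect) is not.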

I'm unsure why I cannot contact a service task (i.e. container) via service name, when I CAN resolve the virtual IP of a service from within another service task (container).

What could be the issue? My Swarm nodes are all Ubuntu 20.04 LXC containers configured via Proxmox 7.0.11. The host IP addresses are all in the 10.8.66.0/24 range (not sure if this is important or not).

I do see that there are similar questions on Stack Overflow (Docker Swarm Failing to Resolve DNS by Service Name With Python Celery Workers Connecting to RabbitMQ Broker Resulting in Connection Timeout), however the answer didn't help in my case.

Another question (Wrong IP address in docker swarm service) mentions that you can look at the DNS resolution for service tasks explicitly via nslookup tasks.es01, and that gives the correct container IP addresses:

root@2d6f42945a18:/# nslookup tasks.es01
Server:     127.0.0.11
Address:    127.0.0.11#53

Non-authoritative answer:
Name:   tasks.es01
Address: 10.0.17.17
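
Since the service name and tasks.<service> both resolve, the difference seems to be in reachability rather than in DNS itself. A small script can make that comparison repeatable from inside a container. This is a rough sketch, not a definitive diagnostic: it assumes Python 3 is available in the container (the question does not say so), the function names resolve, tcp_ok, diagnose and check are hypothetical, and the hints it prints are only a starting point.

```python
import socket


def resolve(name):
    """Return the sorted IPv4 addresses `name` resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def tcp_ok(ip, port, timeout=2.0):
    """True if a TCP connection to ip:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(vip_ok, task_ok):
    """Turn the two reachability results into a short hint."""
    if vip_ok and task_ok:
        return "both the service name and the task IPs are reachable"
    if task_ok:
        return "task IPs reachable but the VIP is not: look at the VIP path, not DNS"
    if vip_ok:
        return "VIP reachable but the task IPs are not"
    return "neither reachable: check overlay connectivity and the port"


def check(service, port):
    """Resolve `service` and `tasks.<service>`, then test each address on `port`."""
    vips = resolve(service)
    tasks = resolve("tasks." + service)
    vip_ok = any(tcp_ok(ip, port) for ip in vips)
    task_ok = any(tcp_ok(ip, port) for ip in tasks)
    return {"vips": vips, "tasks": tasks, "hint": diagnose(vip_ok, task_ok)}
```

For example, print(check("nginx", 80)) run from another container on the elk7 network would compare the two paths that curl http://nginx and curl http://tasks.nginx exercise in the summary at the top. For es01 the port would need to be whatever the service actually listens on.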