etcd snapshot restore + DNS discovery issue


I'm trying to restore a 5-node etcd cluster (using DNS discovery) on Amazon ECS from a snapshot, but each node is starting up as a single-node cluster, and the nodes aren't adding each other as members.

The start script inside the Docker container for etcd is as follows:

# Private IP and short hostname from the EC2 instance metadata service
THIS_IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
THIS_NAME=$(curl http://169.254.169.254/latest/meta-data/hostname | cut -d . -f 1)
# If a snapshot exists in S3, restore it into this node's data dir first
aws s3 cp s3://test_bucket/snapshot.db .
if [ -f "./snapshot.db" ]; then
    echo "restoring from db...."
    ETCDCTL_API=3 etcdctl --data-dir ${THIS_NAME}.etcd snapshot restore snapshot.db
fi
etcd --data-dir=${THIS_NAME}.etcd --name ${THIS_NAME} \
    --discovery-srv ${DISCOVERY_SRV} \
    --initial-advertise-peer-urls http://${THIS_IP}:2380 \
    --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls http://${THIS_IP}:2380 \
    --listen-client-urls http://0.0.0.0:2379 \
    --initial-cluster-state ${CLUSTER_STATE} \
    --initial-cluster-token ${TOKEN}

The way it works is that the node name (THIS_NAME) is the short form of the instance hostname, something like ip-10-0-6-22, and the private IP address (THIS_IP) is retrieved from the EC2 instance metadata service.
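For context, DNS discovery resolves the peer list from SRV records under the discovery domain, so records along these lines need to exist (the domain etcd.internal and the hostnames here are just illustrative):

# etcd looks up _etcd-server._tcp.<discovery domain> to find its peers
dig +short SRV _etcd-server._tcp.etcd.internal
# expected: one record per node, pointing at the peer port, e.g.
# 0 0 2380 ip-10-0-6-22.etcd.internal.
# 0 0 2380 ip-10-0-6-23.etcd.internal.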

The logs look like this (newest entries first):

2021-04-16 10:22:07.689108 W | etcdserver: read-only range request "key:\"/runtime/corev3sit/ocbc/@shared/counterparties/saxo/last_activities_synced_time\" " with result "error:auth: invalid auth token" took too long (1m59.99874666s) to execute
2021-04-16 10:15:10.691908 N | etcdserver/membership: set the initial cluster version to 3.2
2021-04-16 10:15:10.691954 I | etcdserver/api: enabled capabilities for version 3.2
2021-04-16 10:15:10.690268 I | etcdserver: setting up the initial cluster version to 3.2
2021-04-16 10:15:10.690348 I | etcdserver: published {Name:ip-10-6-0-44 ClientURLs:[http://10.6.0.44:2380]} to cluster cdf818194e3a8c32
2021-04-16 10:15:10.690581 I | embed: ready to serve client requests
2021-04-16 10:15:10.691082 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2021-04-16 10:15:10.689378 I | raft: 8e9e05c52164694d is starting a new election at term 1
2021-04-16 10:15:10.689459 I | raft: 8e9e05c52164694d became candidate at term 2
2021-04-16 10:15:10.689480 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 2
2021-04-16 10:15:10.689512 I | raft: 8e9e05c52164694d became leader at term 2
2021-04-16 10:15:10.689523 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2
2021-04-16 10:15:09.812773 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2021-04-16 10:15:09.805550 I | etcdserver: starting server... [version: 3.2.26, cluster version: to_be_decided]
2021-04-16 10:15:09.801880 W | auth: simple token is not cryptographically signed
2021-04-16 10:15:09.788452 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 1
2021-04-16 10:15:09.788569 I | raft: 8e9e05c52164694d became follower at term 1
2021-04-16 10:15:09.788668 I | raft: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 1, commit: 1, applied: 1, lastindex: 1, lastterm: 1]
2021-04-16 10:15:09.788899 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2021-04-16 10:15:09.787344 I | etcdserver: name = ip-10-6-0-44
2021-04-16 10:15:09.787530 I | etcdserver: data dir = ip-10-6-0-44.etcd
2021-04-16 10:15:09.787634 I | etcdserver: member dir = ip-10-6-0-44.etcd/member
2021-04-16 10:15:09.787722 I | etcdserver: heartbeat = 100ms
2021-04-16 10:15:09.787780 I | etcdserver: election = 1000ms
2021-04-16 10:15:09.787858 I | etcdserver: snapshot count = 100000
2021-04-16 10:15:09.787951 I | etcdserver: advertise client URLs = http://10.6.0.44:2380
2021-04-16 10:15:09.769610 I | etcdserver: recovered store from snapshot at index 1
2021-04-16 10:15:09.768017 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2021-04-16 10:15:09.768251 I | embed: listening for peers on http://0.0.0.0:2380
2021-04-16 10:15:09.768367 I | embed: listening for client requests on 0.0.0.0:2379
2021-04-16 10:15:09.767325 I | etcdmain: etcd Version: 3.2.26
2021-04-16 10:15:09.767583 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2021-04-16 10:15:09.767670 I | etcdmain: Go Version: go1.11.6
2021-04-16 10:15:09.767783 I | etcdmain: Go OS/Arch: linux/amd64
2021-04-16 10:15:09.767844 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2021-04-16 10:15:09.745143 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
2021-04-16 15:45:09 restoring from db....
2021-04-16 15:45:09 download: s3://etcd/snapshot.db to ./snapshot.db
2021-04-16 15:45:08 (curl progress output from the two instance-metadata requests)

Can anyone help me with this issue?


1 Answer

Answered by pragman:

Sharing this for other people who run into this issue. It turns out DNS discovery is only used during the initial cluster bootstrap, and a quorum of existing members needs to be maintained for any member addition or removal, so restored nodes won't re-form the cluster through discovery.
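To illustrate: once a cluster exists, membership changes have to go through the members API against the live cluster rather than through discovery, roughly like this (the endpoint, member name, and IPs are illustrative):

# requires a quorum of the current members to be reachable
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.6.22:2379 \
    member add ip-10-0-6-23 --peer-urls=http://10.0.6.23:2380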

That left me with only one option... I created a single-node etcd cluster from the backup and used etcdctl make-mirror to copy the data into a freshly bootstrapped cluster, roughly as sketched below.
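A minimal sketch of that workflow, assuming the restored node is reachable at 10.0.6.99 and the new, empty cluster (bootstrapped via DNS discovery as usual) at new-cluster.example.com; both addresses are placeholders:

# restore the snapshot into a throwaway single-node cluster and start it
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir restore.etcd
etcd --data-dir restore.etcd \
    --listen-client-urls http://0.0.0.0:2379 \
    --advertise-client-urls http://10.0.6.99:2379 &

# stream every key from the restored node into the new cluster;
# make-mirror keeps syncing until interrupted, so stop it once the
# destination has caught up
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.6.99:2379 \
    make-mirror new-cluster.example.com:2379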

Not the best way to restore a backup, but at least I didn't lose any data.