I'm trying to restore a 5-node etcd cluster (using DNS discovery) on Amazon ECS from a snapshot, but each node starts up as a single-node cluster, and the nodes never add each other as members.
The start script inside the etcd Docker container is as follows:
THIS_IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
THIS_NAME=$(curl http://169.254.169.254/latest/meta-data/hostname | cut -d . -f 1)

aws s3 cp s3://test_bucket/snapshot.db .

if [ -f "./snapshot.db" ]; then
    echo "restoring from db...."
    ETCDCTL_API=3 etcdctl --data-dir ${THIS_NAME}.etcd snapshot restore snapshot.db
fi

etcd --data-dir=${THIS_NAME}.etcd \
  --name ${THIS_NAME} \
  --discovery-srv ${DISCOVERY_SRV} \
  --initial-advertise-peer-urls http://${THIS_IP}:2380 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --advertise-client-urls http://${THIS_IP}:2380 \
  --listen-client-urls http://0.0.0.0:2379 \
  --initial-cluster-state ${CLUSTER_STATE} \
  --initial-cluster-token ${TOKEN}
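For reference, `etcdctl snapshot restore` can also seed each member's metadata explicitly instead of relying on discovery; the member names and IPs below are placeholders, not values from my setup:

```shell
# Restore one member's data dir while seeding the cluster membership explicitly.
# infra0/infra1/infra2 and the 10.0.1.x IPs are placeholder names/addresses.
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name infra0 \
  --data-dir infra0.etcd \
  --initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380 \
  --initial-cluster-token restored-cluster \
  --initial-advertise-peer-urls http://10.0.1.10:2380
```

This is a command fragment that needs a real snapshot file to run; it is not what my script above does.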
The node name (THIS_NAME) is derived from the container's hostname and looks like ip-10-0-6-22, and the private IP address (THIS_IP) is retrieved from the EC2 instance metadata service.
The logs look like this (newest entries first):
2021-04-16 10:22:07.689108 W | etcdserver: read-only range request "key:\"/runtime/corev3sit/ocbc/@shared/counterparties/saxo/last_activities_synced_time\" " with result "error:auth: invalid auth token" took too long (1m59.99874666s) to execute
2021-04-16 10:15:10.691908 N | etcdserver/membership: set the initial cluster version to 3.2
2021-04-16 10:15:10.691954 I | etcdserver/api: enabled capabilities for version 3.2
2021-04-16 10:15:10.690268 I | etcdserver: setting up the initial cluster version to 3.2
2021-04-16 10:15:10.690348 I | etcdserver: published {Name:ip-10-6-0-44 ClientURLs:[http://10.6.0.44:2380]} to cluster cdf818194e3a8c32
2021-04-16 10:15:10.690581 I | embed: ready to serve client requests
2021-04-16 10:15:10.691082 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2021-04-16 10:15:10.689378 I | raft: 8e9e05c52164694d is starting a new election at term 1
2021-04-16 10:15:10.689459 I | raft: 8e9e05c52164694d became candidate at term 2
2021-04-16 10:15:10.689480 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 2
2021-04-16 10:15:10.689512 I | raft: 8e9e05c52164694d became leader at term 2
2021-04-16 10:15:10.689523 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2
2021-04-16 10:15:09.812773 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2021-04-16 10:15:09.805550 I | etcdserver: starting server... [version: 3.2.26, cluster version: to_be_decided]
2021-04-16 10:15:09.801880 W | auth: simple token is not cryptographically signed
2021-04-16 10:15:09.788452 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 1
2021-04-16 10:15:09.788569 I | raft: 8e9e05c52164694d became follower at term 1
2021-04-16 10:15:09.788668 I | raft: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 1, commit: 1, applied: 1, lastindex: 1, lastterm: 1]
2021-04-16 10:15:09.788899 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2021-04-16 10:15:09.787344 I | etcdserver: name = ip-10-6-0-44
2021-04-16 10:15:09.787530 I | etcdserver: data dir = ip-10-6-0-44.etcd
2021-04-16 10:15:09.787634 I | etcdserver: member dir = ip-10-6-0-44.etcd/member
2021-04-16 10:15:09.787722 I | etcdserver: heartbeat = 100ms
2021-04-16 10:15:09.787780 I | etcdserver: election = 1000ms
2021-04-16 10:15:09.787858 I | etcdserver: snapshot count = 100000
2021-04-16 10:15:09.787951 I | etcdserver: advertise client URLs = http://10.6.0.44:2380
2021-04-16 10:15:09.769610 I | etcdserver: recovered store from snapshot at index 1
2021-04-16 10:15:09.768017 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2021-04-16 10:15:09.768251 I | embed: listening for peers on http://0.0.0.0:2380
2021-04-16 10:15:09.768367 I | embed: listening for client requests on 0.0.0.0:2379
2021-04-16 10:15:09.767325 I | etcdmain: etcd Version: 3.2.26
2021-04-16 10:15:09.767583 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2021-04-16 10:15:09.767670 I | etcdmain: Go Version: go1.11.6
2021-04-16 10:15:09.767783 I | etcdmain: Go OS/Arch: linux/amd64
2021-04-16 10:15:09.767844 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2021-04-16 10:15:09.745143 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
2021-04-16 15:45:09 restoring from db....
2021-04-16 15:45:09 Completed 256.0 KiB/668.0 KiB (3.5 MiB/s) with 1 file(s) remaining Completed 512.0 KiB/668.0 KiB (6.5 MiB/s) with 1 file(s) remaining Completed 668.0 KiB/668.0 KiB (8.4 MiB/s) with 1 file(s) remaining download: s3://etcd/snapshot.db to ./snapshot.db
2021-04-16 15:45:09 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 44 100 44 0 0 11000 0 --:--:-- --:--:-- --:--:-- 11000
2021-04-16 15:45:09 % Total % Received % Xferd Average Speed Time Time Time Current
2021-04-16 15:45:09 Dload Upload Total Spent Left Speed
2021-04-16 15:45:08 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 9 100 9 0 0 3000 0 --:--:-- --:--:-- --:--:-- 3000
2021-04-16 15:45:08 % Total % Received % Xferd Average Speed Time Time Time Current
2021-04-16 15:45:08 Dload Upload Total Spent Left Speed
Can anyone help me with this issue?
Sharing this for other people who run into this issue. It turns out DNS discovery is only supported during the initial cluster bootstrap, and a quorum of existing members must be maintained for any member addition or removal.
That left me with only one option: I created a single-node etcd cluster from the backup and used
etcdctl make-mirror
to get the job done. Not the best way to restore a backup, but at least I didn't lose any data.
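The workaround roughly amounts to the following steps; the endpoints and data-dir names here are placeholders, not my production values:

```shell
# 1. Restore the snapshot into a throwaway single-node etcd (placeholder IP 10.0.0.5).
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir restore.etcd
etcd --data-dir restore.etcd --name restore \
  --advertise-client-urls http://10.0.0.5:2379 \
  --listen-client-urls http://0.0.0.0:2379 &

# 2. Bootstrap the new 5-node cluster from scratch. DNS discovery works here,
#    because this is an initial bootstrap rather than a restore.

# 3. Mirror all keys from the restored node into the new cluster
#    (new-cluster.example.com is a placeholder destination endpoint).
ETCDCTL_API=3 etcdctl --endpoints http://10.0.0.5:2379 \
  make-mirror http://new-cluster.example.com:2379
```

Note that make-mirror copies keys through the client API, so it does not preserve things like leases or the original revision history.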