CoreOS Partial Cluster Update

I'm trying to set up a smallish CoreOS cluster on AWS EC2 instances within a VPC. For this exercise I'm using two auto scaling groups: one of 3 machines, which will form the core etcd and consul cluster, and a second, currently with a single node, that will actually scale as the application grows. All of them are in a common etcd cluster.

This week coreos.com released build 681 to the stable channel. One of the nodes was updated to 681.0 immediately; however, 48 hours later the 3 nodes in the main cluster remain on version 647.2. When I check the journals, I see the following:

Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(48)] Starting/Resuming transfer
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(164)] Setting up curl options for HTTPS
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(427)] Setting up timeout source: 1 seconds.
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(240)] HTTP response code: 200
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(297)] Transfer completed (200), 267 bytes downloaded
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:omaha_request_action.cc(574)] Omaha request response: <?xml version="1.0" encoding="UTF-8"?>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <response protocol="3.0" server="update.core-os.net">
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <daystart elapsed_seconds="0"></daystart>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <app appid="e96281a6-xxxx-xxxx-xxxx-xxxxxxxxxxxx" status="ok">
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <updatecheck status="noupdate"></updatecheck>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: </app>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: </response>

So the nodes get a response saying that there are no updates.

Is this how the CoreOS team load-balances their file servers, or is there some additional configuration I'm missing? Is this how CoreOS tries to nudge me towards paid services? My understanding of the update process was that nodes would get updated one after another, like dominoes.
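To see only the server's verdict rather than the full fetcher chatter, the journal can be filtered for the `updatecheck` element. The following is a minimal sketch, not an official tool: `updatecheck_statuses` is a hypothetical helper name, and it assumes the log lines look like the excerpt above.

```shell
# Hypothetical helper: reads update_engine journal lines on stdin and
# prints each distinct updatecheck status with the number of occurrences.
updatecheck_statuses() {
  grep -o '<updatecheck status="[^"]*"' \
    | sed 's/.*status="\([^"]*\)"/\1/' \
    | sort | uniq -c \
    | awk '{print $2 ": " $1}'
}

# Usage on a node (assumes the systemd unit is named update-engine):
#   journalctl -u update-engine --no-pager | updatecheck_statuses
```

A node that is consistently told `noupdate` is reaching the update server fine; the server is simply not offering it a new version.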

This is my current state of the cluster:

$ for m in $(fleetctl list-machines -fields="machine" -full -no-legend); do fleetctl ssh "$m" cat /etc/lsb-release; done
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=647.2.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 647.2.0"
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=681.0.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 681.0.0"
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=647.2.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 647.2.0"
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=647.2.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 647.2.0"
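With more machines, the raw listing gets hard to scan. A small sketch that condenses it into a count per release follows; `count_releases` is a hypothetical helper name, and it assumes its input is the concatenated `/etc/lsb-release` contents shown above.

```shell
# Hypothetical helper: reads concatenated /etc/lsb-release contents on stdin
# and prints how many machines report each DISTRIB_RELEASE value.
count_releases() {
  grep '^DISTRIB_RELEASE=' \
    | cut -d= -f2 \
    | sort | uniq -c \
    | awk '{print $2 ": " $1}'
}

# Usage, piping in the output of the loop above:
#   for m in $(fleetctl list-machines -fields="machine" -full -no-legend); do
#     fleetctl ssh "$m" cat /etc/lsb-release
#   done | count_releases
```

For the cluster state listed here, this would report three machines on 647.2.0 and one on 681.0.0.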

Update a week later: the cluster is still stuck in a semi-upgraded state. I would love to know how to debug this kind of problem, if anyone has experience with it.

Answer by Chance Zibolski:

As mentioned in the comments, in situations like this it's possible that one machine received the update and that, after looking at the number of failures reported, the CoreOS team decided to stop rolling the update out to more hosts, to avoid causing further failures.

If you ever want to force an update check, you can run:

$ update_engine_client -check_for_update
[0123/220706:INFO:update_engine_client.cc(245)] Initiating update check and install.
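After forcing a check, it can help to see what the update engine is doing. The sketch below assumes that `update_engine_client -status` prints `KEY=VALUE` lines including a `CURRENT_OP=...` entry, as it did on CoreOS releases of that era; `current_op` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `update_engine_client -status` output on stdin
# and prints only the value of the CURRENT_OP line (assumed KEY=VALUE format).
current_op() {
  sed -n 's/^CURRENT_OP=//p'
}

# Usage on a node:
#   update_engine_client -status | current_op
```

If the server is not offering an update, the engine would be expected to return to an idle state shortly after the check.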

For more details refer to https://coreos.com/os/docs/latest/update-strategies.html