Delayed sequential restart of Compute Engine VMs in Managed Instance Groups


I have a Managed Instance Group of Google Compute Engine VMs (based on a template with container deployment on Container-Optimized OS). The MIG is regional (multi-zoned).

I can release an updated container image (docker run, docker tag, docker push), and then I'd like to restart all VMs in the MIG one by one so that they pick up the updated container (I'm not sure whether there's a simpler or better way to refresh the VMs' attached container). I also want to introduce a slight delay (say 60 seconds) between each VM's restart, so that only one or two VMs are unavailable at any given time.

What are some ways to do this programmatically (either via gcloud CLI or their API)?

I tried a rolling restart of the MIG, with maximum unavailable and minimum wait time flags set:

gcloud beta compute instance-groups managed rolling-action restart MIG_NAME \
    --project="..." --region="..." \
    --max-unavailable=1 --min-ready=60

... but it returns an error:

ERROR: (gcloud.beta.compute.instance-groups.managed.rolling-action.restart) Could not fetch resource:
 - Invalid value for field 'resource.updatePolicy.maxUnavailable.fixed': '1'. Fixed updatePolicy.maxUnavailable for regional managed instance group has to be either 0 or at least equal to the number of zones.

Is there a way to perform one-by-one instance restarts with a slight delay in between each action?

2

There are 2 answers

5
Grzenio On BEST ANSWER

Unfortunately, MIGs don't handle this use case for regional deployments as of January 2023. You can, however, orchestrate the rolling restart yourself along these lines (a rough sketch):

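# Collect the instance names in the MIG
INSTANCES=$(gcloud compute instance-groups managed list-instances MIG_NAME \
    --project="..." --region="..." \
    --format="value(instance.basename())")

for INSTANCE in $INSTANCES; do
  # Force restart the instance
  gcloud compute instance-groups managed update-instances MIG_NAME \
      --project="..." --region="..." \
      --instances="$INSTANCE" --minimal-action=restart \
      --most-disruptive-allowed-action=restart

  # WAIT (60 seconds here, as in the question)
  sleep 60

  # If the container on INSTANCE is not working correctly,
  # break here and alert the operator
done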
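
The WAIT step could be replaced by polling the MIG until the restarted VM is back. The following is only a sketch: `wait_for_instance` is a hypothetical helper name, and it assumes the `instanceStatus` and `currentAction` fields reported by `list-instances`. It tells you only that the VM is RUNNING with no pending action, not that the container inside it is healthy, so an application-level check would still be needed.

```shell
# Hypothetical helper: wait until INSTANCE in a regional MIG is RUNNING
# and has no action in progress. Returns 1 if it never gets there.
# Usage: wait_for_instance MIG REGION INSTANCE [ATTEMPTS] [DELAY_SECONDS]
wait_for_instance() {
  local mig="$1" region="$2" instance="$3"
  local attempts="${4:-30}" delay="${5:-10}"
  local i=0 state

  while [ "$i" -lt "$attempts" ]; do
    # One row per instance: name, VM status, current MIG action.
    state=$(gcloud compute instance-groups managed list-instances "$mig" \
        --region="$region" \
        --format="value(instance.basename(),instanceStatus,currentAction)" \
      | awk -v name="$instance" '$1 == name { print $2, $3 }')

    if [ "$state" = "RUNNING NONE" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Inside the loop it could be called as `wait_for_instance MIG_NAME "..." "$INSTANCE" || break`, which stops the rollout as soon as one VM fails to come back.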
0
user21311512 On

Try looking into opportunistic updates instead of rolling updates. We have a similar scenario. Rolling updates for a MIG, particularly a stateful one, won't work here because they bring down at least a minimum number of instances at once (for a regional MIG, at least the number of zones it spans). With opportunistic updates, you can achieve what you are looking for. We currently implement it the following way:

  • Set the instance template of the MIG to the new instance template created from the new image:
gcloud compute instance-groups managed set-instance-template ${instanceName}-group --template=${instanceName}-${tag} --region=${region}
  • Run a for loop and update each VM with the new template. Google provides a command (wait-until --stable) that pauses the script until the MIG is stable, which ensures that you do not apply updates to another VM until the current instance is stable.
for (( i = 1; i <= number_of_nodes; i++ ))
    do
        echo "Trying to update Kafka Node${i} with new instance template ${instanceName}-${tag}"
        (set -x
            gcloud compute instance-groups managed update-instances "${instanceName}-group" \
                --instances="${instanceName}-kafka-node${i}" \
                --region="${region}"
        )
        echo "Checking for MIG stability"
        (set -x
            gcloud compute instance-groups managed wait-until "${instanceName}-group" \
                --stable \
                --region="${region}"
        )
    done
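
The steps above rely on the MIG's update policy being opportunistic, so that changing the template does not by itself replace any VM. One way to set both in a single call is sketched below. `set_opportunistic_template` is a hypothetical wrapper name, and the sketch assumes the `--type=opportunistic` flag of `rolling-action start-update`.

```shell
# Hypothetical wrapper: point a regional MIG at a new template with an
# opportunistic update policy, so existing VMs are left untouched until
# each one is updated explicitly (for example with update-instances).
# Usage: set_opportunistic_template MIG TEMPLATE REGION
set_opportunistic_template() {
  local mig="$1" template="$2" region="$3"

  gcloud compute instance-groups managed rolling-action start-update "$mig" \
      --version="template=$template" \
      --type=opportunistic \
      --region="$region"
}
```

With the variables used in this answer, the call would be `set_opportunistic_template "${instanceName}-group" "${instanceName}-${tag}" "${region}"`.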

You can have a look at this documentation.