I am planning to upgrade our Cloud Composer environment from Composer 1 to Composer 2. However, I am concerned about disruptions that could occur in Cloud Composer 2 due to the new autoscaling behavior inherited from GKE Autopilot. In particular, since nodes will now autoscale based on demand, it seems that nodes with running workers could be killed off if GKE thinks the workers could be rescheduled elsewhere. This would be bad because my code isn't currently very tolerant of retries.
I think that this can be prevented by adding the following annotation to the worker pods: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
However, I don't know how to add annotations to worker pods created by Composer (I'm not creating them myself, after all). How can I do that?
EDIT: I think this issue is made more complex by the fact that it should still be possible for the cluster to evict a pod once it's finished processing all its Airflow tasks. If the annotation is added but doesn't go away once the pod is finished processing, I'm worried that could prevent the cluster from ever scaling down.
So a more dynamic solution may be needed, perhaps one that takes into account the actual tasks that Airflow is processing.
If I have understood your problem correctly, could you please try this solution:
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
Let me know if it works fine. Good luck.
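P.S. To make the annotation above concrete, and to touch on the concern that it should go away once a worker is idle, here is a minimal sketch of the patch bodies involved. It only builds the payloads and does not talk to a cluster. The helper name eviction_patch is hypothetical, and whether Composer lets you patch its managed worker pods, or keeps such a change in place, is an assumption I have not verified.

```python
import json

# Annotation key read by the Kubernetes cluster autoscaler.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def eviction_patch(protect: bool) -> dict:
    """Build a JSON merge patch body for a pod's metadata.annotations.

    protect=True  -> set the annotation to "false" (ask the autoscaler not to evict).
    protect=False -> set it to None; in a JSON merge patch, null removes the key.
    (Hypothetical helper, for illustration only.)
    """
    value = "false" if protect else None
    return {"metadata": {"annotations": {SAFE_TO_EVICT: value}}}


if __name__ == "__main__":
    # Body to apply while the worker is busy with Airflow tasks.
    print(json.dumps(eviction_patch(True), indent=2))
    # Body to apply once the worker is idle, so the node can scale down again.
    print(json.dumps(eviction_patch(False), indent=2))
```

A body like this could in principle be sent with kubectl patch or with the Kubernetes Python client's patch_namespaced_pod. Deciding when a worker counts as busy or idle would still have to come from Airflow itself, which this sketch does not attempt.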