I've seen instances in our Autopilot clusters where pods stay stuck in a Pending state for a long time. In the autoscaler logs, the messages only indicate that no current node can satisfy the constraints on the pod - there is no explanation for why a new node wouldn't simply be provisioned.

For example, the logs may indicate that existing nodes have an incompatible taint, or zone, or whatever - but none of that explains why the autoscaler doesn't just add a new node (and a new instance group if necessary).

My guess is that the pods are over-constrained and adding a node would impact utilization too much. Is there any way to recover this, or whatever the actual reason is, from the logs?


There are 1 answers

Dion V On

An example of an unsatisfiable condition would be a node selector where the Pod requests placement in a zone that doesn't exist. Since the autoscaler can't create a node in a non-existent zone, that Pod will sit in Pending forever, as per William Denniss.
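To make that concrete, here is a minimal Python sketch of a manual check you could run against a Pod spec. It is not how the autoscaler decides anything internally; it only illustrates the zone case described above. `topology.kubernetes.io/zone` is the well-known Kubernetes zone label, while the function name `unsatisfiable_zone`, the zone names, and the sample Pod spec are hypothetical values chosen for illustration.

```python
from typing import Optional, Set

# Well-known Kubernetes node label that records the zone a node runs in.
ZONE_LABEL = "topology.kubernetes.io/zone"


def unsatisfiable_zone(pod_spec: dict, cluster_zones: Set[str]) -> Optional[str]:
    """Return the zone the Pod asks for if it is not one of the cluster's zones.

    Returns None when the Pod has no zone selector or the zone is valid.
    """
    selector = pod_spec.get("nodeSelector") or {}
    zone = selector.get(ZONE_LABEL)
    if zone is not None and zone not in cluster_zones:
        return zone
    return None


# Hypothetical cluster zones and Pod spec (assumptions, not from the question).
cluster_zones = {"us-central1-a", "us-central1-b", "us-central1-c"}
pod_spec = {"nodeSelector": {ZONE_LABEL: "us-central1-z"}}

bad_zone = unsatisfiable_zone(pod_spec, cluster_zones)
if bad_zone is not None:
    print(f"Pod requests zone {bad_zone!r}, which is not in the cluster")
```

A Pod flagged this way would stay Pending no matter how many nodes are added, because no new node can ever carry the requested zone label.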

You can check this documentation as reference.