Karpenter consolidation downtime


Topic: Karpenter/EKS

Observed behavior:

  1. Karpenter scales out nodes with various instance types as the workload increases.
  2. Karpenter creates a few new nodes (e.g. 3 × t3.medium).
  3. While the pods on those new nodes are still initializing (not yet Running), Karpenter decides it is better to consolidate onto a single larger instance (e.g. 1 × t3.2xlarge), terminates all the pods (both pending and running), and reschedules them onto the new node. This causes downtime.
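One way to stop Karpenter from disrupting pods mid-initialization is the `karpenter.sh/do-not-disrupt` pod annotation (in older Karpenter versions this was `karpenter.sh/do-not-evict`). A rough sketch on a Deployment's pod template (the Deployment name and labels are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app        # placeholder name
spec:
  selector:
    matchLabels:
      app: my-app     # placeholder label
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Tells Karpenter not to voluntarily disrupt nodes running these pods.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
        - name: app
          image: my-app:latest   # placeholder image
```

Note that this blocks voluntary disruption entirely, so it has a similar trade-off to a PDB: it can also prevent the scale-down described below.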

Question: Is there a way to take advantage of Karpenter's flexibility to change instance types and sizes freely, while also guaranteeing that at least one pod is always available?

Attempts: I tried setting a PodDisruptionBudget with a minimum of 1 available. This does ensure that at least one pod is always available, but a new problem arises when Karpenter finds a better EC2 instance option or needs to scale down. For example, if a t3.2xlarge was created because load increased and the load later decreases, I end up with a t3.2xlarge running only one or two pods that cannot be deleted because of the PDB.
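For reference, the PDB described above would look roughly like this (the name and label selector are placeholders for the actual workload):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb    # placeholder name
spec:
  # At least one pod matching the selector must stay available
  # during voluntary disruptions (including Karpenter consolidation).
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app     # placeholder label
```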

Expected Behavior: I expected Karpenter to keep the old node, and therefore at least one pod, available until replacement pods are Running on the new node, and only then terminate the old node and its pods when changing instance types.
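One knob that gets closer to this is the NodePool's disruption configuration, which can rate-limit how many nodes Karpenter disrupts at once. A hedged sketch, assuming a recent Karpenter release (field names and the `consolidationPolicy` values differ between the v1beta1 and v1 APIs):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default       # placeholder name
spec:
  disruption:
    # Allow consolidation of empty and underutilized nodes
    # (the v1beta1 API used WhenUnderutilized instead).
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Wait before consolidating, giving new pods time to become ready.
    consolidateAfter: 5m
    budgets:
      # Disrupt at most one node at a time, so replacement pods can
      # become Running before the next node is terminated.
      - nodes: "1"
```

Combined with a PDB, a disruption budget of one node at a time means Karpenter replaces nodes gradually rather than terminating several at once, which avoids the all-pods-down window without permanently pinning an oversized instance.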
