Kubernetes Horizontal Pod Autoscaling safe drain: Celery worker scales down mid-task


I have a Kubernetes cluster on GKE. Among other things, my current layout has a worker Deployment (pod-worker) managed by a Horizontal Pod Autoscaler, which scales on an external metric exported to Stackdriver by BlueMedora's BindPlane.

The autoscaling itself works perfectly, but sometimes when it's time to scale down, a pod gets terminated while it is still working on a task, and that task never finishes.

The pod runs a Celery worker, while the job queues are managed by another Pod running RabbitMQ. I'm not sure whether to fix this on the Kubernetes side or the RabbitMQ side.

How can I prevent the HPA from scaling down a pod while it is still processing a task?

My pod specification (simplified):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-worker
  labels:
    component: worker
spec:
  selector:
    matchLabels:
      app: pod-worker
  replicas: 1
  template:
    metadata:
      labels:
        app: pod-worker
        component: worker
    spec:
      containers:
      - name: worker
        image: custom-image:latest
        imagePullPolicy: Always
        command: ['celery']
        args: ['worker','-A','celery_tasks.task','-l','info', '-Q', 'default,priority','-c','1', '-Ofair']
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 150m
            memory: 200Mi
        env:
         - name: POD_NAME
           valueFrom:
             fieldRef:
               fieldPath: metadata.name
      restartPolicy: Always
    
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: pod-worker
  labels:
    component: worker
spec:
  maxReplicas: 30
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pod-worker
  metrics:
    - external:
        metricName: external.googleapis.com|bluemedora|generic_node|rabbitmq|cluster|messages
        targetAverageValue: "40"
      type: External

There is 1 answer

paltaa (accepted answer)

To fix this you have multiple approaches. First, to avoid losing messages mid-processing, use RabbitMQ's manual acknowledgements: only ACK a message after the work has completed successfully. If the worker dies before acknowledging, the message is requeued and reprocessed by another worker.
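With Celery specifically, this maps to the task_acks_late setting (plus task_reject_on_worker_lost, so a worker process that is killed outright also triggers a redelivery). A minimal sketch of the app module referenced in the Deployment args above; the broker URL and task body are placeholders, not taken from the question:

import os

from celery import Celery

# Broker URL is an assumption; point it at the RabbitMQ Pod's service.
app = Celery('celery_tasks.task',
             broker=os.environ.get('BROKER_URL', 'amqp://rabbitmq:5672//'))

# Acknowledge the message only after the task finishes, not when it is received.
app.conf.task_acks_late = True
# If the worker process is lost (e.g. SIGKILL after the grace period), requeue the message.
app.conf.task_reject_on_worker_lost = True

@app.task
def do_work(payload):
    ...  # actual task body goes here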

Second, when the downscale starts, the pod receives a SIGTERM signal and Kubernetes waits for the period defined in the pod spec:

terminationGracePeriodSeconds: 90

So you can tune that value and set it high enough that the worker can shut down gracefully once its current task is done.

Once terminationGracePeriodSeconds has elapsed, the container receives a SIGKILL signal, which kills it immediately.
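For reference, the field lives at the same level as containers inside the pod template of the Deployment above (the 90 is just an example value):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90  # time Kubernetes waits between SIGTERM and SIGKILL
      containers:
      - name: worker
        # ... rest of the container spec from the question ...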

Also, you can handle these signals in Python; here is a small example:

import signal
import time


class GracefulKiller:
    """Flip a flag when SIGINT/SIGTERM arrives so the main loop can exit cleanly."""
    kill_now = False

    def __init__(self):
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)

    def exit_gracefully(self, signum, frame):
        self.kill_now = True


if __name__ == '__main__':
    killer = GracefulKiller()
    while not killer.kill_now:
        time.sleep(1)
        print("doing something in a loop ...")
    print("End of the program. I was killed gracefully :)")
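Note that when the container's entrypoint is the celery worker command itself, as in the Deployment above, Celery already treats SIGTERM as a warm shutdown and tries to finish the currently executing task before exiting, so in practice late acknowledgements plus a generous terminationGracePeriodSeconds usually cover the scale-down case; a handler like the one above is mainly useful for custom long-running loops.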