I have a Kubernetes cluster on GKE. Among other things, my current layout has a Pod (worker-pod) configured with a Horizontal Pod Autoscaler, which scales on an external metric exposed to Stackdriver by BlueMedora's BindPlane.
The autoscaling works perfectly, but sometimes when it's time to scale down, a pod gets drained while it's still working on a task, and that task never finishes.
The pod runs a Celery worker, while the job queues are managed by another Pod running RabbitMQ. I'm not sure whether to fix this on the Kubernetes side or on the RabbitMQ side.
How can I prevent the HPA from removing a pod while it's still processing a task?
My pod specification (simplified):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-worker
  labels:
    component: worker
spec:
  selector:
    matchLabels:
      app: pod-worker
  replicas: 1
  template:
    metadata:
      labels:
        app: pod-worker
        component: worker
    spec:
      containers:
      - name: worker
        image: custom-image:latest
        imagePullPolicy: Always
        command: ['celery']
        args: ['worker', '-A', 'celery_tasks.task', '-l', 'info', '-Q', 'default,priority', '-c', '1', '-Ofair']
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 150m
            memory: 200Mi
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      restartPolicy: Always
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: pod-worker
  labels:
    component: worker
spec:
  maxReplicas: 30
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pod-worker
  metrics:
  - external:
      metricName: external.googleapis.com|bluemedora|generic_node|rabbitmq|cluster|messages
      targetAverageValue: "40"
    type: External
There are multiple approaches to fix this. First, to avoid losing messages, use RabbitMQ manual acknowledgements: the worker should ACK a message only after the task has completed successfully, so if the worker dies mid-task the message is requeued and reprocessed by another worker.
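In Celery this maps to late acknowledgements. A minimal sketch of the relevant settings, assuming the app object lives in celery_tasks/task.py (matching the -A argument in your worker command) and that the broker is reachable at a service called rabbitmq:

from celery import Celery

# Broker URL is an assumption; point it at your actual RabbitMQ service
app = Celery('celery_tasks', broker='amqp://rabbitmq:5672//')

# Acknowledge the message only after the task finishes, not when it is received
app.conf.task_acks_late = True
# If the worker process is killed mid-task (e.g. by SIGKILL), requeue the message
app.conf.task_reject_on_worker_lost = True

With these settings a task interrupted by a scale-down event is redelivered to one of the remaining workers instead of being lost.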
Second, when a downscale happens, Kubernetes sends the pod a SIGTERM signal and then waits for the period defined in the pod spec:
terminationGracePeriodSeconds: 90
You can raise that value so the worker has enough time to finish its current task and shut down gracefully.
After the terminationGracePeriodSeconds have elapsed, the pod receives a SIGKILL signal and is terminated immediately.
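For reference, the field sits at the pod level of the Deployment template, as a sibling of containers and restartPolicy (90 is only an example value; size it to your longest expected task):

spec:
  template:
    spec:
      # give the worker up to 90s to finish its current task before SIGKILL
      terminationGracePeriodSeconds: 90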
You can also handle these signals in Python. Here is a small sketch of a SIGTERM handler; the process_next_task function below is only a placeholder for your own work loop:
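import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; remember it and let the current task finish
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

def process_next_task():
    # Placeholder for the real work; a long-running task would go here
    time.sleep(5)

def main():
    while not shutdown_requested:
        process_next_task()
    # Exit cleanly before the grace period runs out and SIGKILL arrives
    sys.exit(0)

if __name__ == '__main__':
    main()

Note that the Celery worker itself already performs a warm shutdown on SIGTERM (it stops consuming new messages and finishes the tasks it is running), so a manual handler like this is mainly useful if you run your own consumer script instead of the stock worker.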