KEDA has been rock solid for us; however, we've hit some very strange issues where jobs stop scaling after an initial scale.
We deploy multiple versions of our ScaledJobs, with each version listening to its own Redis queue. Each job is configured in the same namespace with a unique, versioned name.
Configuration looks like this:
Max Replica Count:             8
Min Replica Count:             0
Polling Interval:              15
Rollout:
Scaling Strategy:
Successful Jobs History Limit: 0
Triggers:
  Metadata:
    Enable TLS:         true
    Host:               [IP Address]
    List Length:        1
    List Name:          [List Name]
    Password From Env:  CELERY_PASS
    Port:               6378
  Type:                 redis
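In case it's easier to read, here is roughly what that looks like as a ScaledJob manifest in YAML; the name, namespace, and job template below are placeholders, and only the scaling-related fields mirror the describe output above:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: worker-v1          # placeholder; each version gets its own name
  namespace: workers       # placeholder; all versions share one namespace
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: worker   # placeholder job template
            image: worker:v1
        restartPolicy: Never
  pollingInterval: 15
  successfulJobsHistoryLimit: 0
  maxReplicaCount: 8
  minReplicaCount: 0
  triggers:
    - type: redis
      metadata:
        host: <redis-host>      # the [IP Address] above
        port: "6378"
        listName: <list-name>   # unique queue per version
        listLength: "1"
        enableTLS: "true"
        passwordFromEnv: CELERY_PASS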
If we submit jobs to the queue, scaling works fine at first, but subsequent submissions sometimes don't trigger any scaling. The most suspect part is that the operator logs report metrics for the running pods but show 0 pending jobs, even though the Redis list clearly still has items in it.
2023-10-27T18:24:10Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "[Scaled Job Name]", "scaledJob.Namespace": "[Namespace]", "Number of running Jobs": 2}
2023-10-27T18:24:10Z INFO scaleexecutor Scaling Jobs {"scaledJob.Name": "[Scaled Job Name]", "scaledJob.Namespace": "[Namespace]", "Number of pending Jobs ": 0}
Is there some undocumented throttling, caching, or timeout behavior that might be causing this?