How to configure an alert when a specific pod in a k8s cluster goes into the Failed state?


We are running Spark on a k8s cluster with the help of the spark-operator. For monitoring we are using Prometheus.

We want to configure an alert so that whenever any pod related to a Spark job transitions to the Failed state, we get notified. The alert rule should check for such failed pods over the last 5 minutes.

We tried to leverage kube-state-metrics for this, but we are not able to query the metric over a time range. At any given point in time, the metric kube_pod_status_phase{namespace="spark-operator",phase="Failed"} gives us the list of all pods that are currently in the Failed state.

Any suggestion or guidance on this is most welcome.


There is 1 answer

Kim
sum_over_time(kube_pod_status_phase{namespace="spark-operator",phase="Failed"}[5m:1m]) > 0
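The [5m:1m] part is a subquery: the instant selector is re-evaluated every 1 minute over the last 5 minutes, and sum_over_time adds up those samples per pod. Since kube_pod_status_phase is 1 while a pod is in the given phase, the result is greater than 0 for any pod that reported the Failed phase at any of those evaluation points.

As a rough sketch, the same expression can be wrapped in a Prometheus alerting rule; the group name, alert name, labels, and annotations below are illustrative placeholders, and the file would be loaded the usual way via rule_files (or via a PrometheusRule resource if you use the prometheus-operator):

groups:
  - name: spark-pod-alerts            # illustrative group name
    rules:
      - alert: SparkPodFailed         # illustrative alert name
        # fires when any pod in the spark-operator namespace reported
        # the Failed phase at any evaluation step in the last 5 minutes
        expr: sum_over_time(kube_pod_status_phase{namespace="spark-operator",phase="Failed"}[5m:1m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} went into Failed state"

Because the expression returns one series per failed pod, the alert fires per pod and the pod name is available in the annotations via the pod label.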