I'm setting up Grafana alerts and need guidance on the conditions. I want two separate alerts, one for CPU and one for memory. Each should trigger if, over the last 10 minutes, the average usage of any pod (across all namespaces) exceeds 90% of that pod's own limit for the resource.
Can someone help with the expressions for these scenarios?
I tried this for memory usage:

    avg_over_time(container_memory_usage_bytes[10m]) >= kube_pod_container_resource_limits{resource="memory"} * 0.9

This didn't work. I expect the query to return every pod whose average CPU/memory usage over the last 10 minutes exceeds 90% of that pod's actual CPU/memory limit.
Update:
I think I managed to build one of the queries I wanted, but only for a specific pod.
Here is the query for memory:

    avg_over_time(container_memory_usage_bytes{pod="nginx-f7d787f6c-t8x9s", container="nginx"}[10m]) > on(pod_uid) kube_pod_container_resource_limits{resource="memory", pod="nginx-f7d787f6c-t8x9s", container="nginx"} * 0.9

I need this query to run for all pods in the cluster, not just for a specific pod.
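
To make the goal concrete, here is a rough sketch of the cluster-wide form I have in mind. It is untested and rests on assumptions: that both metrics carry `namespace`, `pod`, and `container` labels (typical for cAdvisor and kube-state-metrics v2, but it depends on the scrape and relabeling config), and that series with an empty `container` label or `container="POD"` are pod-level or pause-container entries that should be skipped. It compares each container with its own limit rather than summing per pod.

```promql
# Average memory usage per container over the last 10 minutes.
# max by (...) collapses any duplicate series to one per container
# so the vector match below is one-to-one (assumption: duplicates may exist).
max by (namespace, pod, container) (
  avg_over_time(container_memory_usage_bytes{container!="", container!="POD"}[10m])
)
# Keep only containers at or above 90% of their own memory limit.
>= on (namespace, pod, container)
  max by (namespace, pod, container) (
    kube_pod_container_resource_limits{resource="memory"}
  ) * 0.9
```

Containers without a memory limit have no series on the right-hand side, so they would simply drop out of the result.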
Example:
| Pod name | Containers | Avg memory usage (last 10 minutes) | Memory limit |
|---|---|---|---|
| Pod1 | 5 | 9Mi | 10Mi |
| Pod2 | 3 | 9Mi | 1000Mi |
Since the containers of Pod1 are using 90% of their memory limit (9Mi of 10Mi), I expect them to appear in the query result. Pod2 is using under 1% of its limit (9Mi of 1000Mi), so it should not appear.
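
For the CPU alert, I imagine something along these lines. Again, this is only an untested sketch under assumptions: `container_cpu_usage_seconds_total` is the cAdvisor counter as exposed on my cluster, `rate(...[10m])` gives the average number of cores used over the window, the CPU limit from kube-state-metrics is expressed in cores, and both metrics share `namespace`, `pod`, and `container` labels.

```promql
# Average CPU cores used per container over the last 10 minutes.
max by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[10m])
)
# Keep only containers at or above 90% of their own CPU limit.
>= on (namespace, pod, container)
  max by (namespace, pod, container) (
    kube_pod_container_resource_limits{resource="cpu"}
  ) * 0.9
```

If something like this is correct, each returned series would be one container that should fire the alert.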