Trouble with Prometheus metrics (Adapter and metricsQuery)


Original problem: I would like my Kubernetes cluster to always have at least 2 nodes with zero GPU consumption. When a job arrives and occupies one of them, the autoscaler should create another spare node.

I found out that I can rely on the DCGM_FI_DEV_GPU_UTIL metric. If DCGM_FI_DEV_GPU_UTIL == 0, the node is "idle". In PromQL I can simply write count(DCGM_FI_DEV_GPU_UTIL == 0) to get the number of "idle" nodes.

However, I do not understand how to write the metricsQuery in the Prometheus Adapter config. All the examples I found look like this:

sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)

However, I need something like count(<<.Series>> == 0), and this does not work. Any idea how I can expose a metric to the HPA that gives the number of nodes with no GPU consumption?


There are 2 answers

Answer by Trarbish (best answer)

I ended up using KEDA with its Prometheus trigger. It is easy to use and accepts a PromQL query directly. The only disadvantage is that it is an "average value" scaler, but that is not critical in my case.
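A KEDA ScaledObject with a Prometheus trigger might look roughly like the sketch below. The object name, the target Deployment, the Prometheus address, the replica bounds, and the threshold are all placeholders assumed for illustration; only the query comes from the question.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-idle-scaler              # hypothetical name
spec:
  scaleTargetRef:
    name: gpu-worker                 # hypothetical Deployment to scale
  minReplicaCount: 2                 # assumed bounds, adjust as needed
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        # Assumed in-cluster Prometheus address
        serverAddress: http://prometheus.monitoring.svc:9090
        # PromQL from the question: number of GPUs reporting 0% utilization
        query: count(DCGM_FI_DEV_GPU_UTIL == 0)
        # Target value per replica (placeholder)
        threshold: "1"
```

By default KEDA passes the query result to the HPA as an average-value target, so the HPA aims for roughly the query result divided by the threshold, rounded up, as the replica count. The query therefore has to be written so that its value rises when more replicas are wanted; the idle count above only shows where the PromQL goes.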

Answer by Vitezslav Skacel

Your jobs are probably running in Kubernetes Pods. You may have a configuration where only one job Pod can run on a single Node. The first step is to configure your metrics for the Prometheus Adapter, which is described quite nicely here. This step ensures that the Pod is added.
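A rule along the following lines is one way the idle count from the question could be expressed in the Prometheus Adapter config. It is an untested sketch: it assumes the DCGM series carry a namespace label, and the exposed name idle_gpu_count is made up for illustration. The main difference from count(<<.Series>> == 0) is that the <<.LabelMatchers>> and <<.GroupBy>> placeholders are kept, because the adapter fills them in for each request.

```yaml
rules:
  # Discover the DCGM utilization series (assumes a "namespace" label exists)
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!=""}'
    resources:
      overrides:
        # Map the Prometheus label "namespace" to the Kubernetes namespace resource
        namespace: {resource: "namespace"}
    name:
      matches: "^DCGM_FI_DEV_GPU_UTIL$"
      as: "idle_gpu_count"           # hypothetical metric name exposed to the HPA
    # Count the series whose utilization is exactly 0, grouped as the adapter requests
    metricsQuery: 'count(<<.Series>>{<<.LabelMatchers>>} == 0) by (<<.GroupBy>>)'
```

Note that count() over an empty result returns no sample at all rather than 0, so the metric may be missing whenever no GPU is idle.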

In the second step, you need to configure a cluster autoscaler that adds another Node when needed. The cluster autoscaler depends on your Kubernetes provider (AWS, Azure, GCP...) and should be covered in their documentation. I personally use Cluster Autoscaler and Karpenter.