Problem
I am trying to implement a Horizontal Pod Autoscaler (HPA) on my AKS cluster. However, I'm unable to retrieve the GPU metrics (auto-generated by Azure) that my HPA requires to scale.
Example
As a reference, see this example where the HPA scales based on `targetCPUUtilizationPercentage: 50`. That is, the HPA will add or remove pods to reach an average CPU utilization of 50% across all pods. Ideally, I want to achieve the same with the GPU.
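For concreteness, that CPU-based example boils down to an HPA like the following (the deployment name and replica bounds are placeholders):

```yaml
# Illustrative CPU-based HPA; "my-app" and the replica bounds are placeholders.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50   # scale to keep average CPU around 50%
```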
Setup
I have deployed an AKS cluster with Azure Monitor enabled and my node size set to `Standard_NC6_Promo`, Azure's VM option that comes equipped with Nvidia's Tesla K80 GPU. However, in order to utilize the GPU, you must first install the appropriate plugin into your cluster, as explained here. Once you install this plugin, a number of GPU metrics are automatically collected by Azure and logged to a table named "InsightsMetrics" (see). From what I can read, the metric `containerGpuDutyCycle` will be the most useful for monitoring GPU utilization.
Current Situation
I can successfully see the insight metrics gathered by the installed plugin, one of which is `containerGpuDutyCycle`.

(Screenshot: InsightsMetrics table inside the Logs tab of the Kubernetes Service on the Azure Portal)
Now, how do I expose/provide this metric to my HPA?
Possible Solutions
What I've noticed is that if you navigate to the Metrics tab of your AKS cluster, you cannot retrieve these GPU metrics. I assume this is because these GPU "metrics" are technically logs and not "official" metrics. However, Azure does support something called log-based metrics, where the results of log queries can be treated as an "official" metric, but nowhere do I see how to create my own custom log-based metric.
Furthermore, Kubernetes supports custom and external metrics through its Metrics API, where metrics can be retrieved from external sources (such as Azure's Application Insights). Azure has an implementation of the Metrics API called the Azure Kubernetes Metrics Adapter. Perhaps I need to expose the `containerGpuDutyCycle` metric as an external metric using this? If so, how do I reference/expose the metric as external/custom?
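For context, if the metric were available through the external metrics API, I understand the HPA side would look roughly like this; the metric name and target value are illustrative, and wiring up the Azure side so this name actually resolves is exactly the part I don't know how to do:

```yaml
# Sketch of an HPA consuming an external metric (autoscaling/v2beta2).
# "containerGpuDutyCycle" only resolves if an adapter actually serves it.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: External
      external:
        metric:
          name: containerGpuDutyCycle
        target:
          type: AverageValue
          averageValue: "50"
```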
Alternative Solutions
My main concern is exposing the GPU metrics for my HPA. I'm using Azure's Kubernetes Metrics Adapter for now, as I assumed it would integrate better with my AKS cluster (same ecosystem). However, it's in alpha stage (not production ready). If anyone can solve my problem using an alternative metrics adapter (e.g. Prometheus), that would still be very helpful.
Many thanks for any light you can shed on this issue.
I managed to do this recently (just this week). I'll outline my solution and all the gotchas, in case that helps.
Starting with an AKS cluster, I installed a handful of components in order to harvest the GPU metrics and expose them to the HPA; the install script sketched below shows which ones and in what order.
The AKS cluster comes with a metrics server built in, so you don't need to worry about that. It is also possible to provision the cluster with the nvidia-device-plugin already applied, but that is currently not possible via Terraform (see: Is it possible to use aks custom headers with the azurerm_kubernetes_cluster resource?), which is how I was deploying my cluster.
To install all this stuff I used a script much like the following:
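(Reconstructed sketch; the chart names, versions, and namespaces are illustrative rather than my exact script.)

```bash
#!/usr/bin/env bash
# Rough install order; names and namespaces are placeholders.

# 1. NVIDIA device plugin, so the GPUs are schedulable as nvidia.com/gpu
kubectl apply -f nvidia-device-plugin.yml          # manifest from NVIDIA/k8s-device-plugin

# 2. dcgm-exporter, to publish the DCGM_* GPU metrics
#    (my own adapted manifest - see below - deployed into the same namespace as the workload to scale)
kubectl apply -f dcgm-exporter.yaml

# 3. Prometheus, with a custom values file containing the additionalScrapeConfigs
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f prometheus-values.yaml

# 4. prometheus-adapter, to expose the Prometheus metrics on the custom metrics API
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc \
  --set prometheus.port=9090
```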
Actually, I'm lying about the `dcgm-exporter`. I was experiencing a problem (my first "gotcha") where the `dcgm-exporter` was not responding to liveness requests in time, and was consistently entering a `CrashLoopBackOff` status (https://github.com/NVIDIA/gpu-monitoring-tools/issues/120). To get around this, I created my own `dcgm-exporter` k8s config (by taking details from here and modifying them slightly: https://github.com/NVIDIA/gpu-monitoring-tools) and applied it.

In doing this I experienced my second "gotcha", which was that in the latest `dcgm-exporter` images they have removed some GPU metrics, such as `DCGM_FI_DEV_GPU_UTIL`, largely because these metrics are resource intensive to collect (see https://github.com/NVIDIA/gpu-monitoring-tools/issues/143). If you want to re-enable them, make sure you run the `dcgm-exporter` with the arguments set as `["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]`, OR you can create your own image and supply your own metrics list, which is what I did by using my own Dockerfile.
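It was something like this (the base image tag and the csv file name are illustrative, not my exact versions):

```dockerfile
# Sketch: wrap the stock dcgm-exporter image with a custom metrics list.
# The base image tag and csv file name are placeholders.
FROM nvidia/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04

# A metrics csv that re-enables the metrics I care about (e.g. DCGM_FI_DEV_GPU_UTIL),
# based on the dcp-metrics-included.csv shipped in the image.
COPY my-dcgm-metrics.csv /etc/dcgm-exporter/my-dcgm-metrics.csv

# Default args pointing the exporter at the custom list
# (mirrors the ["-f", ...] args mentioned above).
CMD ["-f", "/etc/dcgm-exporter/my-dcgm-metrics.csv"]
```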
Another thing you can see from the above script is that I also used my own Prometheus helm chart values file. I followed the instructions from Nvidia's site (https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html), but found my third "gotcha" in the `additionalScrapeConfig`.

What I learned was that, in the final deployment, the HPA has to be in the same namespace as the service it's scaling (identified by `targetRef`), otherwise it can't find it to scale it, as you probably already know. But just as importantly, the `dcgm-metrics` `Service` also has to be in the same namespace, otherwise the HPA can't find the metrics it needs to scale by. So, I changed the `additionalScrapeConfig` to target the relevant namespace; the snippet below shows the idea. I'm sure there's a way to use the `additionalScrapeConfig.relabel_configs` section to enable you to keep `dcgm-exporter` in a different namespace and still have the HPA find the metrics, but I haven't had time to learn that voodoo yet.
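For reference, the relevant part of the values file looked something like this (the job name and namespace are illustrative; the shape follows Nvidia's dcgme2e guide):

```yaml
# prometheus-values.yaml (excerpt) - adapt names/namespaces to your setup
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: gpu-metrics
        scrape_interval: 1s
        metrics_path: /metrics
        scheme: http
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - my-app-namespace   # the namespace the HPA and target Deployment live in
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: kubernetes_node
```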
Once I had all of that, I could check that the DCGM metrics were being made available to the kube metrics server:
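A query along these lines does the job (assuming a Prometheus adapter, or similar, is serving the custom metrics API; the jq filter is just for readability):

```bash
# List the resources exposed on the custom metrics API and look for the DCGM metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name' | grep -i dcgm
```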
In the resulting list you really want to see a `services` entry, like so:
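Something like this (the exact metric names depend on the csv you enabled):

```
namespaces/DCGM_FI_DEV_GPU_UTIL
pods/DCGM_FI_DEV_GPU_UTIL
services/DCGM_FI_DEV_GPU_UTIL
```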
If you don't, it probably means that the dcgm-exporter deployment you used is missing the `ServiceAccount` component, and the HPA still won't work.

Finally, I wrote my HPA something like this:
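(A sketch; names, namespace, and the target value are placeholders. The important parts are the `Object` metric pointing at the dcgm-exporter `Service` and everything living in the same namespace.)

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gpu-app
  namespace: my-app-namespace        # same namespace as the Deployment AND the dcgm-exporter Service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gpu-app
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Object
      object:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        describedObject:
          apiVersion: v1
          kind: Service
          name: dcgm-exporter        # the dcgm-exporter Service in this namespace
        target:
          type: Value
          value: "80"                # scale up when the reported GPU utilisation exceeds 80
```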
and it all worked.
I hope this helps! I spent so long trying different methods shown on consultancy company blogs, Medium posts, etc., before discovering that the people who write these pieces have already made assumptions about your deployment which affect details you really need to know about (e.g. the namespacing issue).