Prometheus Adapter - Problem with Custom Metrics (FailedDiscoveryCheck)


In my project, I am trying to implement custom metrics to support HPA in K8S.

Following the guide Prometheus Custom Metrics Adapter step by step, I prepared the code and deployed it to my cluster (the only difference is that in my version of Kubernetes I'm using v1.custom.metrics.k8s.io instead of v1beta1.custom.metrics.k8s.io).

I'm using GKE in GCP.
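
For context, the end state I'm aiming for is an HPA that consumes the custom metric. A rough sketch (the HPA name and the target Deployment are placeholders, and the metric free_nodes_percentage is defined further below):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: selenium-node-hpa        # placeholder name
  namespace: fb-poster
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: selenium-node          # placeholder target Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: free_nodes_percentage
        target:
          type: AverageValue
          averageValue: "50"     # example threshold, not tuned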

YAML file with the Deployments, Services and the APIService:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: fb-poster
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - '--storage.tsdb.retention=6h'
            - '--storage.tsdb.path=/prometheus'
            - '--config.file=/etc/prometheus/prometheus.yml'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus/
            - name: data-volume
              mountPath: /prometheus
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: data-volume
          emptyDir: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter-monitor
  labels:
    app: prometheus-adapter-monitor
  namespace: fb-poster
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-adapter-monitor
  template:
    metadata:
      labels:
        app: prometheus-adapter-monitor
      name: prometheus-adapter-monitor
    spec:

      serviceAccountName: cluster-monitoring
      containers:
        - name: prometheus-adapter-monitor
          image: directxman12/k8s-prometheus-adapter-amd64:v0.5.0
          args:
            - "--secure-port=6443"
            - "--tls-cert-file=/var/run/serving-cert/serving.crt"
            - "--tls-private-key-file=/var/run/serving-cert/serving.key"
            - "--logtostderr=true"
            - "--metrics-relist-interval=1m"
            - "--prometheus-url=http://prometheus-service:9090"
            - "--v=10"
            - "--config=etc/adapter/config.yml"
          ports:
            - containerPort: 6443
          volumeMounts:
            - mountPath: /var/run/serving-cert
              name: volume-serving-cert
              readOnly: true
            - mountPath: /etc/adapter/
              name: config
              readOnly: true
            - mountPath: /tmp
              name: tmp-vol
      volumes:
        - name: volume-serving-cert
          secret:
            secretName: cm-adapter-serving-certs
        - name: config
          configMap:
            name: prometheus-config
        - name: tmp-vol
          emptyDir: {}

---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-adapter-monitor
  namespace: fb-poster
spec:
  ports:
    - port: 443
      targetPort: 6443
  selector:
    app: prometheus-adapter-monitor
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1.custom.metrics.k8s.io
spec:
  service:
    name: prometheus-adapter-monitor
    namespace: fb-poster
  group: custom.metrics.k8s.io
  version: v1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
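
As a sanity check, once everything is applied I'd expect discovery to work with something like this (the /v1 segment matches the version registered in the APIService above):

# Check the APIService status, then hit the aggregated discovery endpoint it registers.
kubectl get apiservice v1.custom.metrics.k8s.io
kubectl get --raw /apis/custom.metrics.k8s.io/v1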

configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: fb-poster
data:
  prometheus.yml: |
    global:
      scrape_interval: 10s
      evaluation_interval: 10s
    
    rule_files:
    - "custom_rules.yml"

    scrape_configs:
      - job_name: 'selenium-hub'
        static_configs:
          - targets: ['selenium-hub-export-service:9104']

      - job_name: 'prometheus'
        static_configs:
          - targets: ['prometheus-service:9090']

  custom_rules.yml: |
    groups:
      - name: custom.rules
        rules:
          - record: free_nodes_percentage
            expr: 100 * (selenium_grid_node_count - selenium_grid_session_count) / selenium_grid_node_count

  config.yml: |
    rules:
      - seriesQuery: 'free_nodes_percentage'
        resources:
          overrides:
            kubernetes_namespace: {resource: "namespace"}
            kubernetes_pod_name: {resource: "pod"}
        name:
          matches: "free_nodes_percentage"
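
To confirm that the recording rule itself produces data, one quick check (a sketch: curl-check is a throwaway pod name, and it assumes the prometheus-service name from above) is to query Prometheus directly for the recorded series:

# One-off curl pod in the same namespace; queries Prometheus for the recorded series.
kubectl -n fb-poster run curl-check --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -s 'http://prometheus-service:9090/api/v1/query?query=free_nodes_percentage'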

After applying this YAML to the cluster, all my pods are running. From the prometheus-adapter-monitor pod, I get the following logs:

I0620 21:06:01.122341       1 api.go:74] GET http://prometheus-service:9090/api/v1/series?match%5B%5D=selenium_grid_node_count%7Bkubernetes_namespace%21%3D%22%22%2Ckubernetes_pod_name%21%3D%22%22%7D&start=1687293961.117 200 OK      
I0620 21:06:01.122618       1 api.go:93] Response Body: {"status":"success","data":[]}
I0620 21:06:01.122800       1 provider.go:270] Set available metric list from Prometheus to: [[]]
I0620 21:06:01.286377       1 handler.go:143] prometheus-metrics-adapter: GET "/apis/custom.metrics.k8s.io/v1" satisfied by gorestful with webservice /apis/custom.metrics.k8s.io
I0620 21:06:01.286709       1 wrap.go:42] GET /apis/custom.metrics.k8s.io/v1: (736.267µs) 404 [[Go-http-client/2.0] 10.45.0.31:48374]

When I run kubectl describe apiservice v1.custom.metrics.k8s.io, I receive:

Name:         v1.custom.metrics.k8s.io
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-06-20T21:04:55Z
  Managed Fields:
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        f:group:
        f:groupPriorityMinimum:
        f:insecureSkipTLSVerify:
        f:service:
          .:
          f:name:
          f:namespace:
          f:port:
        f:version:
        f:versionPriority:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2023-06-20T21:04:55Z
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .:
          k:{"type":"Available"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
    Manager:         kube-apiserver
    Operation:       Update
    Subresource:     status
    Time:            2023-06-20T21:05:02Z
  Resource Version:  11540103
  UID:               7c82cbb9-41df-4469-9394-a03da9d34bef
Spec:
  Group:                     custom.metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            prometheus-adapter-monitor
    Namespace:       fb-poster
    Port:            443
  Version:           v1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2023-06-20T21:04:55Z
    Message:               failing or missing response from https://10.45.1.70:6443/apis/custom.metrics.k8s.io/v1: bad status from https://10.45.1.70:6443/apis/custom.metrics.k8s.io/v1: 404
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>

I have no clue how I can solve this problem. I found some related posts, but they don't cover my case and are from a few years ago.
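
One way I can think of to narrow this down (a sketch) is to watch how the adapter handles the discovery path that the API server probes:

# Watch which /apis/custom.metrics.k8s.io/* requests the adapter receives and with what status.
kubectl -n fb-poster logs deploy/prometheus-adapter-monitor -f | grep custom.metrics.k8s.io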


UPDATE 27.06.2023: @DanF, following your answer I have checked my cluster's network details:

Private cluster: Disabled
Network: default
Subnet: default
Stack type: IPv4
Private control plane's endpoint subnet: default
VPC-native traffic routing: Enabled
Pod IPv4 address range (default): 10.82.128.0/17
Cluster Pod IPv4 ranges (additional): None
IPv4 service range: 10.83.0.0/22

so I filled in @DanF's command like this:

gcloud compute firewall-rules create allow-prometheus-adapter --action ALLOW --direction INGRESS --source-ranges 110.82.128.0/17 --rules tcp:6443 --network default

but I'm still receiving the same error:

Spec:
  Group:                     custom.metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            prometheus-adapter-monitor
    Namespace:       fb-poster
    Port:            443
  Version:           v1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2023-06-27T19:07:53Z
    Message:               failing or missing response from https://10.82.128.155:6443/apis/custom.metrics.k8s.io/v1: bad status from https://10.82.128.155:6443/apis/custom.metrics.k8s.io/v1: 404
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>

1 Answer

Answer from DanF:

Yeah, I ran into this a while ago. I'm not sure if it's the same issue, but my solution was to run:

# --source-ranges: the master (control plane) CIDR block; change it to fit your K8s cluster
# --network: your VPC network name; get this from the UI console
gcloud compute firewall-rules create allow-prometheus-adapter \
    --action ALLOW \
    --direction INGRESS \
    --source-ranges 10.0.0.0/8 \
    --rules tcp:6443 \
    --network {{VPC Network Name}}

I think the main cause is that you are not allowing traffic into that port, which is why the discovery check fails.
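
Once the rule is in place, you can double-check it and then see whether the APIService flips to Available, e.g.:

# Confirm the rule exists with the expected source range and port, then re-check discovery.
gcloud compute firewall-rules describe allow-prometheus-adapter
kubectl describe apiservice v1.custom.metrics.k8s.io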