Prometheus Operator - OOM killed when enabling Istio monitoring


I would like to ask for help: how can I prevent Prometheus from being OOM-killed when enabling Istio metrics monitoring? I use the Prometheus Operator, and metrics monitoring works fine until I create the ServiceMonitors for Istio taken from this article by Prune on Medium. From the article they are as follows:

ServiceMonitor for Data Plane:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-oper-istio-dataplane
  labels:
    monitoring: istio-dataplane
    release: prometheus
spec:
  selector:
    matchExpressions:
      - {key: istio-prometheus-ignore, operator: DoesNotExist}
  namespaceSelector:
    any: true
  jobLabel: envoy-stats
  endpoints:
  - path: /stats/prometheus
    targetPort: http-envoy-prom
    interval: 15s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_container_port_name]
      action: keep
      regex: '.*-envoy-prom'
    - action: labelmap
      regex: "__meta_kubernetes_pod_label_(.+)"
    - sourceLabels: [__meta_kubernetes_namespace]
      action: replace
      targetLabel: namespace
    - sourceLabels: [__meta_kubernetes_pod_name]
      action: replace
      targetLabel: pod_name

ServiceMonitor for Control Plane:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-oper-istio-controlplane
  labels:
    release: prometheus
spec:
  jobLabel: istio
  selector:
    matchExpressions:
      - {key: istio, operator: In, values: [mixer,pilot,galley,citadel,sidecar-injector]}
  namespaceSelector:
    any: true
  endpoints:
  - port: http-monitoring
    interval: 15s
  - port: http-policy-monitoring
    interval: 15s

After the ServiceMonitor for the Istio Data Plane is created, memory usage climbs within a minute from around 10GB to 30GB and the Prometheus replicas are killed by Kubernetes. CPU usage stays normal. How can I prevent such a huge increase in resource usage? Is there something wrong with the relabelings? Prometheus is supposed to scrape metrics from around 500 endpoints.
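
For reference, one mitigation I have been sketching (this is not from the article, and the metric-name regex is my own assumption about which Envoy metric families dominate the /stats/prometheus payload) is to drop the raw envoy_* series at scrape time with metricRelabelings on the data-plane endpoint, keeping only the istio_* metrics:

  endpoints:
  - path: /stats/prometheus
    targetPort: http-envoy-prom
    interval: 15s
    # metricRelabelings are applied to the scraped samples before ingestion,
    # so dropped series never reach the TSDB. The regex is an assumption;
    # check one sidecar's /stats/prometheus output before rolling it out.
    metricRelabelings:
    - sourceLabels: [__name__]
      action: drop
      regex: 'envoy_(cluster|http|listener|server)_.*'
    # relabelings stay exactly as in the manifest above

The idea is that metricRelabelings filter samples, whereas relabelings only shape the target and its labels, so the former is where cardinality can actually be cut. I have not yet measured how much this reduces the series count.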


[EDIT]

From my investigation it seems that the relabelings are what has the greatest impact on resource usage. For example, if I change the targetLabel to pod instead of pod_name, resource usage grows immediately (the exact entry is shown below).
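
To be precise, this is the last relabeling of the data-plane ServiceMonitor; the only difference is the targetLabel value:

    - sourceLabels: [__meta_kubernetes_pod_name]
      action: replace
      targetLabel: pod    # was pod_name; with "pod" the memory growth starts immediately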

Either way, I have not found a solution to this issue. I tried the semi-official ServiceMonitor and PodMonitor provided by Istio on GitHub, but that only made Prometheus run longer before being OOM-killed. Now it takes around an hour to go from ~10GB to 32GB of memory usage.

What I can see is that after enabling the Istio metrics, the number of time series grows quite fast and never stops, which in my opinion looks like a memory leak. Before enabling Istio monitoring this number was quite stable.
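
As a stop-gap I am also considering a per-scrape cap via sampleLimit on the ServiceMonitor spec (a sketch only; the value 5000 is a guess I have not validated against the real per-sidecar series count):

spec:
  # Assumed cap: a scrape returning more than this many samples is rejected
  # outright, trading missing data for bounded memory growth.
  sampleLimit: 5000
  # selector, namespaceSelector and endpoints unchanged from the
  # data-plane manifest above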

Do you have any other suggestions?


There are 0 answers