I would like to ask you for help: how can I prevent Prometheus from being killed with Out Of Memory when enabling Istio metrics monitoring? I use the Prometheus Operator, and metrics monitoring works fine until I create the ServiceMonitors for Istio taken from this article by Prune on Medium. From the article they are as follows:
ServiceMonitor for Data Plane:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-oper-istio-dataplane
  labels:
    monitoring: istio-dataplane
    release: prometheus
spec:
  selector:
    matchExpressions:
      - {key: istio-prometheus-ignore, operator: DoesNotExist}
  namespaceSelector:
    any: true
  jobLabel: envoy-stats
  endpoints:
    - path: /stats/prometheus
      targetPort: http-envoy-prom
      interval: 15s
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_container_port_name]
          action: keep
          regex: '.*-envoy-prom'
        - action: labelmap
          regex: "__meta_kubernetes_pod_label_(.+)"
        - sourceLabels: [__meta_kubernetes_namespace]
          action: replace
          targetLabel: namespace
        - sourceLabels: [__meta_kubernetes_pod_name]
          action: replace
          targetLabel: pod_name
ServiceMonitor for Control Plane:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-oper-istio-controlplane
  labels:
    release: prometheus
spec:
  jobLabel: istio
  selector:
    matchExpressions:
      - {key: istio, operator: In, values: [mixer,pilot,galley,citadel,sidecar-injector]}
  namespaceSelector:
    any: true
  endpoints:
    - port: http-monitoring
      interval: 15s
    - port: http-policy-monitoring
      interval: 15s
After the ServiceMonitor for the Istio data plane is created, memory usage goes in just a minute from around 10 GB up to 30 GB and the Prometheus replicas are killed by Kubernetes. CPU usage stays normal. How can I prevent such a huge increase in resource usage? Is there something wrong with the relabelings? Prometheus is supposed to scrape metrics from around 500 endpoints.
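For example, would it make sense to drop the heaviest Envoy series with metricRelabelings in the data plane ServiceMonitor above? This is only a sketch of what I mean; the metric name pattern is my guess at what is heavy, not something I have verified:

  endpoints:
    - path: /stats/prometheus
      targetPort: http-envoy-prom
      interval: 15s
      metricRelabelings:
        # drop Envoy per-cluster series, which I suspect account for most of the cardinality
        - sourceLabels: [__name__]
          action: drop
          regex: 'envoy_cluster_.*'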
[EDIT]
From my investigation it seems that what has a great impact on resource usage is the relabelings. For example, if I change the targetLabel to pod instead of pod_name, resource usage grows immediately.
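Concretely, this is the only change I mean, in the last relabeling of the data plane ServiceMonitor above:

        - sourceLabels: [__meta_kubernetes_pod_name]
          action: replace
          targetLabel: pod   # changing pod_name to pod here is enough to make memory usage grow immediately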
Anyway, I have not found a solution to this issue. I have used the semi-official ServiceMonitor and PodMonitor provided by Istio on GitHub, but that only made Prometheus run longer before being killed with Out Of Memory. Now it takes around an hour to go from ~10 GB to 32 GB of memory usage.
What I can see is that after enabling the Istio metrics, the number of time series grows quite fast and never stops, which in my opinion looks like a memory leak. Before enabling Istio monitoring this number is quite stable.
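For reference, this is how I am watching the growth; I assume the built-in TSDB gauge and a per-metric count are the right things to look at:

# current number of series in the TSDB head block
prometheus_tsdb_head_series

# top metric names by series count, to see where the new series come from
topk(10, count by (__name__) ({__name__=~".+"}))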
Do you have any other suggestions?