I am having an issue with cAdvisor where not all metrics are reliably returned when I query its metrics endpoint. Specifically, querying

container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*"}

through Prometheus often returns results for only a fraction of the nodes in my Kubernetes cluster. This happens when the corresponding metrics have not been scraped for over 5 minutes (at which point they become stale), but I'm not sure why cAdvisor doesn't return all metrics every time its endpoint is queried successfully.
Curling the endpoint repeatedly shows that some metrics are only returned on some requests, so the Prometheus query above returns data for all nodes only if each node happens to have been scraped at least once within the last 5 minutes, which more often than not is not the case.
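One way to see the gap from the Prometheus side is to compare how many nodes report the series right now against how many reported it at any point over a longer window (the 1h range below is an arbitrary choice):

Number of nodes reporting the series right now (limited by the 5-minute staleness window):

count(count(container_fs_limit_bytes{device=~"^/dev/.*$",id="/"}) by (kubernetes_io_hostname))

Number of nodes that reported it at least once in the last hour:

count(count(count_over_time(container_fs_limit_bytes{device=~"^/dev/.*$",id="/"}[1h])) by (kubernetes_io_hostname))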
One workaround is to average the metric over a window longer than 5 minutes, but this is not ideal.
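For reference, that workaround looks roughly like this (the 15m window is just an example; anything comfortably longer than the 5-minute staleness period would do):

avg_over_time(container_fs_limit_bytes{device=~"^/dev/.*$",id="/",kubernetes_io_hostname=~"^.*"}[15m])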
kubectl version:
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.4", GitCommit:"793658f2d7ca7f064d2bdf606519f9fe1229c381", GitTreeState:"clean", BuildDate:"2017-08-17T08:48:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3+coreos.0", GitCommit:"42de91f04e456f7625941a6c4aaedaa69708be1b", GitTreeState:"clean", BuildDate:"2017-08-07T19:44:31Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Prometheus version: 1.7.1
Prometheus configuration:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
    scheme: http
    timeout: 10s
rule_files:
- /etc/prometheus-rules/alert.rules
scrape_configs:
- job_name: kubernetes-nodes
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: https
  kubernetes_sd_configs:
  - api_server: null
    role: node
    namespaces:
      names: []
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
  relabel_configs:
  - source_labels: []
    separator: ;
    regex: __meta_kubernetes_node_label_(.+)
    replacement: $1
    action: labelmap
  - source_labels: []
    separator: ;
    regex: (.*)
    target_label: __address__
    replacement: kubernetes.default.svc:443
    action: replace
  - source_labels: [__meta_kubernetes_node_name]
    separator: ;
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}:4194/proxy/metrics
    action: replace
  metric_relabel_configs:
  - source_labels: [id]
    separator: ;
    regex: ^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$
    target_label: rkt_container_name
    replacement: ${2}-${1}
    action: replace
  - source_labels: [id]
    separator: ;
    regex: ^/system\.slice/(.+)\.service$
    target_label: systemd_service_name
    replacement: ${1}
    action: replace
This is a known bug in how cAdvisor uses the Prometheus client libraries.