I deploy several Flink clusters on Kubernetes in standalone mode and export their metrics through a single Prometheus Pushgateway.
The problem is that the metrics arrive at Prometheus intermittently, which leaves gaps between the data points when they are displayed in Grafana.
[Screenshot: Grafana graph showing the gaps]
Prometheus target:
monitoring/pushgateway/0 (1/1 up)
Endpoint: http://172.19.88.111:9091/metrics
State : UP
Labels: endpoint="tcp" instance="172.19.88.111:9091" job="pushgateway" namespace="flink-sql" pod="pushgateway-76d64545dd-6prdn" service="pushgateway"
I queried the Pushgateway directly, but I do not get all the metrics every time:
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:17 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:18 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:18 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:19 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:20 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:20 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
flink_jobmanager_numRegisteredTaskManagers{host="jobmanager",instance="",job="model"} 20
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:20 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:21 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:22 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:22 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:23 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:23 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
flink_jobmanager_numRegisteredTaskManagers{host="jobmanager",instance="",job="model"} 20
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:24 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:24 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:25 UTC 2021
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:26 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date && curl -s http://pushgateway.flink-sql:9091/metrics | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:27 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
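The pattern in the samples above is easier to see when each response is reduced to the `job` labels it contains. The following is a minimal sketch, not part of the original setup: `jobs_reporting` is a hypothetical helper name, and the loop in the comment reuses the Pushgateway URL and metric name from the transcript.

```shell
#!/bin/sh
# Hypothetical helper: reads Pushgateway /metrics text on stdin and prints the
# value of the `job` label for every sample of the given metric, one per line.
jobs_reporting() {
    metric="$1"
    # Keep only sample lines of this metric (skips # HELP / # TYPE lines),
    # then extract the job label value.
    grep "^${metric}{" | sed -n 's/.*job="\([^"]*\)".*/\1/p' | sort
}

# Example (not run here): take 20 samples one second apart and count how often
# each job was present.
# for i in $(seq 1 20); do
#     curl -s http://pushgateway.flink-sql:9091/metrics \
#         | jobs_reporting flink_jobmanager_numRegisteredTaskManagers
#     sleep 1
# done | sort | uniq -c
```

If every job were always present, each job would be counted 20 times; lower counts show which series drop out and how often.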
The configuration in my flink-conf.yaml:
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.flink-sql
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-sql
metrics.reporter.promgateway.randomJobNameSuffix: false
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.interval: 3 SECONDS
I even set the Prometheus scrape interval and metrics.reporter.promgateway.interval to 1 second, but it had no effect.
My guess is this:
The gaps in the graph appear because Prometheus does not have continuous data stored.
Prometheus gets its metrics from the Pushgateway.
The Pushgateway gets its metrics from the JobManager/TaskManager.
The data reported by the JobManager/TaskManager to the Pushgateway is not cached by the Pushgateway.
So when Prometheus scrapes the Pushgateway periodically, it only gets what the Pushgateway holds at that moment, not all the data the JobManager/TaskManager reported.
This matches what I observed, but it is not conclusive. The Pushgateway must play some role in it, though. Note that I have not checked whether Flink's metric reporter actually reports data periodically as expected.
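One possible mechanism behind this guess, offered as an assumption rather than a confirmed diagnosis: the Pushgateway stores metrics per grouping key (the job name plus any extra labels), and a push sent with PUT replaces everything stored under that key. With randomJobNameSuffix set to false, the JobManager and all TaskManagers of one cluster would share the same key, so each push could wipe out what the other processes sent. The toy model below only illustrates that replacement behaviour with one file per key; `push`, `scrape`, the host `tm_1` and the TaskManager metric line are illustrative, not taken from the original setup.

```shell
#!/bin/sh
# Toy model of a push gateway, NOT the real implementation: one file per
# grouping key (here only the job name). Assumption: a push replaces
# everything that was stored under the same key before.
store=$(mktemp -d)
trap 'rm -rf "$store"' EXIT

push() {    # usage: push <job> <metric line>
    printf '%s\n' "$2" > "$store/$1"
}

scrape() {  # print whatever the gateway holds right now
    cat "$store"/* 2>/dev/null
}

# A JobManager and a TaskManager that share the job name "flink-sql" push in
# turn; a JobManager from another cluster uses its own job name "model".
push flink-sql 'flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",job="flink-sql"} 0'
push model     'flink_jobmanager_numRegisteredTaskManagers{host="jobmanager",job="model"} 20'
push flink-sql 'flink_taskmanager_Status_JVM_CPU_Load{host="tm_1",job="flink-sql"} 0.1'

# The "flink-sql" JobManager line is gone until that JobManager pushes again;
# the "model" line is untouched because it lives under a different key.
scrape
```

If this is what happens, a scrape sees a JobManager series only when the JobManager was the last process to push under its key, which would match the series appearing and disappearing in the curl output. Giving each process its own group, for example by leaving metrics.reporter.promgateway.randomJobNameSuffix at its default of true, should then stop the pushes from replacing each other; that was not tested here.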
I have now solved the gap problem with a different setup: Prometheus scrapes the metrics directly from the JobManager/TaskManager.
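For reference, the Flink side of such a pull-based setup could look like the fragment below. It is a sketch, not the exact configuration used here: the reporter name `prom` is arbitrary, and 9249 is the reporter's default port. Prometheus still needs its own scrape configuration that targets this port on every JobManager and TaskManager pod, which is not shown.

```yaml
# flink-conf.yaml: expose metrics over HTTP so Prometheus can scrape them.
# Assumed example values; replaces the promgateway reporter settings.
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249
```

With this reporter each process serves its own metrics endpoint, so no process can overwrite the metrics of another.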