I am dealing with many short-lived runs of the same job (many instances of the same process per hour), which finish before Prometheus has time to scrape them. This is a valid use case for the Pushgateway.
My use case is that I want an error metric (currently a Gauge) that counts across all of these job runs.
As I understand it, pushing a new value for a metric overrides the previous one. And looking at the code of, for example, the Python client library, Gauge.inc() operates on the value held in the current process, which is reset for each job run, so it does not provide a total count.
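To make the problem concrete, here is a minimal sketch of the behaviour described above. It is a toy model in plain Python, not the real Pushgateway or client library code: `ToyPushgateway`, `run_job`, and the metric name `job_errors` are hypothetical names, and the only assumption is that a push replaces whatever was stored for the same grouping key.

```python
# Toy model only: stands in for the Pushgateway, does not use its real API.
class ToyPushgateway:
    """Keeps the most recent push per grouping key (assumed replace semantics)."""

    def __init__(self):
        self.groups = {}

    def push(self, grouping_key, metrics):
        # A new push overwrites what was stored for this grouping key.
        self.groups[grouping_key] = dict(metrics)


def run_job(failed):
    # Each run is a fresh process, so its in-memory value starts at 0.
    errors = 0
    if failed:
        errors += 1  # analogous to calling inc() once in a new process
    return {"job_errors": errors}


gateway = ToyPushgateway()
for failed in [True, True, False, True]:
    gateway.push("batch_job", run_job(failed))

last_value = gateway.groups["batch_job"]["job_errors"]
print(last_value)  # 1, although three of the four runs failed
```

The stored value only ever reflects the last run, which is why a per-process increment cannot act as a running total here.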
I see the following options to create a proper counter:
- add a job_instance tag and sum when creating dashboards/alerts. The issue I see is that the metrics are never cleared, so running many jobs/instances will blow up the cache.
  - to avoid blowing up the cache, send delete requests periodically. This feels like a major hack.
- query the metric upfront and increment it. Besides possible timing/concurrency and dependency issues, I did not find an endpoint exposing these values.
- use a different approach
What would be the proper way to create a counter that accumulates across multiple runs of the same process?
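As a rough illustration of the first option and its drawback, the sketch below uses a plain dict in place of the gateway's stored groups. Everything in it is hypothetical (the `push` and `delete` helpers, the `run-N` instance names); it only assumes that each distinct job_instance value is kept as a separate group until it is deleted.

```python
# Toy model: a dict stands in for the groups held by the gateway.
groups = {}


def push(job, job_instance, errors):
    # One stored group per (job, job_instance) pair.
    groups[(job, job_instance)] = errors


def delete(job, job_instance):
    # Stands in for a periodic delete request against the gateway.
    groups.pop((job, job_instance), None)


for i, failed in enumerate([True, True, False, True]):
    push("batch_job", f"run-{i}", 1 if failed else 0)

total_errors = sum(groups.values())  # what a sum over instances would report
stored_groups = len(groups)          # grows with every run
print(total_errors, stored_groups)   # 3 4

delete("batch_job", "run-0")
total_after_delete = sum(groups.values())
print(total_after_delete)            # 2: cleaning up also lowers the sum
```

In this model the sum is correct only while every group is retained, so the periodic cleanup trades unbounded growth for a total that can go down.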
Use prom-aggregation-gateway instead. It's tailor-made for this kind of use case. From the README: