SLA Calculation with PromQL

I have a time-series:

sum(ALERTS{alertname="IngestionStopped", alertstate="firing"} unless on(table) (ALERTS{alertname="MyAlert1",alertstate="firing"} OR ALERTS{alertname="MyAlert2",alertstate="firing"}) OR vector(0))

I do a sum because I have 1 TS for each partition of the table. I am interested even if a single partition has its ingestion stopped.

This TS is 0 when my service is working fine. If it is > 0, there is something wrong with the server. I want to calculate the percentage of time my service was not working fine (meaning this TS was > 0). How can I do that?

There is 1 answer

markalex

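For any query that continuously produces a value of 0 or 1, you can calculate its average over time by applying the avg_over_time function to a subquery. The average of such a series is the fraction of evaluated samples at which the value was 1. It looks like this: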

avg_over_time( (<your_query>) [range:resolution] )

Here range is the time range over which you want to calculate the average, and resolution is how often your query is evaluated within that range.
The resolution can be omitted (but the : must stay). In that case the global evaluation interval (evaluation_interval from the config, 1m by default) is used.
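As a small illustration of the subquery syntax, here is a sketch using the built-in up metric. The job label value "my-service" is a made-up placeholder, not something from the question.

```promql
# Fraction of the last 7 days during which the target was down,
# sampled every 5 minutes (about 2016 evaluations in total).
avg_over_time( (up{job="my-service"} == bool 0) [7d:5m] )

# Same range, resolution omitted: the colon stays and the
# global evaluation_interval is used as the step.
avg_over_time( (up{job="my-service"} == bool 0) [7d:] )
```

A smaller resolution gives more samples and a finer estimate, at the cost of a heavier query.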

Since your query produces values other than 0 and 1, and for this purpose any value above 0 should be treated as 1, it can be modified by adding > bool 0. This uses a boolean comparison, which returns 1 when the condition holds and 0 otherwise.
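To see why the bool modifier matters here, compare the two forms below. This is a simplified sketch that leaves out the unless clause from the question.

```promql
# Filtering comparison: the sample is kept (with its original value, e.g. 3)
# only when the sum is above 0. When the sum is 0, nothing is returned.
sum(ALERTS{alertname="IngestionStopped", alertstate="firing"} or vector(0)) > 0

# Boolean comparison: a sample is always returned,
# 1 when the sum is above 0 and 0 otherwise.
sum(ALERTS{alertname="IngestionStopped", alertstate="firing"} or vector(0)) > bool 0
```

With the filtering form, the healthy periods would simply be missing from the subquery, so avg_over_time would average only the non-zero samples and never report a value below 1.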

So the final query would be:

avg_over_time(
  (
    sum(
      ALERTS{alertname="IngestionStopped", alertstate="firing"}
      unless on(table) (
        ALERTS{alertname="MyAlert1", alertstate="firing"}
        or ALERTS{alertname="MyAlert2", alertstate="firing"}
      )
      or vector(0)
    ) > bool 0
  )[30d:1m]
)
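The query returns a fraction between 0 and 1. Since the question asks for a percentage, one possible way to express it is to multiply by 100. The sketch below assumes the same alert names and the same 30d range as above.

```promql
# Percentage of the last 30 days during which the service was NOT working fine.
100 * avg_over_time(
  (
    sum(
      ALERTS{alertname="IngestionStopped", alertstate="firing"}
      unless on(table) (
        ALERTS{alertname="MyAlert1", alertstate="firing"}
        or ALERTS{alertname="MyAlert2", alertstate="firing"}
      )
      or vector(0)
    ) > bool 0
  )[30d:1m]
)
```

Wrapping the avg_over_time(...) call as 100 * (1 - avg_over_time(...)) would give the availability percentage instead. Either number is an estimate based on the sampled evaluation points, not an exact measure of elapsed time.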

Adjust the resolution to your situation, but remember that alert rules are evaluated (and the ALERTS metric subsequently updated) only once every evaluation_interval, so there is no need to go crazy low there.

A demo of a similar query can be seen here.