Alert on error rate exceeding threshold using Azure Insights and/or Analytics

589 views Asked by At

I'm sending customEvents to Azure Application Insights that look like this:

timestamp                 |  name          |  customDimensions
----------------------------------------------------------------------------
2017-06-22T14:10:07.391Z  |  StatusChange  |  {"Status":"3000","Id":"49315"}
2017-06-22T14:10:14.699Z  |  StatusChange  |  {"Status":"3000","Id":"49315"}
2017-06-22T14:10:15.716Z  |  StatusChange  |  {"Status":"2000","Id":"49315"}
2017-06-22T14:10:21.164Z  |  StatusChange  |  {"Status":"1000","Id":"41986"}
2017-06-22T14:10:24.994Z  |  StatusChange  |  {"Status":"3000","Id":"41986"}
2017-06-22T14:10:25.604Z  |  StatusChange  |  {"Status":"2000","Id":"41986"}
2017-06-22T14:10:29.964Z  |  StatusChange  |  {"Status":"3000","Id":"54234"}
2017-06-22T14:10:35.192Z  |  StatusChange  |  {"Status":"2000","Id":"54234"}
2017-06-22T14:10:35.809Z  |  StatusChange  |  {"Status":"3000","Id":"54234"}
2017-06-22T14:10:39.22Z   |  StatusChange  |  {"Status":"1000","Id":"74458"}

Assuming that status 3000 is an error status, I'd like to get an alert when a certain percentage of Ids end up in the error status during the past hour.

As far as I know, Insights cannot do this by default, so I would like to try the approach described here to write an Analytics query that could trigger the alert. This is the best I've been able to come up with:

customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize failures = sum(isError), successes = sum(1 - isError) by timestamp bin = 1h
| extend ratio = todouble(failures) / todouble(failures+successes)
| extend failure_Percent = ratio * 100
| project iff(failure_Percent < 50, "PASSED", "FAILED")

However, for my alert to work properly, the query should:

  1. Return "PASSED" even if there are no events within the hour (another alert will take care of the absence of events)
  2. Only take into account the final status of each Id within the hour.

As the request is written, if there are no events, the query returns neither "PASSED" nor "FAILED".

It also takes into account any records with Status == 3000, which means that the example above would return "FAILED" (5 out of 10 records have Status 3000), while in reality only 1 out of 4 Ids ended up in error state.

Can someone help me figure out the correct query?

(And optional secondary questions: Has anyone setup a similar alert using Insights? Is this a correct approach?)

2

There are 2 answers

1
EranG On BEST ANSWER

As mentioned, since you're only querying on a singe hour your don't need to bin the timestamp, or use it as part of your aggregation at all.
To answer your questions:

  1. The way to overcome no data at all would be to inject a synthetic row into your table which will translate to a success result if no other result is found
  2. If you want your pass/fail criteria to be based on the final status for each ID, then you need to use argmax in your summarize - it will return the status corresponding to maximal timestamp.

So to wrap it all up:

customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize argmax(timestamp, isError) by tostring(customDimensions.Id) 
| summarize failures = sum(max_timestamp_isError), successes = sum(1 - max_timestamp_isError)
| extend ratio = todouble(failures) / todouble(failures+successes)
| extend failure_Percent = ratio * 100
| project Result = iff(failure_Percent < 50, "PASSED", "FAILED"), IsSynthetic = 0
| union (datatable(Result:string, IsSynthetic:long) ["PASSED", 1])
| top 1 by IsSynthetic asc 
| project Result 

Regarding the bonus question - you can setup alerting based on Analytics queries using Flow. See here for a related question/answer

0
John Gardner On

I'm presuming that the query returns no rows if you have no data in the hour, because the timestamp bin = 1h (aka bin(timestamp,1h)) doesn't return any bins?

but if you're only querying the last hour, i don't think you need the bin on timestamp at all?

without having your data it's hard to repro exactly but... you could try something like (beware syntax errors):

customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize totalCount = count(), failures = countif(isError == 1), successes = countif(isError ==0) 
| extend ratio = iff(totalCount == 0, 0, todouble(failures) / todouble(failures+successes))
| extend failure_Percent = ratio * 100
| project iff(failure_Percent < 50, "PASSED", "FAILED")

hypothetically, getting rid of the hour binning should just give you back a single row here of

totalCount = 0, failures = 0, successes = 0, so the math for failure percent should give you back 0 failure ratio, which should get you "PASSED".

without being to try it i'm not sure if that works or still returns you no row if there's no data?

for your second question, you could use something like

let maxTimestamp = toscalar(customEvents where timestamp > ago(1h)
| summarize max(timestamp));
customEvents | where timestamp == maxTimestamp ...
// ... more query here

to get just the row(s) that have that have a timestamp of the last event in the hour?