I have a pipeline that runs a job every 6 hours. If the job succeeds, it emits a metric value of 0; if the job fails, it emits 1. I have a CloudWatch (CW) alarm that is meant to notify us when the job has failed on 2 consecutive runs (at least that's the intention).
Alarm definition:
- Threshold: ${Metric} > 0 for 2 datapoints within 12 hours
- Period: 6hr
- Datapoints to alarm: 2 out of 2
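For reference, the definition above would look roughly like this if created through boto3. This is only a sketch: the alarm name, namespace, and metric name are hypothetical placeholders, and the `Maximum` statistic is an assumption, since the statistic is not part of the definition listed above.

```python
def build_alarm_params():
    """Return put_metric_alarm keyword arguments mirroring the definition above."""
    return {
        "AlarmName": "job-failed-twice",    # hypothetical name
        "Namespace": "MyPipeline",          # hypothetical namespace
        "MetricName": "JobFailure",         # hypothetical metric name
        "Statistic": "Maximum",             # ASSUMPTION: statistic not stated above
        "Period": 21600,                    # 6hr, in seconds
        "EvaluationPeriods": 2,             # the "out of 2" part
        "DatapointsToAlarm": 2,             # the "2 out of" part
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # metric > 0
    }


def create_alarm():
    """Create the alarm; needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the params can be inspected without boto3

    boto3.client("cloudwatch").put_metric_alarm(**build_alarm_params())


if __name__ == "__main__":
    params = build_alarm_params()
    hours = params["Period"] * params["EvaluationPeriods"] // 3600
    print(f"Nominal evaluation window: {hours} hours")
```

`TreatMissingData` is left unset in this sketch because the definition above does not mention it.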
Job run history (UTC):
| date | start - end | status |
|---|---|---|
| 01-09 | 00:14 - 01:08 | succeeded |
| 01-09 | 06:14 - 06:26 | failed |
| 01-09 | 12:14 - 13:07 | succeeded |
| 01-09 | 18:14 - 18:53 | failed |
| 01-10 | 00:14 - 00:51 | failed |
| 01-10 | 06:14 - 07:08 | succeeded |
Corresponding CW metric datapoints emitted
(the alarm evaluates with a period of 6hr, but I collected more precise timestamps by querying with a period of 1min):
- 01-09 01:08 0 (success)
- 01-09 06:26 1 (failure)
- 01-09 13:07 0 (success)
- 01-09 18:53 1 (failure)
- 01-10 00:51 1 (failure)
- 01-10 07:12 0 (success)
Alarm Behavior
Yesterday and today the alarm got triggered.
For the alarm triggered at 01-10 00:54, I can more or less explain it from the stateReasonData: at 01-10 00:54, CW checked the past 12hr of data (from 01-09 12:54 to 01-10 00:54). In this time window there are 2 failure datapoints (datapoint 1: 01-09 18:53 and datapoint 2: 01-10 00:51), so the alarm triggered.
"stateReasonData": {
  "version": "1.0",
  "queryDate": "2024-01-10T00:54:38.242+0000",
  "startDate": "2024-01-09T12:54:00.000+0000",
  "period": 21600,
  "recentDatapoints": [
    1,
    1
  ],
  "threshold": 0,
  "evaluatedDatapoints": [
    {
      "timestamp": "2024-01-09T18:54:00.000+0000",
      "sampleCount": 1,
      "value": 1
    },
    {
      "timestamp": "2024-01-09T12:54:00.000+0000",
      "sampleCount": 2,
      "value": 1
    }
  ]
}
But for the alarm triggered at 01-09 18:54, I can’t really explain the behavior:
- Q1: Why is CW looking back 18hr instead of 12hr in this case? (Note that startDate is 00:54 and queryDate is 18:54.)
- Q2: Why is the success datapoint reported as null? (Note recentDatapoints: [1, null, 1] below.)
"newState": {
  "stateValue": "ALARM",
  "stateReason": "Threshold Crossed: 2 out of the last 2 datapoints [1.0 (09/01/24 12:54:00), 1.0 (09/01/24 00:54:00)] were greater than the threshold (0.0) (minimum 2 datapoints for OK -> ALARM transition).",
  "stateReasonData": {
    "version": "1.0",
    "queryDate": "2024-01-09T18:54:38.250+0000",
    "startDate": "2024-01-09T00:54:00.000+0000",
    "period": 21600,
    "recentDatapoints": [
      1,
      null,
      1
    ],
    "threshold": 0,
    "evaluatedDatapoints": [
      {
        "timestamp": "2024-01-09T12:54:00.000+0000",
        "sampleCount": 2,
        "value": 1
      },
      {
        "timestamp": "2024-01-09T00:54:00.000+0000",
        "sampleCount": 2,
        "value": 1
      }
    ]
  }
}
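To probe both evaluations locally, here is a small sketch that groups the emitted datapoints into 6hr buckets starting at each `startDate` and reports the sample count and value per bucket. It rests on two assumptions that are not confirmed anywhere above: that the alarm statistic is `Maximum`, and that each bucket covers the half-open interval from its start to start + period. The function name `bucket_max` is just an illustrative helper.

```python
from datetime import datetime, timedelta

# (timestamp UTC, value) taken from the emitted-metric list,
# with dates cross-checked against the job run history table.
DATAPOINTS = [
    (datetime(2024, 1, 9, 1, 8), 0),
    (datetime(2024, 1, 9, 6, 26), 1),
    (datetime(2024, 1, 9, 13, 7), 0),
    (datetime(2024, 1, 9, 18, 53), 1),
    (datetime(2024, 1, 10, 0, 51), 1),
    (datetime(2024, 1, 10, 7, 12), 0),
]

PERIOD = timedelta(seconds=21600)  # "period": 21600 in stateReasonData


def bucket_max(start, n_periods, datapoints=DATAPOINTS, period=PERIOD):
    """Aggregate datapoints into n_periods consecutive buckets from `start`.

    Returns (bucket_start, sample_count, max_value) per bucket; max_value is
    None for an empty bucket. ASSUMPTION: Maximum statistic, [start, end) buckets.
    """
    rows = []
    for i in range(n_periods):
        lo = start + i * period
        hi = lo + period
        values = [v for ts, v in datapoints if lo <= ts < hi]
        rows.append((lo, len(values), max(values) if values else None))
    return rows


if __name__ == "__main__":
    evaluations = [
        ("queryDate 01-10 00:54", datetime(2024, 1, 9, 12, 54), 2),
        ("queryDate 01-09 18:54", datetime(2024, 1, 9, 0, 54), 3),
    ]
    for label, start, n_periods in evaluations:
        print(label)
        for bucket_start, count, value in bucket_max(start, n_periods):
            print(f"  {bucket_start:%m-%d %H:%M}  sampleCount={count}  value={value}")
```

Under those assumptions, the buckets starting at 01-09 00:54 and 01-09 12:54 each hold one success and one failure (sampleCount 2, value 1), and the bucket starting at 01-09 06:54 is empty. That lines up with the sampleCount values and the `[1, null, 1]` in the stateReasonData, but it is only a local reconstruction and says nothing about why the lookback was extended to 18hr.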