CloudWatch Alarm to catch consecutive failures

I have a pipeline that runs a job every 6hr. If the job succeeds, it emits a metric value of 0; if the job fails, it emits 1. I have a CloudWatch (CW) alarm that is meant to notify us when the job has failed on 2 consecutive runs (at least that's the intention).

Alarm definition:
  • Threshold: ${Metric} > 0 for 2 datapoints within 12 hours
  • Period: 6hr
  • Datapoints to alarm: 2 out of 2
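
For reference, the same settings can be written with the parameter names of the CloudWatch `PutMetricAlarm` API. This is only a sketch: `AlarmName`, `Namespace`, `MetricName`, and `Statistic` are hypothetical placeholders, since the question does not state them.

```python
# Sketch of the alarm settings using PutMetricAlarm parameter names.
# AlarmName, Namespace, MetricName and Statistic are hypothetical placeholders;
# the remaining values come from the alarm definition above.
alarm_params = {
    "AlarmName": "job-consecutive-failures",  # hypothetical
    "Namespace": "MyPipeline",                # hypothetical
    "MetricName": "JobFailed",                # hypothetical
    "Statistic": "Maximum",                   # assumption, not stated in the question
    "Period": 6 * 60 * 60,                    # 6hr, in seconds
    "EvaluationPeriods": 2,                   # the "out of 2" part
    "DatapointsToAlarm": 2,                   # the "2 out of" part
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",
}

# Nominal look-back window implied by these settings, in hours.
window_hours = alarm_params["Period"] * alarm_params["EvaluationPeriods"] / 3600
print(window_hours)  # 12.0
```

With boto3, a dict like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm_params)`.
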

Job run history (UTC):

date   start - end    status
01-09  00:14 - 01:08  succeeded
01-09  06:14 - 06:26  failed
01-09  12:14 - 13:07  succeeded
01-09  18:14 - 18:53  failed
01-10  00:14 - 00:51  failed
01-10  06:14 - 07:08  succeeded

Corresponding CW metric emitted (the alarm evaluates a 6hr period, but I collected more precise timestamps using a 1min period):


  • 01-09 01:08 0 (success)
  • 01-09 06:26 1 (failure)
  • 01-09 13:07 0 (success)
  • 01-09 18:53 1 (failure)
  • 01-10 00:51 1 (failure)
  • 01-10 07:12 0 (success)
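
To cross-check how these datapoints might fall into 6hr periods, here is a minimal sketch. It rests on two assumptions that are not confirmed anywhere in the alarm definition: the period boundaries sit at xx:54 (taken from the timestamps in the stateReasonData), and the alarm statistic is Maximum. The names `ANCHOR`, `bucket_start`, and `aggregate` are made up for the illustration.

```python
from datetime import datetime, timedelta

PERIOD = timedelta(hours=6)
# Assumption: period boundaries follow the xx:54 timestamps in stateReasonData.
ANCHOR = datetime(2024, 1, 9, 0, 54)

# (timestamp, value) pairs; timestamps follow the job run history above.
datapoints = [
    (datetime(2024, 1, 9, 1, 8), 0),
    (datetime(2024, 1, 9, 6, 26), 1),
    (datetime(2024, 1, 9, 13, 7), 0),
    (datetime(2024, 1, 9, 18, 53), 1),
    (datetime(2024, 1, 10, 0, 51), 1),
    (datetime(2024, 1, 10, 7, 12), 0),
]

def bucket_start(ts):
    """Return the start of the 6hr period that contains ts."""
    periods_since_anchor = (ts - ANCHOR) // PERIOD
    return ANCHOR + periods_since_anchor * PERIOD

def aggregate(points):
    """Group points per period and return {period_start: (max_value, sample_count)}."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(bucket_start(ts), []).append(value)
    # Assumption: the statistic is Maximum.
    return {start: (max(vals), len(vals)) for start, vals in sorted(buckets.items())}

for start, (value, count) in aggregate(datapoints).items():
    print(start.isoformat(), "value:", value, "sampleCount:", count)
```

Under these assumptions, the 00:54 and 12:54 periods on 01-09 each hold two samples with a maximum of 1, the 18:54 period holds one sample, and the 06:54 period holds none. That lines up with the sampleCount, value, and null entries in the stateReasonData, but it is a reconstruction, not a statement of how CW aggregates internally.
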
Alarm Behavior

Yesterday and today the alarm got triggered.

For the alarm triggered at 01-10 00:54, I can more or less explain it from the stateReasonData: at 01-10 00:54, CW evaluated the past 12hr of data (from 01-09 12:54 to 01-10 00:54). This window contains 2 failure datapoints (datapoint 1: 01-09 18:53 and datapoint 2: 01-10 00:51), so the alarm triggered.

      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2024-01-10T00:54:38.242+0000",
        "startDate": "2024-01-09T12:54:00.000+0000",
        "period": 21600,
        "recentDatapoints": [
          1,
          1
        ],
        "threshold": 0,
        "evaluatedDatapoints": [
          {
            "timestamp": "2024-01-09T18:54:00.000+0000",
            "sampleCount": 1,
            "value": 1
          },
          {
            "timestamp": "2024-01-09T12:54:00.000+0000",
            "sampleCount": 2,
            "value": 1
          }
        ]
      }

But for the alarm triggered at 01-09 18:54, I can’t really explain the behavior:

  • Q1: Why is CW looking back 18hr instead of 12hr in this case? (Note that startDate is 00:54 and queryDate is 18:54.)
  • Q2: Why is the success datapoint recognized as null? (Note recentDatapoints: [1, null, 1] below.)
    "newState": {
      "stateValue": "ALARM",
      "stateReason": "Threshold Crossed: 2 out of the last 2 datapoints [1.0 (09/01/24 12:54:00), 1.0 (09/01/24 00:54:00)] were greater than the threshold (0.0) (minimum 2 datapoints for OK -> ALARM transition).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2024-01-09T18:54:38.250+0000",
        "startDate": "2024-01-09T00:54:00.000+0000",
        "period": 21600,
        "recentDatapoints": [
          1,
          null,
          1
        ],
        "threshold": 0,
        "evaluatedDatapoints": [
          {
            "timestamp": "2024-01-09T12:54:00.000+0000",
            "sampleCount": 2,
            "value": 1
          },
          {
            "timestamp": "2024-01-09T00:54:00.000+0000",
            "sampleCount": 2,
            "value": 1
          }
        ]
      }
    }
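
One reading that would be consistent with the stateReason string ("2 out of the last 2 datapoints") is that periods without data are skipped and the alarm falls back to the most recent datapoints that do exist. The sketch below only illustrates that reading. It is a simplification, not CloudWatch's actual evaluation logic, and the function name `evaluate` and the newest-first ordering are assumptions.

```python
def evaluate(recent, threshold, datapoints_to_alarm, evaluation_periods):
    """Simplified M-out-of-N check that skips missing periods.

    recent: datapoint values, assumed newest first; None means no data
    in that period. This is an illustrative simplification only.
    """
    present = [v for v in recent if v is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    # Only the most recent existing datapoints are considered.
    window = present[:evaluation_periods]
    breaching = sum(1 for v in window if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# recentDatapoints from the 01-09 18:54 evaluation
print(evaluate([1, None, 1], threshold=0, datapoints_to_alarm=2, evaluation_periods=2))  # ALARM
```

In this simplified model, `[1, None, 1]` ends up in ALARM because the empty period is dropped and both remaining datapoints exceed the threshold, whereas `[1, 0, 1]` would stay OK.
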