AWS Elasticsearch publishing wrong total request metric

643 views Asked by At

We have an AWS Elasticsearch cluster setup. However, our Error rate alarm goes off at regular intervals. The way we are trying to calculate our error rate is:

((sum(4xx) + sum(5xx))/sum(ElasticsearchRequests)) * 100

However, if you look at the screenshot below, at 7:15 4xx was 4, however ElasticsearchRequests value is only 2. Based on the metrics info on AWS Elasticsearch documentation page, ElasticsearchRequests should be total number of requests, so it should clearly be greater than or equal to 4xx.

Can someone please help me understand in what I am doing wrong here?

enter image description here

2

There are 2 answers

0
Sean Canning On

I know this was posted a while back but I've additionally struggled with this issue and maybe I can add a few pointers.

First off, make sure your metrics are properly configured. For instance, some responses (4xx for example) take up to 5 minutes to register, while OpensearchRequests are refershed every minute. This makes for a very wonky graph that will definitely throw off your error rate. enter image description here

In the picture above, I send a request that returns 400 every 5 seconds, and send a response that returns 200 every 0.5 seconds. The period in this case is 1 minute. This makes it so on average it should be around a 10% error rate. As you can see by the green line, the requests sent are summed up every minute, whereas the the 4xx are summed up every 5 minute, and in between every minute they are 0, which makes for an error rate spike every 5 minutes (since the opensearch requests are not multiplied by 5).

In the next image, the period is set to 5 minutes. Notice how this time the error rate is around 10 percent.

enter image description here

When I look at your graph, I see metrics that look like they are based off of a different period.

The second pointer I may add is to make sure to account for when no data is coming in. The behavior the alarm has may vary based on your how you define the "treat missing data" parameter. In some cases, if no data comes in, your expression might make it so it stays in alarm when in fact there is only no new data coming in. Some metrics might return no value when no requests are made, while some may return 0. In the former case, you can use the FILL(metric, value) function to specify what to return when no value is returned. Experiment with what happens to your error rate if you divide by zero.

Hope this message helps clarify a bit.

0
Patrick Decat On

AWS definitions of these metrics are:

OpenSearchRequests (previously ElasticsearchRequests): The number of requests made to the OpenSearch cluster. Relevant statistics: Sum

2xx, 3xx, 4xx, 5xx: The number of requests to the domain that resulted in the given HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant statistics: Sum

Please note the different terms used for the subjects of the metrics: cluster vs domain

To my understanding, OpenSearchRequests only considers requests that actually reach the underlying OpenSearch/ElasticSearch cluster, so some the 4xx requests might not (e.g. 403 errors), hence the difference in metrics.

Also, AWS only recommends comparing 5xx to OpenSearchRequests:

5xx alarms >= 10% of OpenSearchRequests: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.