Our server serves a population of users, and each user may select one of the 3rd-party services we provide to communicate with.
Each 3rd-party service has a different (and growing) user population communicating with it through our system:
- Service (A) might have 30k users
- Service (B) has 5k
- Service (C) has 100k
We want to create an alert whenever any of these services is down (i.e., by monitoring 500 responses).
We send a metric from a central networking point in our code whenever a 500 occurs, and it includes the URL of the service as a tag.
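For illustration, emitting that metric could look something like the sketch below, using DogStatsD from the official `datadog` Python package; the metric name `third_party.response.error_500` and the tag key `service_url` are hypothetical placeholders, not the actual names used.

```python
# Minimal sketch of emitting a 500 counter tagged with the 3rd-party service URL.
# Metric name and tag key are placeholders (assumptions, not the real names).
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def report_500(service_url: str) -> None:
    # One count per 500 response, tagged with the service URL so a monitor
    # can group/split by service later.
    statsd.increment(
        "third_party.response.error_500",
        tags=[f"service_url:{service_url}"],
    )
```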
A couple of constraints:
We prefer to create just one monitor that catches everything and reports on each service individually (so if services A and B are down, we get 2 alerts). We don't want to create multiple monitors for the same purpose to watch different services (and perhaps combine them in a composite monitor), because the number of services we communicate with may grow in the future.
We don't want to explicitly set a threshold on the number of 500s for the single monitor we create, above which it alerts, because each service has a different user population size: 10 occurrences of 500 in 10 minutes for Service (C) (100k users) shouldn't be considered a service outage, whereas it might be for Service (B) (5k users).
I thought of using Outlier or Anomaly monitors, but we're trying to figure out the best configuration to avoid false positives. Switching the Outlier algorithm between DBSCAN and MAD sometimes yields nothing, and changing the tolerance yields false positives.
With DBSCAN and tolerance 3.0, the big spike is not detected.

Tolerances down to 1.0 detect nothing, while 0.5 detects everything, which likely includes false positives.

The same happens with the MAD algorithm; there's no tolerance value that catches only the correct spikes.
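For reference, an outlier monitor of the kind being tuned here could be expressed roughly as in the sketch below, created via the `datadog` Python package's monitor API; it reuses the placeholder metric and tag names from the earlier snippet and the DBSCAN/3.0 settings mentioned above, so treat it as an assumption-laden illustration rather than the actual configuration.

```python
# Sketch of an outlier monitor (not the actual monitor) using the "datadog"
# Python package. Metric name and tag key are hypothetical placeholders.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="query alert",
    name="3rd-party 500s - outlier check",
    # Flag any service whose 500 count diverges from the other services
    # over the last 10 minutes (DBSCAN, tolerance 3.0).
    query=(
        "avg(last_10m):outliers(sum:third_party.response.error_500{*} "
        "by {service_url}.as_count(), 'DBSCAN', 3.0) > 0"
    ),
    message="500s for {{service_url.name}} look like an outlier.",
)
```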
Any recommendations regarding the configuration above are welcome, or suggestions for a different kind of monitor altogether.
Use a Multi Alert monitor so that each service that meets the threshold alerts individually: https://docs.datadoghq.com/monitors/configuration/?tab=thresholdalert#alert-grouping
To deal with the different population sizes, set the threshold on the error count normalized by each service's user count rather than on the raw number of 500s.
Example:
Service A (100 users): 10 errors / 100 users = 0.1.
Service B (2000 users): 10 errors / 2000 users = 0.005.
So, if you set a threshold of >= 0.1, Service A would alert when there are 10 or more errors, and Service B would alert only when there are 200 or more errors.
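One way to express that as a single multi-alert monitor is to divide the 500 count by a per-service user-count metric and group by the service tag. Below is a sketch, assuming a hypothetical gauge `third_party.active_users` tagged with `service_url` alongside the placeholder 500 counter from earlier; the metric names and the 0.1 threshold are illustrative, not prescriptive.

```python
# Sketch of the multi-alert, normalized-rate approach described above.
# Assumes a hypothetical per-service user-count gauge "third_party.active_users"
# tagged with service_url, plus the placeholder 500 counter from earlier.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="query alert",
    name="3rd-party service down (500s per user)",
    # Errors per user over the last 10 minutes; the "by {service_url}" grouping
    # makes this a multi alert, so each service is evaluated independently.
    query=(
        "sum(last_10m):"
        "sum:third_party.response.error_500{*} by {service_url}.as_count() / "
        "avg:third_party.active_users{*} by {service_url} >= 0.1"
    ),
    message="High 500 rate for {{service_url.name}}.",
)
```

Because the query is grouped by `service_url`, each service alerts on its own, new services are picked up automatically as long as they emit the same metrics, and the 0.1 errors-per-user threshold stays fixed regardless of how each population grows.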