We have developed a backend system using NestJS that is currently hosted on Heroku. We are intermittently experiencing crashes accompanied by H12 Request timeout errors. These crashes occur randomly and are not tied to any specific part of our application.
Every function within our system executes queries and responds within 10-25ms, which we monitor using Sentry and Atatus. Despite this, we receive H12 Request timeout errors from Heroku, indicating that some requests are taking too long to complete. Our logging tools have been unable to pinpoint the source of the issue, as crashes seem to occur at random points in the application.
What We Have Tried
1. Logging and Monitoring: We have implemented detailed logging with Sentry and Atatus. However, neither tool has revealed the cause of the timeouts or crashes.
2. Review of Performance Metrics: We've analyzed memory usage and general performance metrics, all of which appear normal.
3. Investigation of Individual Functions: Given that the crashes appear random and are not isolated to any specific function, we have reviewed the code for potential inefficiencies or errors but found none that could explain the intermittent H12 errors.
Any advice on additional debugging strategies, monitoring approaches, or Heroku-specific configurations that could help us identify and resolve the root cause of these intermittent crashes would be greatly appreciated.
There are several other reasons why you might be getting an H12.
No healthchecks on individual dynos
Heroku doesn't perform health checks on individual dynos: the Heroku load balancers don't probe a user-configurable endpoint to decide whether a dyno is healthy. If one dyno is malfunctioning and unable to serve requests, it will still keep receiving traffic.
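Since Heroku won't probe dynos for you, one workaround is to expose a lightweight health endpoint and have an external uptime monitor poll it, so a wedged dyno at least becomes visible. A minimal NestJS sketch (the route and the database check are placeholders, not something from your app):

```typescript
import { Controller, Get, ServiceUnavailableException } from '@nestjs/common';

@Controller('health')
export class HealthController {
  @Get()
  async check() {
    // Replace with real dependency checks (DB ping, cache, etc.).
    const dbOk = await this.pingDatabase();
    if (!dbOk) {
      // A 503 lets the external monitor flag this dyno as unhealthy.
      throw new ServiceUnavailableException('database unreachable');
    }
    return { status: 'ok', uptime: process.uptime() };
  }

  // Hypothetical helper; swap in e.g. a `SELECT 1` against your database.
  private async pingDatabase(): Promise<boolean> {
    return true;
  }
}
```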
Routing requests before applications boot up
Because Heroku lacks health-check functionality, it also cannot tell whether an application has finished booting. Heroku considers an application booted as soon as its server is bound to the specified port, but a web app will often bind to its port and then continue loading required libraries, classes, etc. well after that.
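If slow boot is a suspect, make sure everything the app needs is ready before the HTTP server binds to its port, because Heroku starts routing as soon as the bind happens. A sketch of what that can look like in a NestJS main.ts (the warm-up step is an assumption about your setup):

```typescript
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Finish heavy initialization (DB connections, caches, config) before
  // binding. Once listen() resolves, Heroku considers the dyno up and
  // starts routing requests to it.
  await app.init();

  await app.listen(process.env.PORT ?? 3000);
}
bootstrap();
```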
Instrumentation sampling rate
Often APM tools don't record all traffic but sample requests, and the sampling rate may well be below 1%. If you have requests that are genuinely slow but extremely rare (think 0.01% of all traffic), then at a 1% sampling rate only about 1 in 100 of them will ever be captured, i.e. roughly 0.0001% of all requests show up as slow traces in the APM. Sentry specifically has a parameter called tracesSampleRate. It's 1 by default, but I have never had it set to 1 on a production server. Check whether your tool can be configured to explicitly capture outliers and sample them at 100%, or consider temporarily setting the sampling rate to 100% even though it's costly to do so.
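For Sentry in a Node/NestJS app, the quickest way to rule sampling out is to temporarily trace everything (a debugging-only setting, assuming the @sentry/node SDK):

```typescript
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // Temporary, for debugging only: trace every request so rare slow
  // outliers can't be dropped by the sampler. Revert once the offending
  // requests have been captured, since 100% tracing is costly.
  tracesSampleRate: 1.0,
});
```

The SDK also accepts a tracesSampler callback if you would rather keep 100% only for routes you suspect and a much lower rate for everything else.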
Faulty request breaking instrumentation
If a request results in a crash, for example (and yes, that is possible if there are bugs in packages with binary extensions), it will likely bring down the whole server along with its instrumentation. The request is left hanging, which is why you get an H12. I had an issue like that with Ruby several years ago. Sometimes crashes are caused by malicious requests from security scanners trying to exploit your app. Check your logs specifically for dyno restarts that coincide with H12 events.
In general, unless you know that customers are affected, you might want to ignore these. There are always outliers for various reasons, including intermittent network and client errors. As long as these are extremely rare, it might be safe to ignore them.
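To see whether a crashed process is in play, you can register last-resort handlers early in main.ts so at least something lands in `heroku logs` next to the dyno restart; note that a hard native crash (a segfault or abort inside a binary extension) can still kill the process without running them:

```typescript
// Register as early as possible in main.ts, before the app bootstraps.
process.on('uncaughtException', (err) => {
  // Written to stdout, so it appears in `heroku logs` alongside the
  // H12 and dyno-restart events it coincides with.
  console.error('uncaughtException:', err);
  process.exit(1); // exit cleanly and let Heroku restart the dyno
});

process.on('unhandledRejection', (reason) => {
  console.error('unhandledRejection:', reason);
});
```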