How to debug a Spring Boot application that fails to start


Spring lists SO as the only place to ask questions on their community page, which is why I ask this rather generic question here. It may not be the best fit for SO, but, according to Spring's community overview page, there's no other adequate place to ask such questions.

I have a Spring Boot application built on Spring Cloud Gateway (version 2) that also uses an embedded Hazelcast cluster. It runs as multiple instances, which communicate via Hazelcast. Everything works fine except under heavy load: if one instance fails then, restarting it is no longer possible.

When the instance is restarted while the cluster of instances is under heavy load, it starts creating and wiring beans up to some point, after which it does nothing Spring-related anymore. Past that point, Hazelcast-generated messages are still visible in the log (with root log level DEBUG), but nothing generated by Spring or by the application itself.

In order to restart that one instance that failed, I need to stop the load generation, wait some 10-15 minutes, then restart the failed instance. Then the new/restarted instance starts up rather quickly, with no problems at all.

The load consists of HTTP requests that get proxied to another application. It generates a lot of read accesses to Hazelcast's distributed storage, but very few writes.

My problem: I have no idea how to debug this. Since the HTTP endpoint never becomes available, I cannot query metrics or other actuator information.

So my question is: what tools or mechanisms can I use to debug this problem? That is, how can I find out exactly how the boot sequence differs when the other instances of the Hazelcast cluster are under heavy load, compared to when there is no load at all in the cluster? Once I have this information, the problem is narrowed down enough for me to investigate it further on my own.


There is 1 answer

user625488

I didn't find a way to debug the problem, but I had an idea of what might be causing it, tried it, and it turned out to be a fix.

My application was running as a Kubernetes deployment. A few beans inside the application relied on a usable Hazelcast CP subsystem during their initialization. Spring's bean initialization process is by necessity sequential and blocking, to account for inter-bean dependencies.

I hypothesized that under heavy load, for whatever reason, the initialization of those beans was blocking forever. As a first experiment, just to see whether that was the problem, I made that initialization code asynchronous, so that Spring could finish bean wiring even though the instance could not do useful work until the asynchronous part had finished too.
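The original initialization code is not shown here, so the following is only a minimal sketch of the general idea, using plain JDK classes. `AsyncInit` is a hypothetical helper name, not part of Spring or Hazelcast, and the supplier stands in for whatever call blocks on the CP subsystem.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/**
 * Hypothetical helper: wraps an initialization step that may block for a long
 * time. The constructor returns immediately, so whoever creates it (for
 * example a bean constructor) is not held up by the blocking work.
 */
final class AsyncInit<T> {
    private final CompletableFuture<T> future;

    AsyncInit(Supplier<T> blockingInit) {
        // Runs the supplier on another thread (ForkJoinPool.commonPool() by default).
        this.future = CompletableFuture.supplyAsync(blockingInit);
    }

    /** True once initialization has completed without an exception. */
    boolean isReady() {
        return future.isDone() && !future.isCompletedExceptionally();
    }

    /** Waits up to the given timeout; returns false on timeout, failure, or interrupt. */
    boolean awaitReady(long timeout, TimeUnit unit) {
        try {
            future.get(timeout, unit);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false;
        }
    }

    /** Returns the value if already available, otherwise the given fallback. */
    T getNow(T fallback) {
        return future.getNow(fallback);
    }
}
```

A bean could hold an `AsyncInit` field created in its constructor and check `isReady()` before serving work that needs the initialized value. How the instance should report its health while that check is still false is a separate decision that this sketch does not make.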

To my surprise, that fully fixed the problem. Spring finished bean wiring, the Hazelcast-dependent initialization also finished rather quickly when executed asynchronously, even under high load, and the instance became usable soon after being started.

I didn't have the time to dig deeper and find out what the precise failure mechanism was. What I believe might have been the problem is the interaction between Hazelcast and Kubernetes. Kubernetes-based discovery works through a Kubernetes service, and a pod/instance isn't added to the service until it becomes healthy. If a bean inside the application prevents initialization from completing, the instance is never added to the service, so discovery never finds the new/restarted instance. I don't know what effect this might have on the Hazelcast cluster's inner workings.
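As for the debugging question itself: a bean initializer that blocks forever would normally be visible in a thread dump of the stuck JVM, on the thread that runs bean creation (usually `main` in a Spring Boot start), and a thread dump does not need the HTTP endpoint to be up. The JDK tools `jstack <pid>` or `jcmd <pid> Thread.print` produce one from outside the process. The sketch below does the same from inside, using only the JDK; `StartupThreadDump`, `watch`, and the timeout value are hypothetical names and choices, not something the original setup used.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: prints all thread stacks if startup takes too long. */
final class StartupThreadDump {

    /** Returns the name, state, and stack trace of every live thread as text. */
    static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked monitor/synchronizer details to keep it cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /**
     * Starts a daemon thread that waits for the timeout and then, if the
     * 'started' flag is still false, writes a thread dump to stderr.
     */
    static Thread watch(long timeoutMillis, AtomicBoolean started) {
        Thread watchdog = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
            } catch (InterruptedException e) {
                return; // watchdog cancelled
            }
            if (!started.get()) {
                System.err.println(dumpAllThreads());
            }
        }, "startup-watchdog");
        watchdog.setDaemon(true); // never keeps the JVM alive on its own
        watchdog.start();
        return watchdog;
    }
}
```

In a Spring Boot `main` method, one would call `StartupThreadDump.watch(...)` before `SpringApplication.run(...)` and set the flag to `true` once `run` returns. If startup stalls, the dump in the log shows what the `main` thread is waiting on, which can then be compared with a dump taken during a start without load.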