The story started with our production app server running out of http-nio threads (500). Our research revealed that certain requests to the app server (Tomcat) executed an IMap.clear() call via the Hazelcast client against certain IMaps in our 5-node Hazelcast cluster (5.1.5), and the stack traces show that all of these threads are stuck in one of the two call stacks below:
Stack 1
http-nio-8080-exec-394 [206713] (WAITING)
jdk.internal.misc.Unsafe.park line: not available [native method]
java.util.concurrent.locks.LockSupport.park line: 323
com.hazelcast.spi.impl.AbstractInvocationFuture.manageParking line: 693
com.hazelcast.spi.impl.AbstractInvocationFuture.get line: 615
com.hazelcast.client.impl.spi.ClientProxy.invoke line: 215
com.hazelcast.client.impl.proxy.ClientMapProxy.clear line: 1665
Stack 2
http-nio-8080-exec-340 [177343] (WAITING)
jdk.internal.misc.Unsafe.park line: not available [native method]
java.util.concurrent.locks.LockSupport.park line: 323
com.hazelcast.spi.impl.AbstractInvocationFuture.manageParking line: 693
com.hazelcast.spi.impl.AbstractInvocationFuture.get line: 615
com.hazelcast.client.impl.spi.ClientProxy.invoke line: 215
com.hazelcast.client.impl.proxy.ClientMapProxy.isEmpty line: 1569
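For reference, the calls that end up parked in the stacks above go through the standard Hazelcast Java client API, roughly as in the minimal sketch below (the map name is hypothetical and explicit client configuration is omitted; our real setup is more involved):

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class ClearSketch {
    public static void main(String[] args) {
        // Cluster name and member addresses omitted in this sketch; the client
        // would pick them up from hazelcast-client.xml/yaml on the classpath.
        HazelcastInstance client = HazelcastClient.newHazelcastClient();

        IMap<String, Object> map = client.getMap("some.cache.map"); // hypothetical map name
        map.clear();                    // threads park in AbstractInvocationFuture.get() here (Stack 1)
        boolean empty = map.isEmpty();  // same parking behavior (Stack 2)
        System.out.println("empty=" + empty);

        client.shutdown();
    }
}
```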
Since our app code only calls clear() on maps matching a certain naming pattern, we initially thought those maps were the cause of the issue, so I wrote a small "MapPrint" program that can run from an app server or from my local machine (both showed exactly the same results). It connects to the same 5-node Hazelcast cluster, loops through all IMap objects, and invokes isEmpty() or size() on each of them (a minimal sketch of the loop is included below, after the observations). I was able to reproduce the same hanging behavior with this program. However, through repeated executions of MapPrint, we observed more interesting behaviors:
- The hanging behavior can happen not only to maps matching the pattern observed on the production app server, but to any other maps as well.
- The hanging behavior can happen to maps that are not empty as well as to maps that are empty.
- The hanging behavior occurred seemingly at random, but exhibited a consistent pattern:
-- When the Hazelcast client logged "Authenticated with server [ip]" for only 2 of the 5 nodes (for whatever reason the other 3 nodes were not "Authenticated"), the MapPrint program ALWAYS ran to completion successfully. In more than 10 such occurrences, one of the 2 "Authenticated" nodes was always the same fixed node (192.168.3.100), while the other node varied.
-- When the hazelcast client displayed "Authenticated with server [ip]" to all 5 nodes in logs, MapPrint program ALWAYS got stuck on some map. With the above observation, my suspicion is that this 5-node production hazelcast cluster is in a state that different nodes have different view of the same IMaps, and certain operations such as clear()/isEmpty()/size() resulted in an internal conflicting state and cannot return when all 5 nodes are involved. When only 2 nodes are involved, somehow these operations are able to go through.
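For completeness, the core of the MapPrint program is essentially the loop sketched below (simplified; error handling and our actual client configuration are omitted, and the class layout is just for illustration):

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.DistributedObject;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class MapPrint {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        try {
            // Iterate over every distributed object the cluster knows about
            // and probe the IMaps with size()/isEmpty().
            for (DistributedObject obj : client.getDistributedObjects()) {
                if (obj instanceof IMap) {
                    IMap<?, ?> map = (IMap<?, ?>) obj;
                    // Either of these calls can block indefinitely when all 5 members are "Authenticated".
                    System.out.println(map.getName()
                            + " size=" + map.size()
                            + " empty=" + map.isEmpty());
                }
            }
        } finally {
            client.shutdown();
        }
    }
}
```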
Some additional notes:
- We have a UAT 3-node Hazelcast cluster running the same version (5.1.5), accessed by UAT app servers running the same application version as production, with plenty of maps and many operations, and we do not see the same hanging behavior in that environment.
- We create many IMaps (10K in production and 6K in UAT). As part of troubleshooting, we removed (via destroy()) all empty maps from the production 5-node cluster, reducing the IMap count to 2K, and we could still see the hanging behavior with 2K IMaps.
- The IMaps initially identified as problematic are defined on the Hazelcast server side with "read-backup-data: true", while all other IMaps are defined with "read-backup-data: false". This should not matter, but I want to provide the information nevertheless.
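For illustration, the equivalent programmatic member configuration for that read-backup-data split would look roughly like the sketch below (the wildcard map name is hypothetical; our real definitions are declarative on the server side):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

public class MapConfigSketch {
    public static Config build() {
        Config config = new Config();
        // Maps matching the initially suspected naming pattern read from backup replicas.
        config.addMapConfig(new MapConfig("suspect.pattern.*").setReadBackupData(true));
        // All other maps keep read-backup-data disabled (which is also the default).
        config.addMapConfig(new MapConfig("default").setReadBackupData(false));
        return config;
    }
}
```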
At this point, our planned action is to perform a complete shutdown and restart of the entire 5-node production Hazelcast cluster, with the expectation that this will clear any bad cluster state and restore normal operations. However, we still don't know how we got into this situation and cannot be sure it will not happen again.
I am hoping someone has run into this scenario before or knows of some reason that could explain what we are experiencing and can offer some kind of solution. Any feedback will be appreciated.
Link to log files
https://github.com/hazelcast/hazelcast/files/14120293/diagnostics-client-0-1706724479018-000.log
https://github.com/hazelcast/hazelcast/files/14120297/thread.dump.txt