I'm running the Tanuki Wrapper (and have been for a very, very long time). In production it is working great, but over the past few weeks I have been getting reports that the wrapper process (the C code) is hung and won't die, which is causing production issues.
When I'm alerted and take a look, here is what I'm seeing:
1) The child java process was killed with SIGKILL/9 a few hours back
STATUS | wrapper | 2016/02/08 03:49:20 | JVM received a signal SIGKILL (9).
2) Then I see that a wrapper.sh stop was issued by my custom-built internal watcher process to reset it, but that enters an infinite loop in the script's stop function, shown here:
stopit() {
    [snip]
    kill $pid
    [snip]
        # MY NOTE It never gets out of this, the kill doesn't work
        # We can not predict how long it will take for the wrapper to
        #  actually stop as it depends on settings in wrapper.conf.
        #  Loop until it does.
        savepid=$pid
        CNT=0
        TOTCNT=0
        while [ "X$pid" != "X" ]
        do
            # Show a waiting message every 5 seconds.
            if [ "$CNT" -lt "5" ]
            then
                CNT=`expr $CNT + 1`
            else
                eval echo `gettext 'Waiting for $APP_LONG_NAME to exit...'`
                CNT=0
            fi
            TOTCNT=`expr $TOTCNT + 1`
            sleep 1
            testpid
        done
    [ SNIP ]
    fi
}
3) I then log onto the box, find the wrapper process PID (remember, the JVM is long dead), issue a direct kill $pid, and wait... nothing.
4) I finally give up and issue kill -9 $pid, which does kill it; everything then cleans up and comes back alive.
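The manual escalation in steps 3 and 4 could be automated in the watcher rather than waiting forever. The following is only a sketch, not part of the Wrapper's script: stop_with_timeout is a hypothetical helper, and the 30 second default is an arbitrary assumption.

```shell
# Hypothetical helper (not part of wrapper.sh): send SIGTERM, wait up to
# $2 seconds (default 30, an arbitrary choice), then fall back to SIGKILL.
# Returns 0 if the process exited on SIGTERM, 1 if SIGKILL was needed.
stop_with_timeout() {
    pid=$1
    timeout=${2:-30}
    # If the signal cannot be delivered, the process is already gone.
    kill "$pid" 2>/dev/null || return 0
    waited=0
    # kill -0 sends no signal; it only tests whether the PID still exists.
    while kill -0 "$pid" 2>/dev/null
    do
        if [ "$waited" -ge "$timeout" ]
        then
            echo "PID $pid ignored SIGTERM for ${timeout}s; sending SIGKILL" >&2
            kill -9 "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A non-zero return tells the watcher that the hang happened again, which is worth logging so the PID can be inspected before it is killed next time.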
QUESTIONS:
How do I troubleshoot an app where kill $pid (SIGTERM/15) does not work? This worked great for YEARS and still does on many other processes, but on just a few it is failing.
Of course, most of the questions and documentation on Tanuki are about how to manipulate/interrogate the child JVM, but I'm actually seeing a problem with what I assume is the C code, and I'm not sure how to interrogate the hung PID for the C code to make it give up its secrets. Maybe something in /proc/$pid can tell me what it is hung on?
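As a rough sketch of what /proc/$pid can show on Linux: the status file reports the process state and its signal masks, and wchan and stack (the latter usually root-only) hint at where it is sleeping in the kernel. inspect_pid is a hypothetical helper name, and which files are readable depends on the kernel and on permissions.

```shell
# Hypothetical helper: print what the Linux kernel reports about a process
# that does not react to SIGTERM.
inspect_pid() {
    pid=$1
    [ -d "/proc/$pid" ] || { echo "no such process: $pid" >&2; return 1; }
    # State (R, S, D, T, Z) plus pending, blocked, ignored and caught signal masks.
    grep -E '^(Name|State|SigPnd|ShdPnd|SigBlk|SigIgn|SigCgt):' "/proc/$pid/status"
    # Kernel function the process is sleeping in, if the kernel exposes it.
    echo "wchan: $(cat "/proc/$pid/wchan" 2>/dev/null)"
    # Kernel stack; usually readable only by root, so failures are ignored.
    cat "/proc/$pid/stack" 2>/dev/null || true
    # One line per thread: a single stuck thread can hang the whole process.
    for t in /proc/"$pid"/task/*
    do
        echo "thread $(basename "$t"): $(grep '^State:' "$t/status" 2>/dev/null)"
    done
}
```

The masks are hexadecimal bit fields in which signal n is bit n-1, so SIGTERM (15) is 0x4000. If that bit is set in SigBlk the signal is blocked, in SigIgn it is ignored, and in SigCgt a handler is installed, which would mean the handler is what fails to finish. A State of D (uninterruptible sleep) would point at I/O rather than signal handling. Attaching strace -p or gdb -p to the PID is another option if those tools are installed.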
Help me, Obi-Wan Kenobi, you're my only hope...
Leif from Tanuki Software
The most likely cause of the JVM being unexpectedly killed with a SIGKILL is that the OS is out of resources and killed the process. When this happens, Java is often the biggest user of memory so it gets nailed. Please check the syslog as there should be an entry at the same time if that is the cause.
Even if that happens, however, the Wrapper should handle this correctly and restart your JVM. It sounds like the Wrapper has gotten itself into an unexpected state and is not responding to normal signals itself. What version of the Wrapper are you using? I double-checked the release notes but don't think we have seen this exact problem before. http://wrapper.tanukisoftware.com/doc/english/release-notes.html
Please let me know what you find in your syslog at the time the JVM was killed.
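One way to run the syslog check described above, as a minimal sketch: find_oom_kills is a hypothetical helper, the default log path is an assumption (it is /var/log/messages on some distributions), and the patterns match the wording the Linux kernel typically uses when its OOM killer fires.

```shell
# Hypothetical helper: list kernel OOM-killer entries in a log file.
# The default path is an assumption; pass the right file for your system.
find_oom_kills() {
    logfile=${1:-/var/log/syslog}
    grep -i -E 'out of memory|oom-killer|killed process' "$logfile"
}
```

Compare the timestamps of any matches with the time of the SIGKILL line in wrapper.log (03:49:20 on 2016/02/08 in the example above). Piping dmesg through the same grep is an alternative if the kernel ring buffer still covers that time.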