Scenario
I have multiple Node.js scripts running under forever on Ubuntu. One of these files (start.js) imports a file that starts a ZMQ publisher by binding it to a specified port. When I start start.js under forever on its own, it binds and starts the publisher, and I am able to fetch the data it publishes through a ZMQ subscriber that connects to this port.
I am closing the publisher gracefully by checking for exit, SIGINT and SIGUSR events.
Whenever I restart this start.js file alone using forever restart, the publisher binds and starts successfully. It also works fine if I stop it manually (using forever stop) and start it again using forever start [it also works in the case where I manually stop (using forever stopall) and start all the forever scripts one by one].
NOTE: All the forever stop and restart commands are run with CLI option --killSignal=SIGINT.
Problem
But the publisher fails to bind when I run forever restartall --killSignal=SIGINT. It says that the address is already in use (I have checked this using netstat, and there is no TCP socket at that port). When I stop all the scripts and start them one by one, it binds normally and starts successfully.
I have checked that these kill signals are caught by the publisher script and that it is closing the publisher socket before exiting.
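The netstat check above can also be approximated from inside Node. The following is only a minimal sketch using the standard net module; the helper name isPortFree and the default host 127.0.0.1 are assumptions for illustration and are not part of the original scripts. It reports only whether a plain TCP listener could bind at that moment, which is not necessarily the same condition ZeroMQ hits internally.

```javascript
const net = require('net');

// Hypothetical helper: resolves true if a TCP listener can be bound to
// host:port right now, false if the bind fails with EADDRINUSE.
function isPortFree(port, host = '127.0.0.1') {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once('error', (err) => {
      if (err.code === 'EADDRINUSE') {
        resolve(false); // something still holds the address
      } else {
        reject(err); // any other failure is reported to the caller
      }
    });
    // Release the probe listener again before reporting success.
    probe.once('listening', () => probe.close(() => resolve(true)));
    probe.listen(port, host);
  });
}
```

Calling isPortFree(port).then(console.log) with the publisher's port, just before the bind, would show whether the port is still held at the instant the restarted script tries to bind.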
Failed Attempts:
Lowered the TIME_WAIT state of the tcp sockets.
Enabled reuse of TIME_WAIT sockets.
I thought that the TCP socket was taking time to be released from the TIME_WAIT state, so I tried to bind the publisher again 1000 ms after every failed bind, but the script fails every time it retries.
Tried restarting all the scripts with forever using the SIGINT and SIGUSR1 kill signals, and handled them in the script that binds the publisher socket.
This is how I am handling the SIG* events in the publisher:
process.stdin.resume();
function exitHandler(options, err) {
    if (options.cleanup) console.log('pub-clean');
    if (err) console.log("pub--" + err.stack);
    if (options.exit) {
        socket.close();
        console.log("Publisher Closed");
        process.exit();
    }
}
process.on('exit', exitHandler.bind(null, {cleanup: true}));
process.on('SIGINT', exitHandler.bind(null, {exit: true}));
process.on('uncaughtException', exitHandler.bind(null, {exit: true}));
process.on('SIGUSR2', exitHandler.bind(null, {exit: true}));
process.on('SIGTERM', exitHandler.bind(null, {exit: true}));
Why does restarting all the scripts with forever cause the publisher script to fail to bind?
What can be done to make the publisher script bind after a forever restart?
ZeroMQ-resources are recommended to be released in a controlled way
As discussed in the comments above, a truly graceful release of ZeroMQ resources is not done via system-level SIG*/*KILL, but by executing the ZeroMQ-recommended graceful-release steps. As posted so far, your code does not do that at all, and thus the ZeroMQ resources may, and most probably do, remain hanging ( at least the I/O-thread seems to ).
Check the ZeroMQ-socket settings used in your ( not yet posted ) setup ( the .setsockopt() calls used in the setup phase ) and add:
[1] .close() of all sockets set up ( be they used, or not )
[2] only after the .close() calls in [1] are sure and valid, the Context instance .term()
This is considered a guaranteed ZeroMQ-graceful-release of all ( internally handled ) resources.
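The ordering of steps [1] and [2] can be sketched as below. This is only an illustration under stated assumptions: releaseZmq and installSignalHandlers are hypothetical names, and sockets and context are duck-typed stand-ins (anything with a close() or term() method), not a particular Node binding's API. Whether your binding exposes a terminable context object at all is something to verify against its documentation.

```javascript
// Sketch only: `sockets` is an array of objects with close(), `context` is an
// object with term(). Both are assumed shapes, not a real binding's API.
function releaseZmq(sockets, context, log = console.log) {
  const failures = [];
  // Step [1]: close every socket that was set up, used or not.
  for (const socket of sockets) {
    try {
      socket.close();
    } catch (err) {
      failures.push(err);
    }
  }
  // Step [2]: terminate the context only if every close() succeeded.
  if (failures.length > 0) {
    log('release incomplete: ' + failures.length + ' socket(s) failed to close');
    return false;
  }
  context.term();
  return true;
}

// Hypothetical wiring: run the release steps on a kill signal, then exit
// with 0 on a clean release and 1 otherwise.
function installSignalHandlers(release, proc = process) {
  for (const signal of ['SIGINT', 'SIGTERM', 'SIGUSR2']) {
    proc.once(signal, () => proc.exit(release() ? 0 : 1));
  }
}
```

With a real publisher this might be wired as installSignalHandlers(() => releaseZmq([socket], context)), so that the process exits only after the release steps have run in order.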
On a sample code request:
A graceful release
On missing "in-built" controls
One may extend the architecture so as to contain one's own soft-signalling code for all the situations that need to be handled more softly than via SIGKILL et al. and