ECONNRESET on node.js RabbitMQ consumer in Azure

2.3k views Asked by At

I've been looking into this problem for a number of days now and it's had me totally stumped. We have a node.js based rabbitmq consumer that has been running on premise for over a year without any problems. Recently we deployed our app to Azure with the node.js component deployed to a window based PAAS worker role. We use the square mo amqp-lib (https://github.com/squaremo/amqp.node) as our client library from receiving messages from RabbitMQ. The role starts fine, processes requests without problem but periodically recycles.

Checking the WaHostBootstrapper logs in C:\resources on the deployed VM showed the following:

[00001180:00001548, 2015/06/09, 10:01:17.385, INFO ] Getting status from client RemoteAccessAgent.exe (2816).
[00001180:00001548, 2015/06/09, 10:01:17.385, INFO ] Client reported status 0.
[00001180:00001548, 2015/06/09, 10:01:17.385, INFO ] Getting status from client WaWorkerHost.exe (1380).
[00001180:00001548, 2015/06/09, 10:01:17.385, INFO ] Client reported status 3.
[00001180:00003288, 2015/06/09, 10:01:17.385, INFO ] Sending shutdown notification to client RemoteAccessAgent.exe (2816).
[00001180:00003288, 2015/06/09, 10:01:17.416, ERROR] <- CRuntimeClient::OnRoleShutdownCallback(0x0000000000331890) =0x800706be
[00001180:00003288, 2015/06/09, 10:01:17.416, INFO ] Sending shutdown notification to client WaWorkerHost.exe (1380).

There was nothing else of note in any of the other event logs on the VM. I modified the role model startup task to log all console output to a file, which showed no errors. In an attempt get more visibility on what was happening, I enabled NODE_DEBUG of the http and net modules with $ENV:NODE_DEBUG = "net,http" and then ran the startup task directly from powershell. After a period of time during which requests were handled perfectly, I got the following:

NET: 3720 destroy undefined
NET: 3720 destroy
NET: 3720 close
NET: 3720 close handle
NET: 3720 emit close
NET: 3720 afterWrite 0 { domain: null, bytes: 21, oncomplete: [Function: afterWrite] }
NET: 3720 afterWrite call cb
NET: 3720 afterWrite 0 { domain: null, bytes: 8, oncomplete: [Function: afterWrite] }
NET: 3720 afterWrite call cb
NET: 3720 onread ECONNRESET undefined undefined NaN
NET: 3720 error ECONNRESET
NET: 3720 destroy
NET: 3720 close
NET: 3720 close handle

The rabbitMQ server has the following log entry:

=WARNING REPORT==== 13-Mar-2015::17:48:39 ===
closing AMQP connection <0.7072.1> (137.116.194.234:1307 -> 10.140.42.79:5672):
connection_closed_abruptly

After seeing this, it was fairly clear to me that something between the consumer and the server, was closing the connection, resulting in the node process exiting (we were not listening to the error event from the AMQP connection) which in turn resulted in the role recycling.

I have managed to replicate this by using TCPView (sysInternals) to manually close the connection to the rabbit server.

As this only happens when deployed to Azure, I guess there must be some piece of Azure infrastructure which is closing this connection in a dirty way, resulting in the problem. But what?

1

There are 1 answers

0
Sam Shiles On BEST ANSWER

I think I've solved this! It was so tricky because of the lack of decent errors messages and the invisibility of the Azure load balancer. It turns out to be a very simple fix but one that isn't at all obvious (until you know about it, then it makes perfect sense).

Short Answer

The Azure load balancer is seeing an idle connection and killing it.

Fix

Enable heartbeats on the long running connection between the client and rabbitmq. You can do this easily by appending ?heartbeat=5 to the end of your server connection string:

amqp.connect('amqp://' + server + "?heartbeat=5", function(err, conn) {
...
}

Resources

Azure load balance background and timeouts: http://blogs.msdn.com/b/avkashchauhan/archive/2011/11/12/windows-azure-load-balancer-timeout-details.aspx

amqp-lib heartbeats: http://www.squaremobius.net/amqp.node/doc/channel_api.html