RabbitMQ node died behind an Nginx proxy

We have a RabbitMQ cluster that was set up by someone who is no longer here, and I do not know enough about it to fix what is wrong. The only reason I knew it was down was an error alert from the upstart services that use the queues.

This is all I know so far:

On the proxy server (Nginx 1.10.3, Ubuntu 16.04):

In /etc/nginx/nginx.conf, I have:

stream {

    # debug|info|notice|warn|error|crit|alert|emerg
    error_log  /var/log/nginx/stream_error.log info;

    server {
        listen 192.168.70.11:5672 so_keepalive=on;
        proxy_pass rabbitmq_backend;

    }

    upstream rabbitmq_backend {
        server services-01:5672;
        #server services-00:5672;   <--- Commented this out 
    }

}

On the server that runs the upstart services, one of the logs (for example, /var/log/upstart/<service-name>.log) shows:

2021-08-21 05:33:10NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:12NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:12NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible

On the RabbitMQ dashboard, I see this:

[Screenshot: RabbitMQ node not running]

At the top of the dashboard, I see:

[Screenshot: RabbitMQ network partition detected]

The two cluster nodes run in Docker containers on two separate servers (services-00 and services-01). Inside the container on the node that is reporting the problem, I ran rabbitmqctl cluster_status. It gives me:

root@rabbit-services-00:/# rabbitmqctl cluster_status
Cluster status of node 'rabbit@rabbit-services-00' ...
[{nodes,[{disc,['rabbit@rabbit-services-00','rabbit@rabbit-services-01']}]},
 {running_nodes,['rabbit@rabbit-services-00']},
 {cluster_name,<<"rabbit@rabbit-services-01">>},
 {partitions,[{'rabbit@rabbit-services-00',['rabbit@rabbit-services-01']}]},
 {alarms,[{'rabbit@rabbit-services-00',[]}]}]

But I am unsure how to interpret it, at least in the short time I have to get this running again. Any assistance would be greatly appreciated. I've been working on this since 1am, and I've run out of options. I thought commenting out the problem node in the Nginx config would help, but it has not. The upstart services keep dying every time I restart them.
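
For readers who are also unsure how to read this output, here is a minimal Python sketch that pulls the node lists out of the old Erlang-term format shown above. The helper names (section, summarize) are made up for illustration and are not part of RabbitMQ. The status text is copied from the question. A non-empty partitions entry means the reporting node has detected that it lost contact with the listed peer, and running_nodes only reflects what the reporting node can currently see.

```python
import re

# Output copied from the question (Erlang-term format printed by older rabbitmqctl).
STATUS = """\
[{nodes,[{disc,['rabbit@rabbit-services-00','rabbit@rabbit-services-01']}]},
 {running_nodes,['rabbit@rabbit-services-00']},
 {cluster_name,<<"rabbit@rabbit-services-01">>},
 {partitions,[{'rabbit@rabbit-services-00',['rabbit@rabbit-services-01']}]},
 {alarms,[{'rabbit@rabbit-services-00',[]}]}]
"""

# Node names appear as single-quoted atoms.
NODE = re.compile(r"'([^']+)'")


def section(text, key):
    """Return the raw text of the {key, ...} tuple, found by counting braces."""
    start = text.index("{" + key + ",")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError("unbalanced braces in section " + key)


def summarize(text):
    """Summarize which nodes are known, which are running, and whether a partition is reported."""
    known = set(NODE.findall(section(text, "nodes")))
    running = set(NODE.findall(section(text, "running_nodes")))
    partition_nodes = NODE.findall(section(text, "partitions"))
    return {
        "known": sorted(known),
        "running": sorted(running),
        "not_seen_running": sorted(known - running),
        "partitioned": len(partition_nodes) > 0,
    }


if __name__ == "__main__":
    for key, value in summarize(STATUS).items():
        print(key, value)
```

For the output in the question, this reports that rabbit@rabbit-services-00 sees only itself running and that a partition is recorded, which matches the banner on the dashboard.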

There is 1 answer

Answer by DevOpsSauce

I traced this to a switch blip (thank you, Nagios). I ran the following on the Docker host for services-00 to stop and restart the RabbitMQ application inside the container:

docker exec rabbit-services-00 rabbitmqctl stop_app

docker exec rabbit-services-00 rabbitmqctl start_app

Then I rechecked the queues, and they started working again. It appears it was just a short network outage.
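
The same sequence can be scripted. This is only a sketch, assuming the container is named rabbit-services-00 as in the commands above; the function names (rabbitmqctl, restart_plan, run_plan) are hypothetical helpers, and only the stop_app, start_app, and cluster_status subcommands come from RabbitMQ itself. It defaults to a dry run that prints the commands rather than executing them.

```python
import subprocess

# Container name taken from the commands in this answer; adjust for your host.
CONTAINER = "rabbit-services-00"


def rabbitmqctl(container, *args):
    """Build the docker exec command line for a single rabbitmqctl call."""
    return ["docker", "exec", container, "rabbitmqctl", *args]


def restart_plan(container):
    """Stop the RabbitMQ app, start it again, then print the cluster status."""
    return [
        rabbitmqctl(container, "stop_app"),
        rabbitmqctl(container, "start_app"),
        rabbitmqctl(container, "cluster_status"),
    ]


def run_plan(plan, dry_run=True):
    """Print each command, or execute it and stop at the first failure."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: shows the commands without touching the cluster.
    run_plan(restart_plan(CONTAINER))
```

After a real run, the partitions list in the cluster_status output should be empty and both nodes should appear under running_nodes. Keep in mind that a node restarted this way rejoins the cluster and takes its state from the other side of the partition, so it is worth choosing deliberately which node to restart.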