We have a RabbitMQ cluster that was set up by someone who is no longer here, and I do not know enough about it to fix what is wrong. The only reason I knew it was down was a programmatic error alert for the upstart services that use the queues.
This is all I know so far:
On the proxied server (Nginx 1.10.3, Ubuntu 16.04):
In /etc/nginx/nginx.conf, I have:
stream {
    # debug|info|notice|warn|error|crit|alert|emerg
    error_log /var/log/nginx/stream_error.log info;
    server {
        listen 192.168.70.11:5672 so_keepalive=on;
        proxy_pass rabbitmq_backend;
    }
    upstream rabbitmq_backend {
        server services-01:5672;
        #server services-00:5672; <--- Commented this out
    }
}
On the server running the upstart services, one of the logs (for example, /var/log/upstart/<service-name>.log) shows:
2021-08-21 05:33:10NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:33:11NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:12NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:12NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
2021-08-21 05:34:13NOT_FOUND - home node 'rabbit@rabbit-services-00' of durable queue 'cert_collect_info' in vhost 'certificate' is down or inaccessible
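To see at a glance which home node, queue, and vhost these errors point at, the log can be collapsed into counts. This is only a rough sketch: the function name summarize_home_nodes is made up for illustration, and it assumes the message wording is exactly as in the lines above.

```shell
# summarize_home_nodes (hypothetical helper): reads service log lines on stdin
# and prints each distinct "home-node queue vhost" combination with a count.
summarize_home_nodes() {
  # Pull out the three quoted names from each NOT_FOUND message;
  # lines that do not match the pattern are dropped (-n ... p).
  sed -n "s/.*home node '\([^']*\)' of durable queue '\([^']*\)' in vhost '\([^']*\)'.*/\1 \2 \3/p" |
    sort | uniq -c
}

# Example invocation (the path placeholder is from the post, fill in your own):
#   summarize_home_nodes < /var/log/upstart/<service-name>.log
```

If every line collapses to a single combination, as it would for the excerpt above, the errors all concern one queue whose home node is unreachable rather than a broader failure.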
On the RabbitMQ dashboard, I see this:
At the top of the dashboard, I see:
The two cluster nodes are in docker containers on two separate servers (services-00 and services-01). On the node that is reporting the problem, inside the container, I found the command rabbitmqctl cluster_status. It gives me:
root@rabbit-services-00:/# rabbitmqctl cluster_status
Cluster status of node 'rabbit@rabbit-services-00' ...
[{nodes,[{disc,['rabbit@rabbit-services-00','rabbit@rabbit-services-01']}]},
{running_nodes,['rabbit@rabbit-services-00']},
{cluster_name,<<"rabbit@rabbit-services-01">>},
{partitions,[{'rabbit@rabbit-services-00',['rabbit@rabbit-services-01']}]},
{alarms,[{'rabbit@rabbit-services-00',[]}]}]
But I am unsure how to interpret it, at least in the short time I have to get this running. Any assistance would be greatly appreciated. I've been working on this since 1am, and I've run out of options. I figured commenting out the problem node in the Nginx config would have helped, but it has not. The upstart services keep dying when I restart them.


I traced this to a switch blip (thank you, Nagios). I ran the following against the container:
docker exec rabbit-services-00 rabbitmqctl stop_app
docker exec rabbit-services-00 rabbitmqctl start_app
Then I rechecked the queues, and they started working again. It appears it was just a short network outage.
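In hindsight, the clue in the cluster_status output was the non-empty partitions entry: that list is normally empty, and a node showing up there means the nodes lost contact with each other. Below is a minimal sketch for checking this, assuming the older Erlang-term output format shown in the question (newer RabbitMQ releases print a different layout, so the pattern would need adjusting). The function name has_partition is hypothetical.

```shell
# has_partition (hypothetical helper): reads `rabbitmqctl cluster_status`
# output on stdin and succeeds (exit 0) if a partition is recorded.
has_partition() {
  # A healthy node prints "{partitions,[]}". Anything other than "]" right
  # after the opening "[" means at least one partition is listed.
  grep -q '{partitions,\[[^]]'
}

# Example invocation, using the container name from this post:
#   docker exec rabbit-services-00 rabbitmqctl cluster_status | has_partition \
#     && echo "partition detected"
```

Running the same check after the stop_app / start_app pair is a quick way to confirm the partitions list is empty again.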