I have a 16-node Cassandra cluster (3.11.9) with 3 seed nodes (.54, .115 and .164), replication factor 3 and gc_grace_seconds of 10 days (the default). Some nodes show as DN when viewed from certain nodes but as UN when viewed from others. For example, below is the nodetool status from the .54 and the .115 nodes:
while, for example, on .87 every node is UN. This has been happening for at least a couple of weeks. It started with two nodes, .54 and .147, showing each other as down. However, it seems to have spread, and now more and more nodes show DN on some nodes (but not on all). I should also add that there were no writes during these weeks.
I have tried disabling and re-enabling gossip and restarting Cassandra on all nodes. The generation stamp is up to date in the system_auth table. I can connect to these nodes with cqlsh but, as expected, in some cases I get NoHostAvailable because some of the data is located on the "dead" nodes.
nodetool describecluster shows the DN nodes as Unreachable, and which ones depends on the node I run it from. For example, .54 shows .164, .115, .147 and .19 as Unreachable.
Also, in nodetool gossipinfo everything looks OK, with status: normal and an up-to-date generation.
In the debug.log file I only get:
DEBUG [MessagingService-Outgoing-/192.168.100.147-Gossip] 2022-05-30 03:58:43,478 OutboundTcpConnection.java:546 - Unable to connect to /192.168.100.147
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_312]
at sun.nio.ch.Net.connect(Net.java:482) ~[na:1.8.0_312]
at sun.nio.ch.Net.connect(Net.java:474) ~[na:1.8.0_312]
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:647) ~[na:1.8.0_312]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:434) [apache-cassandra-3.11.9.jar:3.11.9]
at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:262) [apache-cassandra-3.11.9.jar:3.11.9]
In system.log, all the nodes were actually logged as "Node...has restarted, now UP" and "Node...state jump to Normal". However, I also noticed this, which may be unrelated:
WARN [GossipStage:1] 2022-05-30 07:22:06,164 Gossiper.java:1693 - \
Received an ack from /192.168.100.127, who isn't a seed.
Ensure your seed list includes a live node. Exiting shadow round
Is there any way to understand what is happening and why? Am I missing something?
Please let me know if you need any more information.
This looks like a classic networking issue to me where the nodes are unable to gossip with each other because there's no connectivity on the internode port (default is `7000`). The debug message you posted clearly states the cause: `java.net.NoRouteToHostException: No route to host`.
You need to check that there are no firewalls like `iptables` or `firewalld` blocking the traffic on port `7000`, otherwise the nodes can't talk to each other.
It is simple enough to test it using Linux tools such as `telnet` or `nc`. For example, from node `.54`, try connecting to port `7000` on a node that it reports as down.
If you get a "connection refused" error, it means that one of the following is true:
- traffic to port `7000` is blocked, or
- Cassandra is not listening on that port because of the `storage_port` setting in `cassandra.yaml`
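If `telnet` or `nc` is not installed on the nodes, the same connectivity check can be sketched in Python. This is only a minimal illustration, not part of Cassandra or its tooling: the helper names `check_internode_port` and `report` are hypothetical, port 7000 is the default mentioned above, and the 3-second timeout is an arbitrary assumption. A "no route" result corresponds to the `NoRouteToHostException` in the debug log, while "refused" means the host was reached but nothing accepted the connection on that port.

```python
import errno
import socket


def check_internode_port(host, port=7000, timeout=3.0):
    """Attempt a plain TCP connect to host:port and classify the outcome.

    Hypothetical helper for illustration; 7000 is Cassandra's default
    storage_port, and the timeout value is an assumption.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but nothing is accepting connections on this port.
        return "refused"
    except socket.timeout:
        # Packets are likely being silently dropped somewhere on the path.
        return "timeout"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # Same condition Java reports as NoRouteToHostException.
            return "no route"
        return f"error: {exc}"


def report(peers, port=7000):
    """Print one line per peer so results from different nodes can be compared."""
    for peer in peers:
        print(f"{peer}:{port} -> {check_internode_port(peer, port)}")
```

For example, calling `report(["192.168.100.147"])` on node `.54` tests the peer named in the debug log. Running the same check in the opposite direction, from `.147` towards `.54`, helps show whether the block is one-way, which would fit the inconsistent UN/DN views.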
But in my experience, the most likely cause is that traffic is blocked by a firewall. Cheers!