net_adm:ping failure very strange


Dear all,

I have an issue with an Erlang cluster. After the cluster had been running for a long time, one day I could no longer make any new connection to a specific node ([email protected]) in the cluster: net_adm:ping([email protected]) returns pang. Even opening a remote shell with:

erl -name [email protected] -setcookie MYCOOKIE -remsh [email protected]

fails too.

The strange thing is that [email protected] still works fine with the other nodes already in the cluster. The problem only happens when a new node joins the cluster and pings SickNode.

There is no firewall involved, because all the existing nodes work fine within the cluster. Has anybody run into this situation? Is Erlang not stable for clustering?

PS: I am using Erlang/OTP 20 on CentOS 6.8.

Many Thanks!!!


There is 1 answer

Brujo Benavides

Not a straight up answer, but a theory and a way to reproduce your issue. It's complicated because it involves multiple nodes, but let's see if you can follow me.

TL;DR: [email protected] changed its cookie after it was connected to the cluster.

So, this is what I did… First, on a terminal I started node1 with cookie x

$ erl -name node1 -setcookie x
([email protected])1> 

Then, on another terminal I started node2 with cookie x, connected it to node1 and changed its cookie to y

$ erl -name node2 -setcookie x
([email protected])1> net_adm:ping('[email protected]').
pong
([email protected])2> erlang:set_cookie(node(), 'y').
true
([email protected])3>

Then, in yet another terminal, I started node3 with cookie x and pinged node1 (which resulted in a connection attempt to node2 as well, as you will see below), and then explicitly tried to connect to node2

$ erl -name node3 -setcookie x
([email protected])1> net_adm:ping('[email protected]').
pong
([email protected])2>
=WARNING REPORT==== 21-Nov-2018::15:09:07 ===
global: '[email protected]' failed to connect to '[email protected]'

=ERROR REPORT==== 21-Nov-2018::15:09:26 ===
** Connection attempt from disallowed node '[email protected]' **
([email protected])2> net_adm:ping('[email protected]').
pang

What happened so far? Well, since node1's cookie was x and node3's cookie was x as well, they could connect. node2 was still connected to node1 but, since the cookie there was y, node3 could not connect to it.

Erlang tries to establish a fully connected mesh of nodes, so when you connect to one of them, it automatically tries to connect you to all the others.
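To see which parts of the mesh a given node can actually reach, a small helper along these lines could be compiled and run on that node. This is only a sketch: the module name mesh_check and its function names are made up for illustration, and it relies on nothing more than net_adm:ping/1.

```erlang
-module(mesh_check).
-export([ping_all/1, unreachable/1]).

%% Hypothetical helper, not part of OTP.
%% Ping every node in Nodes and pair each one with the result
%% (pong = reachable, pang = not reachable).
ping_all(Nodes) ->
    [{Node, net_adm:ping(Node)} || Node <- Nodes].

%% Return only the nodes that answered pang.
unreachable(Nodes) ->
    [Node || {Node, pang} <- ping_all(Nodes)].
```

Called from a newly joined node with the list of cluster members, mesh_check:unreachable/1 should single out the node or nodes that refuse the connection, which narrows down where to look for a cookie mismatch.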

But I wanted to be thorough, so I pinged node2 from node3 and, as expected, I got a pang. Also, these messages popped up on node2:

([email protected])3>
=ERROR REPORT==== 21-Nov-2018::15:09:07 ===
** Connection attempt from disallowed node '[email protected]' **

=WARNING REPORT==== 21-Nov-2018::15:09:07 ===
global: '[email protected]' failed to connect to '[email protected]'

And, of course, when I tried to ping node3 from node2

([email protected])3> net_adm:ping('[email protected]').
pang

But… if I try to ping node1

([email protected])4> net_adm:ping('[email protected]').
pong

That's because they are already connected, and Erlang only checks that the cookies match during the initial connection handshake.
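Since existing connections survive a cookie change, a node that is still connected to the suspect one can be used to ask it which cookie it currently holds. The following is a minimal sketch of that check, assuming the theory above is what happened; the module name cookie_check and its functions are hypothetical, while rpc:call/4 and erlang:get_cookie/0 are standard OTP calls.

```erlang
-module(cookie_check).
-export([remote_cookie/1, mismatched/1]).

%% Hypothetical helper, not part of OTP.
%% Ask Node for the cookie it currently uses for itself.
%% This only works from a node that still has a connection to Node.
remote_cookie(Node) ->
    rpc:call(Node, erlang, get_cookie, []).

%% Return {Node, Cookie} for every node whose cookie differs from ours.
%% A node that cannot be reached shows up with a {badrpc, Reason}
%% term in place of the cookie.
mismatched(Nodes) ->
    Mine = erlang:get_cookie(),
    [{Node, Cookie} || Node <- Nodes,
                       Cookie <- [remote_cookie(Node)],
                       Cookie =/= Mine].
```

In the experiment above, running cookie_check:mismatched(nodes()) on node1 would be expected to report node2 together with cookie y. If that is what you find on your cluster, setting the cookie back on that node with erlang:set_cookie/2 (the same call used to change it above) should let new nodes connect again.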

Finally, if I try to ping nodes from node1, I get the expected results…

([email protected])1> net_adm:ping('[email protected]').
pong
([email protected])2> net_adm:ping('[email protected]').
pong
([email protected])3>

Hope this helps.