Riak services not starting on Master server


The Riak and Stanchion services suddenly stopped working in production, with the following errors in the console log:

2023-11-07 03:41:14.547 [info] <0.160.0>@riak_core_app:stop:114 Stopped application riak_core.
2023-11-07 03:41:14.547 [error] <0.164.0> Supervisor riak_core_vnode_sup had child undefined started with riak_core_vnode:start_link() at undefined exit with reason killed in context shutdown_error
2023-11-07 03:41:14.547 [error] <0.164.0> Supervisor riak_core_vnode_sup had child undefined started with riak_core_vnode:start_link() at undefined exit with reason bad argument in call to ets:lookup(ets_riak_kv_entropy, {index,{riak_kv,1233142006497949337234359077604363797834693083136}}) in riak_kv_entropy_info:update_index_info/2 line 161 in context shutdown_error
2023-11-07 03:41:14.547 [error] <0.164.0> Supervisor riak_core_vnode_sup had child undefined started with riak_core_vnode:start_link() at undefined exit with reason bad argument in call to ets:lookup(ets_riak_kv_entropy, {index,{riak_kv,1187470080331358621040493926581979953470445191168}}) in riak_kv_entropy_info:update_index_info/2 line 161 in context shutdown_error

We have tried restarting the services and even deleting all files from the /../../ring folder and rejoining the cluster, but no luck.

Can someone please help us resolve this?



1 Answer

Nicholas Adams

That's an interesting question. As Stanchion is a standalone piece of software, it should still work provided the other nodes in your cluster remain functional. Incidentally, if you use Riak CS 3.1 or later (the latest version at the time of writing being Riak CS 3.2.2), Stanchion is auto-managed from within Riak CS, so, in the rare event of Stanchion going down (commonly due to hardware failure), Riak CS will notice this and spawn a new Stanchion instance on a different node.
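If you do move to an auto-managed setup, the relevant riak-cs.conf knob might look something like the sketch below. Note that the stanchion_hosting_mode name and the auto value are assumptions based on my reading of the 3.1 release notes, so verify them against the documentation for the version you actually install.

## /etc/riak-cs/riak-cs.conf (Riak CS 3.1+; setting name assumed, not verified here)
## "auto" lets Riak CS decide which node hosts Stanchion and respawn it elsewhere on failure
stanchion_hosting_mode = auto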

Regarding the error messages you are seeing, it looks like Riak KV is having issues with AAE (active anti-entropy). As the AAE hashtrees are routinely destroyed and rebuilt by Riak, it is safe to delete them; Riak will simply rebuild them later. If you go to your data directory (usually /var/lib/riak unless you specified a different location in /etc/riak/riak.conf), there should be a folder called "anti_entropy". You can safely delete the contents of all the folders inside it with the following:

cd /var/lib/riak  # adjust if your data directory is elsewhere
for i in $(ls anti_entropy); do rm -rf anti_entropy/"$i"/*; done

The above could also be done from within the anti_entropy directory but that would then risk deleting things in the wrong folder if there was a typo in the original "cd" command.

Once you have deleted the hashtrees, reboot the server to clear anything that might be left over in memory and try starting Riak KV again.
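As a rough sketch of that sequence (riak@192.168.1.10 is just a placeholder for your own node name):

# Stop Riak cleanly, then reboot the server
riak stop
sudo reboot

# After the server comes back up, start Riak and wait for the KV service to report ready
riak start
riak-admin wait-for-service riak_kv riak@192.168.1.10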

Given the limitations of Stack Overflow's single-answer format, rather than the luxury of our normal Riak support ticket system, I'm including the below as a last resort.

Should the above still fail, Riak, by default, stores 3 copies of all data (n_val=3). In a worst-case scenario, you could wipe this node completely, re-install it, re-add it to the cluster and then perform an all-partition repair on it. You could, optionally, set this up as a different node (i.e. a different IP and node name) and force-replace the now defunct node. That would be a cluster join, a force replace and then a cluster plan and commit, as sketched below.
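A sketch of that force-replace sequence with riak-admin, using placeholder node names (run the commands on the freshly installed replacement node):

# Join the new node to an existing cluster member
riak-admin cluster join riak@existing-node
# Stage the new node as a replacement for the defunct one
riak-admin cluster force-replace riak@old-node riak@new-node
# Review the staged changes, then apply them
riak-admin cluster plan
riak-admin cluster commit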

Commonly, a force replace would be followed by an all-partition repair on the node that joined as the cluster commit will only give the partition handles but not the actual content of the partitions. Normally, this is slowly copied over by AAE and read repair but that may take a while to populate. However, an all-partition repair will cause all the data to be populated far more quickly at the cost of a temporary increase in resource usage.
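For completeness, a sketch of an all-partition repair on the affected node, following the approach described in the Riak KV partition-repair documentation (the node name is a placeholder for your own):

# Attach to the running node's Erlang console
riak attach

# At the Erlang prompt, repair every partition this node owns
# (replace 'riak@192.168.1.10' with the node's actual name):
#   {ok, Ring} = riak_core_ring_manager:get_my_ring().
#   Partitions = [P || {P, 'riak@192.168.1.10'} <- riak_core_ring:all_owners(Ring)].
#   [riak_kv_vnode:repair(P) || P <- Partitions].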