We have a production HDP Hadoop cluster that includes two ResourceManager services in an active/standby HA configuration.
Some details about our cluster:
HDP version - 2.6.5
Linux OS version - 7.9
Number of NodeManager / DataNode machines - 487
From the RM logs, we saw that the ResourceManager has connectivity problems with the ZooKeeper servers.
After drilling down into the problem, we saw the following from the ZooKeeper CLI:
[zk: localhost:2181(CONNECTED) 19] ls /rmstore/ZKRMStateRoot/RMAppRoot
[application_1700036808155_0199, application_1700036808155_0198, application_1700036808155_0195, application_1700036808155_0194, application_1700036808155_0197, application_1700036808155_0196, application_1700036808155_0191, application_1700036808155_0190, application_1700036808155_0193, application_1700036808155_0192, application_1699507654048_0002, application_1699507654048_0001, application_1700036808155_0126, application_1700036808155_0125, application_1700036808155_0128, application_1700036808155_0127, application_1700036808155_0122, application_1700036808155_0121, application_1700036808155_0124, application_1700036808155_0123, application_1698104640063_4149, application_1698104640063_4147, application_1698104640063_4148, application_1700036808155_0129, application_1698104640063_4145, application_1698104640063_4146, application_1698104640063_4154, application_1698104640063_4155, application_1698104640063_4152, application_1698104640063_4153, application_1698104640063_4150, application_1698104640063_4151, application_1700036808155_0120, application_1700036808155_0115, application_1700036808155_0114, application_1700036808155_0117, application_1700036808155_0116, application_1700036808155_0111, application_1700036808155_0110, application_1700036808155_0113, application_1700036808155_0112, application_1698104640063_4158, application_1700036808155_0119, application_1698104640063_4159, application_1700036808155_0118, application_1698104640063_4156, application_1698104640063_4157, application_1698104640063_4165, application_1698104640063_4166, application_1698104640063_4163, application_1698104640063_4164, application_1698104640063_4161, application_1698104640063_4162, application_1698104640063_4160, application_1700036808155_0148, application_1700036808155_0147, application_1700036808155_0149, application_1700036808155_0144, application_1700036808155_0143, application_1700036808155_0146, application_1700036808155_0145, application_1698104640063_4129, application_1698104640063_4127, application_1698104640063_4128, application_1698104640063_4125, application_1698104640063_4126, application_1698104640063_4123, application_1698104640063_4124, application_1698104640063_4132, application_1698104640063_4133, application_1698104640063_4130, application_1698104640063_4131, application_1700036808155_0140, application_1700036808155_0142, application_1700036808155_0141, application_1700036808155_0137, application_1700036808155_0136, application_1700036808155_0139, application_1700036808155_0138, application_1700036808155_0133, application_1700036808155_0132, application_1700036808155_0135, application_1700036808155_0134, application_1698104640063_4138, application_1698104640063_4139, application_1698104640063_4136, application_1698104640063_4137, application_1698104640063_4134, application_1698104640063_4135, application_1698104640063_4143, application_1698104640063_4144, application_1698104640063_4141, application_1698104640063_4142, application_1698104640063_4140, application_1700036808155_0131, application_1700036808155_0130, .......
And when we ran stat on the same znode, we found:
[zk: localhost:2181(CONNECTED) 20] stat /rmstore/ZKRMStateRoot/RMAppRoot
cZxid = 0x10000006b
ctime = Mon Jan 18 20:03:47 UTC 2021
mZxid = 0x10000006b
mtime = Mon Jan 18 20:03:47 UTC 2021
pZxid = 0x44f00082a60
cversion = 1916163
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 10009 <== each of these znodes also has children for the app attempts
First question:
Given this bad situation, why doesn't ZooKeeper clean or purge the old data - for example by timestamp, or by the old application IDs?
My feeling is that when there is a huge amount of data under /rmstore/ZKRMStateRoot/RMAppRoot, the RM HA pair can't read the data under the RMAppRoot znode.
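If that feeling is right, one known failure mode would fit it: the response to listing a znode with many thousands of children can exceed ZooKeeper's jute.maxbuffer limit (just under 1 MB by default), after which the client disconnects with "Packet len ... is out of range" errors. We have not yet confirmed that exact message in our logs, so treat the following as a sketch only; the file names and the 4 MB value are assumptions for an HDP-style layout:

# on every ZooKeeper server (zookeeper-env.sh) - raise the buffer from ~1 MB to 4 MB
export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=4194304"

# on both ResourceManagers (yarn-env.sh) - the client side must be raised as well
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Djute.maxbuffer=4194304"

As far as I understand, the limit is enforced on each end of the connection, so both sides need the same (or a larger) value.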
Separately from that, I would appreciate ideas on how to clean up the old ZooKeeper data, or what to configure so that old data that is no longer in use is dropped/purged/deleted automatically.
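From what I understand, ZooKeeper itself never expires these znodes: the ResourceManager creates and trims them, and retention is governed by yarn.resourcemanager.max-completed-applications and yarn.resourcemanager.state-store.max-completed-applications. The latter defaults to the former, which is 10000 in Hadoop 2.x - suspiciously close to the numChildren = 10009 we see above. A sketch of what we are considering in yarn-site.xml; the value 1000 is only an assumption for our workload:

<property>
  <!-- how many completed applications the RM keeps in memory / shows in the UI -->
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>1000</value>
</property>
<property>
  <!-- how many completed applications the RM keeps in the ZK state store;
       defaults to the property above, so lower it explicitly or together -->
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>1000</value>
</property>

If this is right, after a restart the RM should trim the oldest completed applications from /rmstore/ZKRMStateRoot/RMAppRoot on its own.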
Second question:
What are the consequences if I delete all znodes under /rmstore/ZKRMStateRoot/RMAppRoot/? Is it safe to do this deletion without affecting YARN ResourceManager functionality?
[zk: localhost:2181(CONNECTED) 10] rmr /rmstore/ZKRMStateRoot/RMAppRoot/*
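For what it's worth, the ZooKeeper CLI does not expand wildcards, so the rmr command above would most likely fail with "Node does not exist" rather than delete anything. My understanding - please correct me if this is wrong - is that the supported way to wipe the store is to stop both ResourceManagers and format the state store with the yarn CLI; a sketch, assuming a standard HDP service layout:

# stop both ResourceManagers first (via Ambari or the service scripts)
# then, on one RM host, format the ZooKeeper-based RMStateStore:
yarn resourcemanager -format-state-store
# restart the ResourceManagers; they should come up with an empty store

The consequence I would expect is that applications that were still running are not recovered after the restart, and completed applications disappear from the RM UI - but otherwise the RM should function normally. Is that correct?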
Any pointers to other related documentation would also help.