YARN resource manager stores a huge number of znodes for running/old applications in ZooKeeper


We have an HDP production Hadoop cluster that includes two resource manager services (active/standby).

Some details about our cluster:

  1. HDP version - 2.6.5

  2. Linux OS version on the machines - 7.9

  3. Number of node manager / data node machines - 487

From the RM logs, we saw that the resource manager has connectivity problems with the ZooKeeper servers.

After drilling into this problem, we saw the following from the ZooKeeper CLI:

[zk: localhost:2181(CONNECTED) 19] ls /rmstore/ZKRMStateRoot/RMAppRoot

[application_1700036808155_0199, application_1700036808155_0198, application_1700036808155_0195, application_1700036808155_0194, application_1700036808155_0197, application_1700036808155_0196, application_1700036808155_0191, application_1700036808155_0190, application_1700036808155_0193, application_1700036808155_0192, application_1699507654048_0002, application_1699507654048_0001, application_1700036808155_0126, application_1700036808155_0125, application_1700036808155_0128, application_1700036808155_0127, application_1700036808155_0122, application_1700036808155_0121, application_1700036808155_0124, application_1700036808155_0123, application_1698104640063_4149, application_1698104640063_4147, application_1698104640063_4148, application_1700036808155_0129, application_1698104640063_4145, application_1698104640063_4146, application_1698104640063_4154, application_1698104640063_4155, application_1698104640063_4152, application_1698104640063_4153, application_1698104640063_4150, application_1698104640063_4151, application_1700036808155_0120, application_1700036808155_0115, application_1700036808155_0114, application_1700036808155_0117, application_1700036808155_0116, application_1700036808155_0111, application_1700036808155_0110, application_1700036808155_0113, application_1700036808155_0112, application_1698104640063_4158, application_1700036808155_0119, application_1698104640063_4159, application_1700036808155_0118, application_1698104640063_4156, application_1698104640063_4157, application_1698104640063_4165, application_1698104640063_4166, application_1698104640063_4163, application_1698104640063_4164, application_1698104640063_4161, application_1698104640063_4162, application_1698104640063_4160, application_1700036808155_0148, application_1700036808155_0147, application_1700036808155_0149, application_1700036808155_0144, application_1700036808155_0143, application_1700036808155_0146, application_1700036808155_0145, application_1698104640063_4129, application_1698104640063_4127, application_1698104640063_4128, application_1698104640063_4125, application_1698104640063_4126, application_1698104640063_4123, application_1698104640063_4124, application_1698104640063_4132, application_1698104640063_4133, application_1698104640063_4130, application_1698104640063_4131, application_1700036808155_0140, application_1700036808155_0142, application_1700036808155_0141, application_1700036808155_0137, application_1700036808155_0136, application_1700036808155_0139, application_1700036808155_0138, application_1700036808155_0133, application_1700036808155_0132, application_1700036808155_0135, application_1700036808155_0134, application_1698104640063_4138, application_1698104640063_4139, application_1698104640063_4136, application_1698104640063_4137, application_1698104640063_4134, application_1698104640063_4135, application_1698104640063_4143, application_1698104640063_4144, application_1698104640063_4141, application_1698104640063_4142, application_1698104640063_4140, application_1700036808155_0131, application_1700036808155_0130, .......

And when we ran stat in ZooKeeper, we found:

[zk: localhost:2181(CONNECTED) 20] stat /rmstore/ZKRMStateRoot/RMAppRoot
cZxid = 0x10000006b
ctime = Mon Jan 18 20:03:47 UTC 2021
mZxid = 0x10000006b
mtime = Mon Jan 18 20:03:47 UTC 2021
pZxid = 0x44f00082a60
cversion = 1916163
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 10009  <== each of these znodes also has children for its app attempts
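For scale, the first number in each application ID (for example, 1700036808155 in application_1700036808155_0199) is the start time of the RM that accepted the application, so grouping the children by that prefix shows how many old RM epochs are still retained. A sketch of counting them, assuming the HDP client path /usr/hdp/current/zookeeper-client/bin/zkCli.sh:

# count RMAppRoot children per RM start epoch (the first number in the app ID)
echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" \
  | /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server localhost:2181 2>/dev/null \
  | tr ',' '\n' | grep -o 'application_[0-9]*' | sort | uniq -c | sort -rn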

First question:

Given this bad situation, why does ZooKeeper (or the resource manager) not clean or purge the old data, for example by timestamp or by old application ID?

My feeling is that when there is this much data under /rmstore/ZKRMStateRoot/RMAppRoot, the RM high-availability pair cannot read the data under the RMAppRoot znode; as far as we understand, listing that many children in a single response can exceed ZooKeeper's jute.maxbuffer limit.
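If that is the failure mode, one stopgap we are considering is raising jute.maxbuffer. A sketch, assuming the property must be raised on both the ZooKeeper servers and the RM's ZooKeeper client, and that 4 MB (versus the ~1 MB default) is enough; the values are examples only:

# zookeeper-env.sh, on each ZooKeeper server
export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=4194304"

# yarn-env.sh, on each resource manager
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Djute.maxbuffer=4194304"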

We would appreciate ideas on how to clean the old ZooKeeper data, or what to set in the YARN/ZooKeeper configuration so that old data that is no longer in use is dropped/purged/deleted.
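From the Hadoop 2.7 documentation, the retention appears to be a YARN setting rather than a ZooKeeper one: the RM keeps up to yarn.resourcemanager.state-store.max-completed-applications completed applications in the state store, and that property defaults to yarn.resourcemanager.max-completed-applications, whose default is 10000 — which matches the numChildren = 10009 above (10000 completed plus a few running apps). A sketch of lowering both in yarn-site.xml (the values are examples, not recommendations, and the RMs must be restarted to pick them up):

<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>1000</value>
</property>
<property>
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>1000</value>
</property>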

Second question:

What are the consequences if I delete all znodes under /rmstore/ZKRMStateRoot/RMAppRoot/, and is it safe to do this deletion without affecting YARN resource manager functionality? For example:

[zk: localhost:2181(CONNECTED) 10] rmr /rmstore/ZKRMStateRoot/RMAppRoot/*
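As far as we know, zkCli's rmr does not expand the * wildcard, so the command above would not actually match the children. The supported way seems to be the RM's own format option; a sketch, assuming both resource managers are stopped first:

# 1. stop both resource managers (via Ambari or manually)
# 2. wipe the RM state store (this deletes /rmstore/ZKRMStateRoot, including RMAppRoot)
yarn resourcemanager -format-state-store
# or, to drop a single non-running application instead (if the option exists in your Hadoop version):
yarn resourcemanager -remove-application-from-state-store application_1698104640063_4123
# 3. start both resource managers

Our understanding is that this store only holds recovery state, so wiping it mainly means the RM cannot recover applications that were running at the time — is that correct?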

Maybe other related docs:

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/content/ref-c2ececdf-c68e-4095-99b5-15b4c31701ba.1.html

https://community.cloudera.com/t5/Support-Questions/How-To-Best-Resolve-RMStateStore-FENCED/td-p/96032

https://blog.csdn.net/qq_42264264/article/details/130827532
