YARN: automatically detecting slave failures

I haven't been able to find an answer to this anywhere.

I have a YARN cluster with several slaves. When a slave fails (chaos monkey, scale-down, etc.), the ResourceManager doesn't notice. Even running rmadmin -refreshNodes doesn't fix it: the ResourceManager keeps listing the failed nodes as RUNNING. How can I get the ResourceManager to check the slaves' health and remove them when they fail?

There is 1 answer

Ramzy On

Please see Hadoop: The Definitive Guide, Chapter 10, under Maintenance, "Commissioning and Decommissioning Nodes". It looks like the command above only tries to update the jobtracker. The book describes a more complete process, which also covers updating the namenode, verifying progress in the web UI, and removing the nodes from the include file and the slaves file.
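
As a rough illustration of that kind of decommissioning procedure on a YARN cluster, here is a minimal Python sketch. It is not taken from the book. The helper name `decommission` and its `dry_run` flag are made up for this example, and it assumes that a single exclude file is referenced by both `dfs.hosts.exclude` and `yarn.resourcemanager.nodes.exclude-path` in your configuration. It covers planned removal of nodes only, not automatic failure detection.

```python
import subprocess
from pathlib import Path


def decommission(hosts, exclude_file, dry_run=True):
    """Add hosts to the exclude file, then ask HDFS and YARN to re-read it.

    Hypothetical helper. Returns the commands that were (or would be) run.
    With dry_run=True (the default) only the exclude file is updated.
    """
    path = Path(exclude_file)
    merged = path.read_text().split() if path.exists() else []
    for host in hosts:
        if host not in merged:  # avoid duplicate entries
            merged.append(host)
    path.write_text("\n".join(merged) + "\n")

    commands = [
        ["hdfs", "dfsadmin", "-refreshNodes"],  # namenode re-reads its host lists
        ["yarn", "rmadmin", "-refreshNodes"],   # ResourceManager re-reads its host lists
        ["yarn", "node", "-list", "-all"],      # list node states to check progress
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if a command fails
    return commands
```

Calling `decommission(["slave3.example.com"], "/path/to/exclude", dry_run=False)` on a machine with the Hadoop client tools would update the file and run the three commands. Both the host name and the path are placeholders. Once the nodes show as decommissioned, they can be removed from the include file and the slaves file, as the answer describes.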