This is something I haven't found anywhere.
I have a YARN cluster with some slaves. When a slave fails (chaos monkey, scale-down, etc.), the ResourceManager doesn't notice. Even rmadmin -refreshNodes doesn't fix it: the ResourceManager keeps listing the failed nodes as RUNNING. How do I get the ResourceManager to check the slaves' health and remove them when they fail?
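For reference, the closest setting I could find is the liveness expiry interval in yarn-site.xml, which I assume controls how long the ResourceManager waits before declaring a node LOST (the default is 10 minutes; the value below is just an example):

    <!-- yarn-site.xml: how long the ResourceManager waits without a
         NodeManager heartbeat before marking a node LOST.
         Default is 600000 ms (10 min); 60000 ms here is an example. -->
    <property>
      <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
      <value>60000</value>
    </property>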
Please look at the Hadoop Definitive Guide, Chapter 10, Maintenance: Commissioning and Decommissioning Nodes. It looks like you are trying to update only the jobtracker (the ResourceManager, in YARN terms) with the above command. A more elaborate process is described there, which also involves updating the namenode, verifying progress in the web UI, and then removing the nodes from the include file and the slaves file.
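Roughly, the sequence from that chapter looks like this; the file paths and hostname below are examples, so adjust them to your layout:

    # 1. Make sure the masters point at an exclude file
    #    (hdfs-site.xml: dfs.hosts.exclude,
    #     yarn-site.xml: yarn.resourcemanager.nodes.exclude-path,
    #     both set to e.g. /etc/hadoop/conf/excludes)

    # 2. Add the dead/decommissioned slave to the exclude file
    echo "slave-03.example.com" >> /etc/hadoop/conf/excludes

    # 3. Tell the namenode and the ResourceManager to re-read the host lists
    hdfs dfsadmin -refreshNodes
    yarn rmadmin -refreshNodes

    # 4. Verify in the web UI that the node shows as decommissioned, then
    #    remove it from the include file and the slaves file and refresh again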