I'm using Apache Druid with a containerized deployment of HDFS in my testbed. After running stably for 5 days, one of the HDFS workers is reported as dead on the HDFS UI. Inside the container of this 'dead' worker, the process is still alive, but there are thousands of TCP connections in the CLOSE_WAIT state. I see that quite a few issues have been filed on the HDFS JIRA against different versions of HDFS.
HDFS version: 2.7.5.
Container ulimit: max of 1048576 open files.
Druid is the only component interfacing with HDFS. No custom code has been written that could be failing to call close().
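For reference, this is roughly how I'm counting the stuck sockets from inside the worker container (a minimal sketch; it reads /proc/net/tcp directly, where state 08 is CLOSE_WAIT, so it works even when ss/netstat aren't installed in the image):

```shell
#!/bin/sh
# Count TCP sockets in CLOSE_WAIT inside the container.
# CLOSE_WAIT means the peer closed its end but the local process
# never called close() on the socket, so the fd is still held open.
# In /proc/net/tcp, field 4 ("st") is the socket state; 08 == CLOSE_WAIT.
count=$(awk 'NR > 1 && $4 == "08"' /proc/net/tcp 2>/dev/null | wc -l)
echo "CLOSE_WAIT sockets: $count"
```

On the affected DataNode container this prints a number in the thousands, while healthy workers stay near zero.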
Has anyone seen a similar issue and worked around it?