Cannot make Hadoop HDFS data persist with Docker


I have a namenode and a datanode that are created with this docker-compose.yaml file:

version: "3"
services:
   namenode:
      image: apache/hadoop:3
      hostname: 192.168.105.139
      command: ["hdfs", "namenode"]
      # ports:
      #   - 8020:8020
      #   - 9000:9000
      #   - 9870:9870
      network_mode: host
      env_file:
        - ./config
      environment:
          ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
      volumes:
       - ./hadoop/name:/tmp/hadoop-hadoop/dfs/name
   datanode:
      image: apache/hadoop:3
      command: ["hdfs", "datanode"]
      hostname: 192.168.105.139
      # ports:
      #   - 9864:9864
      #   - 9866:9866
      network_mode: host
      env_file:
        - ./config
      volumes:
       - ./hadoop/data:/tmp/hadoop-hadoop/dfs/data
      depends_on:
       - namenode

And this config file (the ./config referenced by env_file):

HADOOP_HOME=/opt/hadoop
# CORE-SITE.XML_hadoop.tmp.dir=/opt/hadoop/data/
CORE-SITE.XML_fs.defaultFS=hdfs://192.168.105.139:9000
CORE-SITE.XML_hadoop.http.staticuser.user=hadoop
CORE-SITE.XML_hadoop_http_cross-origin_allowed-origins=*
CORE-SITE.XML_hadoop_http_cross-origin_allowed-methods=GET,POST,HEAD,DELETE,OPTIONS
CORE-SITE.XML_hadoop_http_cross-origin_allowed-headers=X-Requested-With,Content-Type,Accept,Origin
CORE-SITE.XML_hadoop_http_cross-origin_max-age=1800
CORE-SITE.XML_hadoop.http.cross-origin.enabled=true
# HDFS-SITE.XML_dfs.namenode.support.allow.format=false
# HDFS-SITE.XML_dfs.replication=1
# HDFS-SITE.XML_dfs.namenode.name.dir.restore=true
MAPRED-SITE.XML_mapreduce.framework.name=yarn
MAPRED-SITE.XML_yarn.app.mapreduce.am.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.map.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.reduce.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
YARN-SITE.XML_yarn.resourcemanager.hostname=resourcemanager
YARN-SITE.XML_yarn.nodemanager.pmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.delete.debug-delay-sec=600
YARN-SITE.XML_yarn.nodemanager.vmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.aux-services=mapreduce_shuffle
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-applications=10000
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-am-resource-percent=0.1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.queues=default
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.user-limit-factor=1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.maximum-capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.state=RUNNING
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_submit_applications=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_administer_queue=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.node-locality-delay=40
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings=
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings-override.enable=false

I want my data to persist, so I mount the data directories to the host machine as shown above. On the first start the VERSION files are created, containing the cluster ID. But when I take the cluster down and run it again, the datanode fails with this error:

java.io.IOException: Incompatible clusterIDs in /tmp/hadoop-hadoop/dfs/data: namenode clusterID = CID-846f2da4-7fad-40c7-891b-97ac6653031a; datanode clusterID = CID-c9ba5304-0c5b-4564-83fe-aaf6c2e3e019
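One way to see the mismatch is to compare the VERSION files in the mounted host directories (this assumes both daemons actually write into the paths mounted in the compose file above):

cat ./hadoop/name/current/VERSION   # clusterID the namenode generated on its last format
cat ./hadoop/data/current/VERSION   # clusterID the datanode registered with on first start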


There are 2 answers

Answer from OneCricketeer:

The namenode container will format itself and generate a unique ID each time it starts. You'd have to override the entrypoint script to prevent this.
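A minimal sketch of that idea, assuming the image's entrypoint simply execs whatever command is given and that the namenode metadata really lives under the mounted /tmp/hadoop-hadoop/dfs/name; the conditional format here is an illustration, not the image's documented behaviour:

   namenode:
      image: apache/hadoop:3
      # only format when no previous metadata exists, so the clusterID survives restarts
      command: ["bash", "-c", "if [ ! -d /tmp/hadoop-hadoop/dfs/name/current ]; then hdfs namenode -format; fi && hdfs namenode"]
      volumes:
       - ./hadoop/name:/tmp/hadoop-hadoop/dfs/name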

Otherwise, look into using Hadoop Ozone or MinIO images rather than HDFS for Hadoop-compatible persistence.
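For the MinIO route, persistence is just a plain volume on the MinIO data directory; a generic sketch, where the service name, ports and credentials are placeholders rather than anything from the question:

   minio:
      image: minio/minio
      command: ["server", "/data", "--console-address", ":9001"]
      environment:
         MINIO_ROOT_USER: minioadmin
         MINIO_ROOT_PASSWORD: minioadmin
      ports:
        - 9000:9000
        - 9001:9001
      volumes:
       - ./minio-data:/data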

Answer from Xaime Pardal:

I tried adding this parameter to the Hadoop configuration, and with it the data is persisted correctly in Docker:

HDFS-SITE.XML_dfs.clusterID=71ec2cce-3322-417d-85de-2b40ddb7a7ed

The cleaner solution may be to define these directories explicitly, but I could not get that to work because of permission errors:

HDFS-SITE.XML_dfs.namenode.name.dir=/root/hadoop/hdfs/namenode
HDFS-SITE.XML_dfs.datanode.data.dir=/root/hadoop/hdfs/datanode
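A sketch of how those directories and the matching mounts might be set up so the in-container user can write to them; the paths and the UID below are assumptions, not values from the image's documentation (the actual user can be checked with docker run --rm apache/hadoop:3 id):

# host side: create the bind-mount targets and hand them to the container user
mkdir -p ./hadoop/name ./hadoop/data
sudo chown -R 1000:1000 ./hadoop          # assumption: the hadoop user maps to UID 1000

# env file: point both daemons at explicit, mounted directories
HDFS-SITE.XML_dfs.namenode.name.dir=/opt/hadoop/dfs/name
HDFS-SITE.XML_dfs.datanode.data.dir=/opt/hadoop/dfs/data

# docker-compose.yaml: mount the host directories onto exactly those paths
#   namenode:   - ./hadoop/name:/opt/hadoop/dfs/name
#   datanode:   - ./hadoop/data:/opt/hadoop/dfs/data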