Launching a MapReduce job in Amazon Elastic MapReduce


I am trying to launch a MapReduce job on an Amazon Elastic MapReduce cluster. My job does some pre-processing before generating the map/reduce tasks, and this pre-processing requires third-party libraries such as JavaCV and OpenCV. Following Amazon's documentation, I added those libraries to HADOOP_CLASSPATH: a bootstrap action writes a line HADOOP_CLASSPATH= into hadoop-user-env.sh in /home/hadoop/conf/ on the master node. According to the documentation, the entries in this script should be included in hadoop-env.sh, so I assumed that HADOOP_CLASSPATH now contains my libraries.

However, when I launch the job, it still fails with a class-not-found exception pointing to a class in a jar that is supposed to be on the classpath. Can someone tell me where I am going wrong? By the way, I am using Hadoop 2.2.0. On my local infrastructure, I have a small bash script that exports HADOOP_CLASSPATH with all the libraries included and then calls hadoop jar -libjars .

2

There are 2 answers

0
Michael Langowski On BEST ANSWER

When your job is executed, the "controller" log file contains the command line that was actually executed. It could look something like this:

    2014-06-02T15:37:47.863Z INFO Fetching jar file.
    2014-06-02T15:37:54.943Z INFO Working dir /mnt/var/lib/hadoop/steps/13
    2014-06-02T15:37:54.944Z INFO Executing /usr/java/latest/bin/java -cp /home/hadoop/conf:/usr/java/latest/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-tools-1.0.3.jar:/home/hadoop/hadoop-core-1.0.3.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/13 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/13/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar <YOUR_JAR> <YOUR_ARGS>

The log is located on the master node in /mnt/var/lib/hadoop/steps/ and is easily accessible when you SSH into the master node (which requires specifying a key pair when creating the cluster).
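To see at a glance which directories and jars the step was actually launched with, the `-cp` argument can be pulled out of that "Executing" line and split into one entry per line. This is only a rough sketch: the function name `classpath_entries` is made up for illustration, and it assumes the log line has the same shape as the sample above (a single ` -cp <classpath> ` argument on a line containing `INFO Executing`).

```shell
#!/bin/bash
# Hypothetical helper: reads controller log text on stdin and prints each
# classpath entry of the "Executing ... java -cp <classpath> ..." line,
# one entry per line.
classpath_entries() {
  grep 'INFO Executing' \
    | sed -n 's/.* -cp \([^ ]*\) .*/\1/p' \
    | tr ':' '\n'
}
```

On the master node it would be fed the controller log of the step in question, for example `classpath_entries < controller` from inside that step's log directory. If `/home/hadoop/lib/*` shows up in the output, jars copied into that directory should be visible to the JVM that runs the job's main class.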

I've never really worked with what's in HADOOP_CLASSPATH, but if you define a bootstrap action to just copy your libraries into /home/hadoop/lib, that should solve the issue.

0
Bobby Carp On

I solved this with an AWS EMR bootstrap task to add a jar to the hadoop classpath:

  1. Uploaded my jar to S3
  2. Created a bootstrap script to copy the jar from S3 to the EMR instance and add the jar to the classpath:

    #!/bin/bash
    hadoop fs -copyToLocal s3://my-bucket/libthrift-0.9.2.jar /home/hadoop/lib/
    echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/home/hadoop/lib/libthrift-0.9.2.jar"' >> /home/hadoop/conf/hadoop-user-env.sh
    
  3. Saved that script as "add-jar-to-hadoop-classpath.sh" and uploaded it to S3.

  4. My "aws emr create-cluster" command adds the bootstrap script with this argument: --bootstrap-actions Path=s3://my-bucket/add-jar-to-hadoop-classpath.sh

When the EMR cluster spins up, the instance has the file /home/hadoop/conf/hadoop-user-env.sh created, and my MR job was able to instantiate the thrift classes in the jar.

UPDATE: I was able to instantiate thrift classes on the MASTER node, but not on the CORE node. I SSHed into the CORE node: the lib had been copied to /home/hadoop/lib and my HADOOP_CLASSPATH setting was there, but I was still getting a class-not-found error at runtime when the mapper tried to use thrift.
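For anyone debugging the same situation, one thing worth ruling out on the CORE node is that the jar in /home/hadoop/lib really contains the missing class. A minimal sketch, assuming `unzip` is installed on the node; the function name `find_class_in_jars` is hypothetical. It only confirms that the class file is inside a jar, not that the task JVM's classpath includes that jar.

```shell
#!/bin/bash
# Hypothetical helper: print every jar in a directory that contains the
# given fully qualified class name.
find_class_in_jars() {
  local dir="$1"
  # Turn e.g. org.apache.thrift.TException into org/apache/thrift/TException.class
  local class_path="${2//.//}.class"
  local jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue   # skip the unexpanded glob when no jars exist
    if unzip -l "$jar" 2>/dev/null | grep -qF "$class_path"; then
      echo "$jar"
    fi
  done
}
```

For example, `find_class_in_jars /home/hadoop/lib org.apache.thrift.TException` would list the jars that ship that class. If the jar is listed and the mapper still fails, the problem is with the classpath the task is started with rather than with the copy step.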

The solution ended up being to use the maven-shade-plugin to embed the thrift jar:

        <plugin>
            <!-- Use the maven shade plugin to embed the thrift classes in our jar.
              Couldn't get the HADOOP_CLASSPATH on AWS EMR to load these classes even
              with the jar copied to /home/hadoop/lib and the proper env var in
              /home/hadoop/conf/hadoop-user-env.sh -->
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <includes>
                                <include>org.apache.thrift:libthrift</include>
                            </includes>
                        </artifactSet>
                    </configuration>
                </execution>
            </executions>
        </plugin>