I am trying to launch a map reduce job in amazon map reduce cluster. My map reduce job does some pre-processing before generating map/reduce tasks. This pre-processing requires third party libs such as javacv, opencv. Following the amazon's documentation, I have included those libraries in HADOOP_CLASSPATH such that I have a line HADOOP_CLASSPATH= in hadoop-user-env.sh in the location /home/hadoop/conf/ of master node. According to the documentation, the entry in this script should be included in hadoop-env.sh. Hence, I assumed that HADOOP_CLASSPATH now has my libs in the classpath. I did this in bootstrap actions. However, when i launch the job, it still complains class not found exception pointing to a class in the jar which is supposed to be in the classpath. Can someone tell me where I am going wrong? bbtw, i am using hadoop 2.2.0. In my local infrastructure, i have a small bash script that exports HADOOP_CLASSPATH with all the libs included in it and calls hadoop jar -libjars .
Lauching a map reduce job in amazon elastic map reduce
513 views Asked by Bala AtThere are 2 answers
I solved this with an AWS EMR bootstrap task to add a jar to the hadoop classpath:
- Uploaded my jar to S3
Created a bootstrap script to copy the jar from S3 to the EMR instance and add the jar to the classpath:
#!/bin/bash hadoop fs -copyToLocal s3://my-bucket/libthrift-0.9.2.jar /home/hadoop/lib/ echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/home/hadoop/lib/libthrift-0.9.2.jar"' >> /home/hadoop/conf/hadoop-user-env.sh
Saved that script as "add-jar-to-hadoop-classpath.sh" and uploaded it to S3.
- My "aws emr create-cluster" command adds the bootstrap script with this argument: --bootstrap-actions Path=s3://my-bucket/add-jar-to-hadoop-classpath.sh
When the EMR spins up the instance will have the file /home/hadoop/conf/hadoop-user-env.sh created and my MR job was able to instantiate the thrift classes in the jar.
UPDATE : I was able to instantiate thrift classes from the MASTER node, but not from the CORE node. I sshed into the CORE node and the lib was properly copied to /home/hadoop/lib and my HADOOP_CLASSPATH setting was there, but I was still getting class not found at runtime when the mapper tried to use thrift.
Solution ended up being to the the maven-shade-plugin and embed the thrift jar:
<plugin>
<!-- Use the maven shade plugin to embed the thrift classes in our jar.
Couldn't get the HADOOP_CLASSPATH on AWS EMR to load these classes even
with the jar copied to /home/hadoop/lib and the proper env var in
/home/hadoop/conf/hadoop-user-env.sh -->
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<includes>
<include>org.apache.thrift:libthrift</include>
</includes>
</artifactSet>
</configuration>
</execution>
</executions>
</plugin>
When your job is executed, the "controller" logfile contains the actually executed commandline. This could look something like:
The log is located on the master node in /mnt/var/lib/hadoop/steps/ - it´s easily accessible when you SSH into the master node (requires specifying a key pair when creating the cluster).
I´ve never really worked with what´s in HADOOP_CLASSPATH, but if you define a bootstrap action to just copy your libraries into /home/hadoop/lib, that should solve the issue.