Launching a map reduce job in Amazon Elastic MapReduce

552 views · Asked by Bala · There are 2 answers

I am trying to launch a map reduce job on an Amazon Elastic MapReduce cluster. My map reduce job does some pre-processing before generating map/reduce tasks, and this pre-processing requires third-party libraries such as javacv and opencv. Following Amazon's documentation, I added those libraries to HADOOP_CLASSPATH by putting a HADOOP_CLASSPATH= line in hadoop-user-env.sh under /home/hadoop/conf/ on the master node. According to the documentation, the entries in this script get included in hadoop-env.sh, so I assumed HADOOP_CLASSPATH now had my libraries on the classpath. I did this in a bootstrap action. However, when I launch the job, it still fails with a class not found exception pointing to a class in a jar that is supposed to be on the classpath. Can someone tell me where I am going wrong? By the way, I am using Hadoop 2.2.0. On my local infrastructure, I have a small bash script that exports HADOOP_CLASSPATH with all the libraries included and then calls hadoop jar -libjars .
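To make the local setup concrete, that wrapper script is roughly along these lines (the jar names, paths, and main class below are placeholders, not my real ones):

    #!/bin/bash
    # Put the third-party jars on the client-side classpath for the pre-processing step
    export HADOOP_CLASSPATH=/opt/libs/javacv.jar:/opt/libs/opencv.jar:$HADOOP_CLASSPATH
    # Ship the same jars to the map/reduce tasks via -libjars (the main class uses ToolRunner)
    hadoop jar my-job.jar com.example.MyJob \
        -libjars /opt/libs/javacv.jar,/opt/libs/opencv.jar \
        input/ output/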
I solved this with an AWS EMR bootstrap action to add a jar to the hadoop classpath:
- Uploaded my jar to S3.
- Created a bootstrap script to copy the jar from S3 to the EMR instance and add it to the classpath:

    #!/bin/bash
    hadoop fs -copyToLocal s3://my-bucket/libthrift-0.9.2.jar /home/hadoop/lib/
    echo 'export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/home/hadoop/lib/libthrift-0.9.2.jar"' >> /home/hadoop/conf/hadoop-user-env.sh

- Saved that script as "add-jar-to-hadoop-classpath.sh" and uploaded it to S3.
- My "aws emr create-cluster" command adds the bootstrap script with this argument (a fuller sketch of the command follows below): --bootstrap-actions Path=s3://my-bucket/add-jar-to-hadoop-classpath.sh
 
When EMR spins up, the instances have the file /home/hadoop/conf/hadoop-user-env.sh created, and my MR job was able to instantiate the thrift classes in the jar.
UPDATE: I was able to instantiate thrift classes from the MASTER node, but not from the CORE node. I sshed into the CORE node; the lib was properly copied to /home/hadoop/lib and my HADOOP_CLASSPATH setting was there, but I was still getting a class not found error at runtime when the mapper tried to use thrift.
The solution ended up being to use the maven-shade-plugin and embed the thrift jar:
        <plugin>
            <!-- Use the maven shade plugin to embed the thrift classes in our jar.
              Couldn't get the HADOOP_CLASSPATH on AWS EMR to load these classes even
              with the jar copied to /home/hadoop/lib and the proper env var in
              /home/hadoop/conf/hadoop-user-env.sh -->
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <includes>
                                <include>org.apache.thrift:libthrift</include>
                            </includes>
                        </artifactSet>
                    </configuration>
                </execution>
            </executions>
        </plugin>
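With this in place, running mvn package produces a shaded jar that already bundles the thrift classes, so the mappers no longer depend on HADOOP_CLASSPATH being set correctly on the CORE nodes.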
                        
When your job is executed, the "controller" logfile contains the command line that was actually executed, so you can check what your job was really launched with.
The log is located on the master node in /mnt/var/lib/hadoop/steps/ - it's easily accessible when you SSH into the master node (requires specifying a key pair when creating the cluster).
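For example, assuming a key pair named my-key and the master's public DNS from the console (both placeholders):

    ssh -i ~/my-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
    ls /mnt/var/lib/hadoop/steps/
    less /mnt/var/lib/hadoop/steps/<step-id>/controller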
I've never really worked with what's in HADOOP_CLASSPATH, but if you define a bootstrap action to just copy your libraries into /home/hadoop/lib, that should solve the issue.
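A minimal sketch of such a bootstrap action, assuming the javacv/opencv jars sit in an S3 bucket you control (bucket and jar names are placeholders):

    #!/bin/bash
    # Bootstrap action: runs on every node while the cluster is starting up.
    # Copy the third-party jars from S3 into /home/hadoop/lib.
    hadoop fs -copyToLocal s3://my-bucket/lib/javacv.jar /home/hadoop/lib/
    hadoop fs -copyToLocal s3://my-bucket/lib/opencv.jar /home/hadoop/lib/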