I have developed a MapReduce program using Apache Hadoop 1.2.1. I did the initial development in the Eclipse IDE, simulating the Hadoop distributed computing environment with all of the input and output files coming from my local file system, and the program executes in Eclipse with no issues. I then create a JAR file using Eclipse, attempt to run it on my cluster-of-one Hadoop machine, and receive errors.
Here's my code to set up and run the Hadoop job:
String outputPath = "/output";
String hadoopInstructionsPath = args[0];
Job job = new Job();
job.setJarByClass(Main.class); //setJarByClass is here but not found apparently?!?
job.setJobName("KLSH");
FileInputFormat.addInputPath(job, new Path(hadoopInstructionsPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
job.setMapperClass(KLSHMapper.class);
job.setReducerClass(KLSHReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// submit the job and block until it finishes, then report success or failure
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
I then create the JAR file to run on the cluster using Eclipse's File -> Export -> Runnable JAR file.
The command I use to run the job is as follows (KLSH.jar is the name of the JAR file, /hadoopInstructions is the args[0] input parameter, and imageFeatures.Main specifies the main class):
./hadoop jar ./KLSH.jar /hadoopInstructions imageFeatures.Main/
This produces the following output:
14/11/12 11:11:48 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/11/12 11:11:48 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/11/12 11:11:48 INFO input.FileInputFormat: Total input paths to process : 1
14/11/12 11:11:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/11/12 11:11:48 WARN snappy.LoadSnappy: Snappy native library not loaded
14/11/12 11:11:49 INFO mapred.JobClient: Running job: job_201411051030_0022
14/11/12 11:11:50 INFO mapred.JobClient: map 0% reduce 0%
14/11/12 11:11:56 INFO mapred.JobClient: Task Id : attempt_201411051030_0022_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: imageFeatures.KLSHMapper
...
So it errors out because it fails to find the mapper class. There is the "No job jar file set" warning, but I do call job.setJarByClass in the first block of code, so I don't know why that warning is being thrown...
I also know the KLSHMapper class is in the JAR because if I run the following command:
jar tf KLSH.jar
I get quite a lot of output, but here's a portion of the output:
...
imageFeatures/Main.class
imageFeatures/Feature.class
imageFeatures/FileLoader.class
imageFeatures/KLSHMapper.class
...
So clearly the KLSHMapper class is in there... I've tried modifying my Hadoop classpath to include the KLSH.jar path, I've tried copying KLSH.jar onto the DFS and using that path instead of the path on my local file system, and I've also tried executing the job with a -libjars specifier. No matter what I try, Hadoop seems unable to locate my Mapper class. Could anyone point me towards what I'm doing wrong? I just can't seem to make the jump from my code working in Eclipse to making it work on an actual Hadoop cluster. Thanks!
After some additional work, I was able to solve my own problem. Ultimately, it came down to the way I was building the JAR file that I was then trying to execute on the Hadoop cluster.
Instead of using Eclipse to build the JAR file, I built it with Maven from the command line. In the pom.xml file, you can specify the main class using something along these lines:
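Roughly, using the maven-jar-plugin (the plugin version here is just an assumption):

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <version>2.4</version>
      <configuration>
        <archive>
          <manifest>
            <mainClass>maxTemp.MaxTemperature</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>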
In my case, maxTemp was the package and MaxTemperature was the class that contained my main method. This causes the manifest file inside the JAR that Maven builds for you to have the following line added to it:
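With the mainClass shown above, that manifest entry looks like this:

Main-Class: maxTemp.MaxTemperature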
Now when you use hadoop to execute the JAR file, you no longer have to specify the main class, since it is already set in the JAR's manifest. Without this line in the manifest, you would need to execute your job on your cluster using this syntax:
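For example (the JAR name and input path here are placeholders):

./hadoop jar ./maxTemp.jar maxTemp.MaxTemperature /inputPath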
With the line in the manifest file, you can just execute the job as follows:
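Using the same placeholders:

./hadoop jar ./maxTemp.jar /inputPath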
As an aside and somewhat related, I ran into some issues because I was using the jeigen Java library to do some linear algebra. My cluster wasn't able to find the dependencies I had used (jeigen.jar), and it was throwing more errors. I ended up building a fat JAR, as described on this site:
http://hadoopi.wordpress.com/2014/06/05/hadoop-add-third-party-libraries-to-mapreduce-job/
With some additions to my pom.xml file, I was able to generate a maxTemp-jar-with-dependencies, and the cluster was then able to find all of my dependencies because they were included in the JAR file. I hope this helps save someone some time in the future. Some of these dependencies were only on my local system, and Maven wasn't able to fetch them from a repository, so I pointed Maven at them and installed them manually using the following command:
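The command looks roughly like this; the file path and the Maven coordinates I give jeigen here are placeholders:

mvn install:install-file -Dfile=/path/to/jeigen.jar -DgroupId=jeigen -DartifactId=jeigen -Dversion=1.0 -Dpackaging=jar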
Here is my pom.xml file, which generates two JARs, one with and one without dependencies included:
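A trimmed-down sketch of that pom.xml; the group/artifact IDs, versions, and the jeigen coordinates are placeholders, and the jar-with-dependencies is produced by the maven-assembly-plugin:

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>maxTemp</groupId>
  <artifactId>maxTemp</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <dependencies>
    <!-- provided so the Hadoop classes are not bundled into the fat jar -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
      <scope>provided</scope>
    </dependency>
    <!-- coordinates as installed manually with install:install-file above -->
    <dependency>
      <groupId>jeigen</groupId>
      <artifactId>jeigen</artifactId>
      <version>1.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- plain jar with Main-Class set in the manifest -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <archive>
            <manifest>
              <mainClass>maxTemp.MaxTemperature</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
      <!-- second jar with all dependencies bundled in (the fat jar) -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>maxTemp.MaxTemperature</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

Running mvn package with this setup produces both maxTemp-1.0.jar and maxTemp-1.0-jar-with-dependencies.jar in the target directory.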