I have the following Spark word count program:
package com.sample.spark;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import scala.Tuple2;
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("wordcountspark")
                .setMaster("local")
                .setSparkHome("/Users/hadoop/spark-1.4.0-bin-hadoop1");
        JavaSparkContext sc = new JavaSparkContext(conf);
        //SparkConf conf = new SparkConf();
        //JavaSparkContext sc = new JavaSparkContext("hdfs", "Simple App", "/Users/hadoop/spark-1.4.0-bin-hadoop1", new String[]{"target/simple-project-1.0.jar"});

        JavaRDD<String> textFile = sc.textFile("hdfs://localhost:54310/data/wordcount");

        JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
            }
        });

        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return a + b;
            }
        });

        counts.saveAsTextFile("hdfs://localhost:54310/data/output/spark/outfile");
    }
}
I get a Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanSetDropBehind exception when I run the code from Eclipse. However, if I export it as a runnable jar and run it from the terminal as below, it works:
bin/spark-submit --class com.sample.spark.SparkWordCount --master local /Users/hadoop/spark-1.4.0-bin-hadoop1/finalJars/SparkJar-v2.jar
The Maven POM looks like:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.sample.spark</groupId>
    <artifactId>SparkRags</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>SparkRags</name>
    <url>http://maven.apache.org</url>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.4.0</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>0.23.11</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>
</project>
When you run from Eclipse, the jars referenced on the project's build path are the only source your program has to run against. So the Hadoop jar that provides CanSetDropBehind (the hadoop-core / hadoop-common artifacts in your POM) is, for some reason, not being added properly to your Eclipse classpath from the local repository. You need to work out whether this is a proxy issue or some other problem with the POM and its dependency resolution.
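One way to check what is actually being resolved (just a suggestion from my side, assuming Maven is available on your command line) is to print the dependency tree for the Hadoop artifacts and compare it with what Eclipse shows under Maven Dependencies:

mvn dependency:tree -Dincludes=org.apache.hadoop

If hadoop-core / hadoop-common do not show up there with the versions declared in your POM, the problem is in the resolution (proxy, broken local repository, etc.) rather than in your code.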
When you run the jar from the terminal, the likely reason it works is that the required jar is already present on the classpath that spark-submit sets up from your Spark installation. Another possibility is building a fat jar (one that bundles hadoop-core and the other dependencies) when you export your jar; in that case the class would be picked up from inside your own jar, without depending on the external classpath at all. From your description it sounds like you are not using that option, so the terminal run is relying on the classpath.
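If you do want the jar you run from the terminal to be self-contained instead, a minimal sketch of a fat-jar setup with the maven-shade-plugin would look roughly like this (the plugin version is only an example, adjust it to whatever you have available):

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

With that in place, mvn package produces a shaded jar that bundles the compile-scope dependencies, so a class like CanSetDropBehind would be looked up inside your own jar rather than on the external classpath.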
Verify each of these steps and it should help you identify the cause. Happy coding!