When I try to write my RDD to a text file on HDFS as shown below, I am getting an error.
val rdd = sc.textFile("/user/hadoop/dxld801/test.txt")
val filtered = rdd.map({line=> line.replace("\\N","NULL")})
filtered.saveAsTextFile("hdfs:///user/hadoop/dxld801/test.txt")
Error:
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectFileOutputCommitter not found
I am running all the above in a spark-shell and my spark version is 1.4.0
This is the command I am using to launch the shell
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.2.0 --jars /home/hadoop/lib/native/hadoop-lzo-0.4.14.jar
I have tried googling to get where this class “DirectFileOutputCommitter
” is available, but looks like this class doesn’t exist at all in this world.
Trace:
java.lang.RuntimeException: java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.mapred.DirectFileOutputCommitter not found
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1927)
> at org.apache.hadoop.mapred.JobConf.getOutputCommitter(JobConf.java:722)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:983)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:897)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:897)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:896)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1400)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1379)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1379)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1379)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC.<init>(<console>:41)
> at $iwC.<init>(<console>:43)
> at <init>(<console>:45)
> at .<init>(<console>:49)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.mapred.DirectFileOutputCommitter not found
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1919)
> ... 68 more Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.mapred.DirectFileOutputCommitter not found
> at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893)
> ... 69 more
Could anyone help me resolve this?
It looks like the path to save location
"hdfs:///user/hadoop/dxld801/test.txt"
is wrong. It should be:"/user/hadoop/dxld801/test.txt"
or"hdfs://HOSTNAME:PORT/user/hadoop/dxld801/test.txt"
Also fix the line:
rdd.map({line=> line.replace("\\N","NULL")})
tordd.map{line=> line.replace("\\N","NULL")}