Parsing JSON in Spark


I was using the Scala JSON library (scala.util.parsing.json.JSON) to parse a JSON file from a local drive in a Spark job:

val requestJson = JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]

But when I try to use the same parser with an HDFS file location, it doesn't work:

val requestJson=JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)

and it gives me this error:

java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
  at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)
  ... 128 elided

How can I use JSON.parseFull to read data from an HDFS file location?

Thanks


There are 3 answers

Answered by mrsrinivas (Best Answer)

Spark has built-in support for parsing JSON documents, available in the spark-sql_${scala.version} jar.

In Spark 2.0+:

import org.apache.spark.sql.SparkSession 

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate

// Read the JSON file(s) from HDFS into a DataFrame
val df = spark.read.format("json").load("json/file/location/in/hdfs")

df.show()

With the df object you can perform all supported SQL operations, and its data processing will be distributed across the nodes, whereas requestJson is computed on a single machine only.
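For example, to pull out the currency field from the question (a sketch; the Request/currency field names are assumed from the asker's JSON):

// Select a nested field; assumes documents shaped like {"Request": {"currency": "..."}}
df.select("Request.currency").show()

// Or register a temp view and query it with plain SQL
df.createOrReplaceTempView("requests")
spark.sql("SELECT Request.currency FROM requests").show()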

Maven dependency:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>

Edit (as per a comment, to read the file from HDFS):

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.Path

// Connect to the HDFS namenode (host/port from the original example)
val hdfs = org.apache.hadoop.fs.FileSystem.get(
             new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
             new org.apache.hadoop.conf.Configuration()
           )
// x is a placeholder for the directory name, as in the source this was taken from
val path = new Path("/user/zhc/" + x + "/")
val t = hdfs.listStatus(path)
// Open the first file in the directory and read it line by line
val in = hdfs.open(t(0).getPath)
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()

Code credit: adapted from another SO question.
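To feed this into JSON.parseFull as the original question asks, one option is to drain the reader into a single string first. A minimal sketch, assuming the reader from the snippet above:

import scala.util.parsing.json.JSON

// Drain the BufferedReader into one string, then parse it with the Scala JSON library
val content = Iterator.continually(reader.readLine()).takeWhile(_ != null).mkString("\n")
reader.close()

val requestJson = JSON.parseFull(content)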

Maven dependency:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
Answered by Madhu Kiran Seelam

It is much easier in Spark 2.0:

val df = spark.read.json("json/file/location/in/hdfs")
df.show()
Answered by Rahul Modak

You can use the following in Spark to read the file from HDFS:

val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
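The resulting string can then be handed to the parser from the question. A minimal sketch, assuming the Request/currency structure from the asker's code:

import scala.util.parsing.json.JSON

// sc.textFile goes through Hadoop's FileSystem API, so hdfs:// paths work here,
// unlike Source.fromFile, which only reads from the local filesystem
val requestJson = JSON.parseFull(jsonText)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]]("Request").asInstanceOf[Map[String, Any]]
val currency = mainJson("currency").asInstanceOf[String]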