How to add Header info to row info while parsing a xml with spark

296 views Asked by At

I have a xml structure like

 <root>
    <bookinfo>
    <time>1232314973</time>
   <requestID>233</requestID>
   <supplier>asd123</supplier>
  </bookinfo>

 <books>
  <book>
         <name>book1</name>
          <pages>124</pages>
    </book>
    <book>
         <name>book2</name>
          <pages>456</pages>
    </book>
    <book>
         <name>book4</name>
          <pages>789</pages>
    </book>
 </books>
</root>

I know that I can parse the books like:

val xml = sqlContext.read.format("com.databricks.spark.xml")
                  .option("rowTag", "book").load("FILENAME")

But I would like to add the Header information like supplier to each of the rows.

Is there a way to add this "headerinfo" to all rows with spark without loading the file twice and store the info in global vars/vals?

Thanks in advance!

1

There are 1 answers

1
pasha701 On BEST ANSWER

You can read all xml starting from "root" tag, and then explode required tags:

val df = hiveContext.read.format("xml").option("rowTag", "root").load("books.xml")
df.printSchema()
df.show(false)

println("-- supplier --")
val supplierDF = df.select(col("bookinfo.supplier"))
supplierDF.printSchema()
supplierDF.show(false)

println("-- books --")
val booksDF = df.select(explode(col("books.book")).alias("bookDetails"))
booksDF.printSchema()
booksDF.show(false)

println("-- bookDetails --")
val booksDetailsDF = booksDF.select(col("bookDetails.name"), col("bookDetails.pages"))
booksDetailsDF.printSchema()
booksDetailsDF.show(false)

Output:

root
 |-- bookinfo: struct (nullable = true)
 |    |-- requestID: long (nullable = true)
 |    |-- supplier: string (nullable = true)
 |    |-- time: long (nullable = true)
 |-- books: struct (nullable = true)
 |    |-- book: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- pages: long (nullable = true)

+-----------------------+-----------------------------------------------------+
|bookinfo               |books                                                |
+-----------------------+-----------------------------------------------------+
|[233,asd123,1232314973]|[WrappedArray([book1,124], [book2,456], [book4,789])]|
+-----------------------+-----------------------------------------------------+

-- supplier --
root
 |-- supplier: string (nullable = true)

+--------+
|supplier|
+--------+
|asd123  |
+--------+

-- books --
root
 |-- bookDetails: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- pages: long (nullable = true)

+-----------+
|bookDetails|
+-----------+
|[book1,124]|
|[book2,456]|
|[book4,789]|
+-----------+

-- bookDetails --
root
 |-- name: string (nullable = true)
 |-- pages: long (nullable = true)

+-----+-----+
|name |pages|
+-----+-----+
|book1|124  |
|book2|456  |
|book4|789  |
+-----+-----+