Hive XML Serde : Parsing error when loading xml content

514 views Asked by At

I am trying to load the following XML content using SerDe with Hive :

<?xml version="1.0"?>
<RootTag xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.website.com/service">
<Code>123</Code>
<ParentElement>
   <Entity>
      <EntityId>A</EntityId>
      <EntityCode i:nil="true"/>
   </Entity>
   <Entity>
      <EntityId>M</EntityId>
      <EntityCode i:nil="true"/>
   </Entity>
</ParentElement>
</RootTag>

The hive table was created as follows:

CREATE EXTERNAL TABLE database.mytable(
code   String, 
Entity array<struct<Entity:struct<EntityId:String,EntityCode:String>>>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES 
(
  "column.xpath.Code" = "/RootTag/Code/text()",
  "column.xpath.ParentElement" = "/RootTag/ParentElement"
)
STORED AS 
INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION     '/xml_content/'
TBLPROPERTIES ("xmlinput.start" = "<RootTag", "xmlinput.end" = "</RootTag>");

I have 2 issues:

  1. when I specify the "xmlinput.start" as specified above, it doesn't work. I had to remove manually the content "xmlns:i = ... /service" next to the "RootTage" in order to start parsing the xml.
  2. Even though, there is another issue with "EntityCode" attribute. I am getting the error:
Caused by: org.apache.hive.service.cli.HiveSQLException:
  java.io.IOException:
  org.apache.hadoop.hive.serde2.SerDeException: 
  java.lang.RuntimeException:
  org.xml.sax.SAXParseException;
  lineNumber: 41;
  columnNumber: 33;
  The prefix "i" for attribute "i:nil" associated with an element type "ParentCode" is not bound.

What am I doing wrong? I would appreciate your suggestions and comments regarding this.

1

There are 1 answers

0
Gaurang Shah On

There are few things.

  • I think the root tag with namespace is only supported on version 1.5.0.3 or greater. check the following link https://github.com/dvasilen/Hive-XML-SerDe/issues/15
  • I am don't think it supports the i:nil="true"
  • Mapping of column should match the column name and it's case sensitive. column.xpath.Code and code won't work. Also there is no corrosponding column for column.xpath.ParentElement

So what you need to do?

  • Upgrade your serde.
  • replace <EntityCode i:nil="true"/> to <EntityCode/> in your xml
  • Fix the code as mentioned below

Code

CREATE EXTERNAL TABLE temp.test_xml(
code   String,
Entity array<struct<Entity:struct<EntityId:String,EntityCode:String>>>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES 
(
  "column.xpath.code" = "/RootTag/Code/text()",
  "column.xpath.Entity" = "/RootTag/ParentElement/Entity"

)
STORED AS 
INPUTFORMAT  'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION     '/tmp/test_xml1/table/'
TBLPROPERTIES (
                "xmlinput.start"="<RootTag",
                "xmlinput.end"="</RootTag>");