I am pretty new to Python programming. We recently got into a project where I am required to convert XML document to a text or a CSV format so that we can then load it to Hadoop and present it to users via Impala. Below is a sample XML and I am looking for an output format like shown below. I hate to directly ask for help, but since this is completely new to me, reaching out to you experts.
Sample XML:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<GISF XMLNS="TOOL.COM">
<HEADER>
<FILE_NAME>TOY.xml</FILE_NAME>
<DATE>20130611</DATE>
<TIME>14:02:00</TIME>
</HEADER>
<ISSUER>
<ID>123456</ID>
<INS>
<INS_ID>34685</INS_ID>
<SERIES></SERIES>
<MAT>
<MAT_ID>2233445566</MAT_ID>
<C_TYPE>BCG</C_TYPE>
<COL_TYPE></COL_TYPE>
<MAT_RAT_GR>
<RGC>STR</RGC>
<MAT_RA>
<TYPE>FC</TYPE>
<RAT>GGG</RAT>
</MAT_RA>
<MAT_RA>
<TYPE>FC2</TYPE>
<RAT>GGG2</RAT>
</MAT_RA>
</MAT_RAT_GR>
<IDENTIFIER NAME="ABCD" VALUE="GOR345"></IDENTIFIER>
<IDENTIFIER NAME="EFGH" VALUE="QELH7876"></IDENTIFIER>
</MAT>
</INS>
</ISSUER>
<ISSUER>
<ID>777888</ID>
<INS>
<INS_ID>444555</INS_ID>
<SERIES></SERIES>
<MAT>
<MAT_ID>444555666</MAT_ID>
<C_TYPE>BCD</C_TYPE>
<COL_TYPE></COL_TYPE>
<MAT_RAT_GR>
<RGC>STR</RGC>
<MAT_RA>
<TYPE>FC3</TYPE>
<RAT>GGG4</RAT>
</MAT_RA>
</MAT_RAT_GR>
<IDENTIFIER NAME="ABCD" VALUE="GOR345"></IDENTIFIER>
<IDENTIFIER NAME="EFGH" VALUE="QELH7876"></IDENTIFIER>
</MAT>
</INS>
</ISSUER>
</GISF>
Expected Output:
ID INS_ID MAT_ID TYPE RAT
123456 , 34685 , 2233445566 , FC , GGG
123456 , 34685 , 2233445566 , FC2 , GGG2
777888 , 444555 , 444555666 , FC3 , GGG4
I tried below script it's giving me output in csv format. But I really think there is a better way of doing this as I am manually traversing through different levels and also hardcoding the Tag names.
import xml.etree.ElementTree as ET
tree = ET.parse("GISF.xml")
root = tree.getroot()
for ISSUER in root.findall('ISSUER'):
Iss_id = ISSUER.find('ID').text
for INSTRUMENT in ISSUER.findall('INS'):
ins_id = INSTRUMENT.find('INS_ID').text
for MATURITY in INSTRUMENT.findall('MAT'):
may_id = MATURITY.find('MAT_ID').text
line = Iss_id+','+ins_id+','+may_id
print (line)
Generated Output:
123456,34685,2233445566
777888,444555,444555666
https://pypi.python.org/pypi/xmlutils
this is a good place to start.