How to parse through XML and convert to CSV using Python

525 views Asked by At

I am pretty new to Python programming. We recently got into a project where I am required to convert XML document to a text or a CSV format so that we can then load it to Hadoop and present it to users via Impala. Below is a sample XML and I am looking for an output format like shown below. I hate to directly ask for help, but since this is completely new to me, reaching out to you experts.

Sample XML:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<GISF XMLNS="TOOL.COM">
   <HEADER>
      <FILE_NAME>TOY.xml</FILE_NAME>
      <DATE>20130611</DATE>
      <TIME>14:02:00</TIME>
   </HEADER>
<ISSUER>
   <ID>123456</ID>
   <INS>
      <INS_ID>34685</INS_ID>
      <SERIES></SERIES>
      <MAT>
         <MAT_ID>2233445566</MAT_ID>
         <C_TYPE>BCG</C_TYPE>
         <COL_TYPE></COL_TYPE>
         <MAT_RAT_GR>
            <RGC>STR</RGC>
            <MAT_RA>
               <TYPE>FC</TYPE>
               <RAT>GGG</RAT>
            </MAT_RA>
            <MAT_RA>
               <TYPE>FC2</TYPE>
               <RAT>GGG2</RAT>
            </MAT_RA>
         </MAT_RAT_GR>
         <IDENTIFIER NAME="ABCD" VALUE="GOR345"></IDENTIFIER>
         <IDENTIFIER NAME="EFGH" VALUE="QELH7876"></IDENTIFIER>
      </MAT>
   </INS>
</ISSUER>
<ISSUER>
   <ID>777888</ID>
   <INS>
      <INS_ID>444555</INS_ID>
      <SERIES></SERIES>
      <MAT>
         <MAT_ID>444555666</MAT_ID>
         <C_TYPE>BCD</C_TYPE>
         <COL_TYPE></COL_TYPE>
         <MAT_RAT_GR>
            <RGC>STR</RGC>
            <MAT_RA>
               <TYPE>FC3</TYPE>
               <RAT>GGG4</RAT>
            </MAT_RA>
         </MAT_RAT_GR>
         <IDENTIFIER NAME="ABCD" VALUE="GOR345"></IDENTIFIER>
         <IDENTIFIER NAME="EFGH" VALUE="QELH7876"></IDENTIFIER>
      </MAT>
   </INS>
</ISSUER>
</GISF>

Expected Output:

ID           INS_ID       MAT_ID           TYPE       RAT

123456    ,   34685   ,    2233445566  ,      FC    ,     GGG
123456    ,   34685   ,    2233445566  ,      FC2    ,    GGG2
777888    ,   444555  ,    444555666   ,      FC3    ,    GGG4

I tried below script it's giving me output in csv format. But I really think there is a better way of doing this as I am manually traversing through different levels and also hardcoding the Tag names.

import xml.etree.ElementTree as ET
tree = ET.parse("GISF.xml")
root = tree.getroot()
for ISSUER in root.findall('ISSUER'):
    Iss_id = ISSUER.find('ID').text   
    for INSTRUMENT in ISSUER.findall('INS'):
        ins_id = INSTRUMENT.find('INS_ID').text
            for MATURITY in INSTRUMENT.findall('MAT'):
            may_id = MATURITY.find('MAT_ID').text
            line = Iss_id+','+ins_id+','+may_id
            print (line)

Generated Output:

    123456,34685,2233445566
    777888,444555,444555666
1

There are 1 answers

0
rajesh.kanakabandi On

https://pypi.python.org/pypi/xmlutils

this is a good place to start.