Using Python to convert Excel to XML with a complex/advanced mapping

1.3k views Asked by At

I am new to the coding world, and I am stuck on an issue for quite some time now.

I want to convert an excel file to an XML for certain analysis. The XML file however has a complex schema, and I am unable to replicate the Sample XML.

I have tried using Python with some of it's libraries, but I am unable to do so.

This is the data in XLSX file:

Excel Data

and the expected XML output is as follows:

<REP019-GIValueMeasuresReport
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xmlns="urn:fsa-gov-uk:MER:REP019:1"
     xsi:schemaLocation="urn:XXXXXXX:REP019:1 
                         http://XXXXXXXXX/REP019/v1/REP019-Schema.xsd"
                
            currency = "GBP"
            units=     "single">
                
                
       <Declaration>
                  <Return>no</Return>
       </Declaration>
       <ReturnDetails>
                    <Consent>yes</Consent>
                    <Informed>yes</Informed>
       </ReturnDetails>
       <ProductReport>
                  <Product>After the event legal expenses (All)</Product>
                   <FiveLargest>Intermediary Z</FiveLargest>
                   <FiveLargest>Intermediary X</FiveLargest>
                   <FiveLargest>Intermediary Y</FiveLargest>
                   <FiveLargest>Brand A</FiveLargest>
                   <FiveLargest>Brand B</FiveLargest>
                   <Sales>5000</Sales>
                   <CostToState>500000</CostToState>
                   <NumberofClaims>375</NumberofClaims>
                   <Average>6000.00</Average>
                   <Frequency>6.25</Frequency>
                   <NoofParts>320</NoofParts>
                   <Nulled>25</Nulled>
                   <Accepted>93.33</Accepted>
                   <Paid>32000</Paid>
                   <Avg_C>100</Avg_C>
                   <Highest>1200</Highest>
                   <MEH>25</MEH>
                   <Complaints>5</Complaints>
                   <Percentage>1.33</Percentage>
        </ProductReport>
        <ProductReport>
                  <Product>Alloy Wheel insurance (Add-on)</Product>
                   <FiveLargest>Intermediary Z</FiveLargest>
                   <FiveLargest>Intermediary X</FiveLargest>
                   <FiveLargest>Brand A</FiveLargest>              
                   <Sales>17000</Sales>
                   <CostToState>765000</CostToState>
                   <NumberofClaims>340</NumberofClaims>
                   <Average>16400.00</Average>
                   <Frequency>2.07</Frequency>
                   <NoofParts>280</NoofParts>
                   <Nulled>78</Nulled>
                   <Accepted>77.06</Accepted>
                   <Paid>39200</Paid>
                   <Avg_C>140</Avg_C>
                   <Highest>2000</Highest>
                   <MEH>50</MEH>
                   <Complaints>20</Complaints>
                   <Percentage>5.88</Percentage>
                    </ProductReport>
        <ProductReport>
                  <Product>Alloy Wheel insurance (Stand-alone)</Product>
                   <FiveLargest>Brand A</FiveLargest>             
                   <Sales>100000</Sales>
                   <CostToState>7000000</CostToState>
                   <NumberofClaims>3500</NumberofClaims>
                   <Average>87000.00</Average>
                   <Frequency>4.02</Frequency>
                   <NoofParts>3500</NoofParts>
                   <Nulled>100</Nulled>
                   <Accepted>97.14</Accepted>
                   <Paid>2275000</Paid>
                   <Avg_C>650</Avg_C>
                   <Highest>10500</Highest>
                   <MEH>325</MEH>
                   <Complaints>170</Complaints>
                   <Percentage>4.86</Percentage>
        </ProductReport>
        <ProductReport>
                   <Product>Before the event legal expenses - home (All)</Product>
                   <FiveLargest>Intermediary Z</FiveLargest>
                   <FiveLargest>Intermediary X</FiveLargest>
                   <FiveLargest>Intermediary Y</FiveLargest>
                   <FiveLargest>Brand A</FiveLargest>
                   <FiveLargest>Brand B</FiveLargest>
                   <Sales>5000</Sales>
                   <CostToState>500000</CostToState>
                   <NumberofClaims>375</NumberofClaims>
                   <Average>6000.00</Average>
                   <Frequency>6.25</Frequency>
                   <NoofParts>320</NoofParts>
                   <Nulled>25</Nulled>
                   <Accepted>93.33</Accepted>
                   <Paid>32000</Paid>
                   <Avg_C>100</Avg_C>
                   <Highest>1200</Highest>
                   <MEH>25</MEH>
                   <Complaints>5</Complaints>
                   <Percentage>1.33</Percentage>
        </ProductReport>
        <ProductReport>
                   <Product>Before the event legal expenses - motor (All)</Product>
                   <FiveLargest>Intermediary Z</FiveLargest>
                   <FiveLargest>Intermediary X</FiveLargest>
                   <FiveLargest>Brand A</FiveLargest>              
                   <Sales>17000</Sales>
                   <CostToState>765000</CostToState>
                   <NumberofClaims>340</NumberofClaims>
                   <Average>16400.00</Average>
                   <Frequency>2.07</Frequency>
                   <NoofParts>280</NoofParts>
                   <Nulled>78</Nulled>
                   <Accepted>77.06</Accepted>
                   <Paid>39200</Paid>
                   <Avg_C>140</Avg_C>
                   <Highest>2000</Highest>
                   <MEH>50</MEH>
                   <Complaints>20</Complaints>
                   <Percentage>5.88</Percentage>
        </ProductReport>
        <ProductReport>
                   <Product>Before the event legal expenses - other (All)</Product>
                   <FiveLargest>Brand A</FiveLargest>             
                   <Sales>100000</Sales>
                   <CostToState>7000000</CostToState>
                   <NumberofClaims>3500</NumberofClaims>
                   <Average>87000.00</Average>
                   <Frequency>4.02</Frequency>
                   <NoofParts>3500</NoofParts>
                   <Nulled>100</Nulled>
                   <Accepted>97.14</Accepted>
                   <Paid>2275000</Paid>
                   <Avg_C>650</Avg_C>
                   <Highest>10500</Highest>
                   <MEH>325</MEH>
                   <Complaints>170</Complaints>
                   <Percentage>4.86</Percentage>
        </ProductReport>
    
</REP019-GIValueMeasuresReport>
    
                
                
1

There are 1 answers

0
moken On

Should be easy enough, read a tutorial like the GeekforGeeks on doing such a conversion,
https://www.geeksforgeeks.org/how-to-convert-excel-to-xml-format-in-python/

Since currency, units and the <Declaration> and <ReturnDetails> tags are common to all in the xml file and are all the same in the spreadsheet, I presume these can be taken from any row so no need to iterate for those just select a cell to use, like

currency = ws['A2'].value
units = ws['B2'].value

Currency and units can be added to the 'xml_schema =' text along with the rest of your schema text formatted as you like. <Declaration> and <ReturnDetails> can be achieved using the 'with tag' syntax without the iterations, e.g. this for <ReturnDetails>.

with tag('ReturnDetails'):
        with tag("Consent"):
            text(ws['D2'].value)
        with tag("Informed"):
            text(ws['E2'].value)

Do same for <Declaration> pointing to the correct cell value. Then for each of the rest of the rows follow the example in the tutorial to iterate (from columns 6 to 21) using outer tag <ProductReport> and inner tags per the column headers.

with tag('ProductReport'):
    with tag("Product"):
        text(row[0])
    with tag("FiveLargest"):
        text(row[1])

The part I'm not sure about is you show in the example xml the first <ProductReport> with 5 <FiveLargest> tags, the second has 3 etc. Do not know how this is determined as each row appears to have only one entry for this column, are there are multiple lines in these cells? hidden in the screen shot

           <FiveLargest>Intermediary Z</FiveLargest>
           <FiveLargest>Intermediary X</FiveLargest>
           <FiveLargest>Intermediary Y</FiveLargest>
           <FiveLargest>Brand A</FiveLargest>
           <FiveLargest>Brand B</FiveLargest>

If that is the case you'll need read the cell contents, split on the new line and iterate the list for the same tag name.

- -----------------Add Code example---------------
Below is an example of how to code this. If the <FiveLargest> are multilined this code will manage that.

from openpyxl import load_workbook
from yattag import Doc, indent


wb = load_workbook("afegI.xlsx")
ws = wb["afegI"]
currency = ws['A2'].value
units = ws['B2'].value

for row in ws.iter_rows(min_row=1, max_row=2, min_col=1, max_col=6):
    print([cell.value for cell in row])

doc, tag, text = Doc().tagtext()

xml_schema = """
<REP019-GIValueMeasuresReport
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="urn:fsa-gov-uk:MER:REP019:1"
    xsi:schemaLocation="urn:XXXXXXX:REP019:1
                        http://XXXXXXXXX/REP019/v1/REP019-Schema.xsd"

             currency = "{0}"
             units="{1}">
"""
xml_schema_close = '</REP019-GIValueMeasuresReport>'
doc.asis(xml_schema)
with tag('Declaration'):
        with tag("Return"): text(ws['C2'].value)

with tag('ReturnDetails'):
        with tag("Consent"):
            text(ws['D2'].value)
        with tag("Informed"):
            text(ws['E2'].value)

for row in ws.iter_rows(min_row=2, max_row=10, min_col=6, max_col=21):
    row = [cell.value for cell in row]
    with tag('ProductReport'):
        with tag("Product"):
            text(row[0])
        multi_line = row[1].split('\n')
        for line in multi_line:
            with tag("FiveLargest"):
                text(line)
        with tag("Sales"):
            text(row[2])
        with tag("CostToState"):
            text(row[3])
        with tag("NumberofClaims"):
            text(row[4])
        with tag("Average"):
            text(row[5])
        with tag("Frequency"):
            text(row[6])
        with tag("NoofParts"):
            text(row[7])
        with tag("Nulled"):
            text(row[8])
        with tag("Accepted"):
            text(row[9])
        with tag("Paid"):
            text(row[10])
        with tag("Avg_C"):
            text(row[11])
        with tag("Highest"):
            text(row[12])
        with tag("MEH"):
            text(row[13])
        with tag("Complaints"):
            text(row[14])
        with tag("Percentage"):
            text(row[15])

doc.asis(xml_schema_close)
result = indent(
    doc.getvalue().format(currency, units),
    indentation='   ',
    indent_text=True
)

with open("output.xml", "w") as f:
    f.write(result)