How to get DF from XML using read_xml() and xpath?

66 views Asked by At

Here is a XML with various values as follows. How can I parse each value?

df = pd.read_xml(StringIO(r.text), xpath='//*[local-name()="result"]')

This shows: RGT_STD_DT CASH_ALOC_RATIO ISSUCO_CUSTNO RGT_RACD 0 NaN NaN NaN NaN 1 NaN NaN
NaN NaN 2 NaN NaN NaN NaN 3
NaN NaN NaN NaN 4 NaN NaN
NaN NaN 5 NaN NaN NaN NaN 6
NaN NaN NaN NaN 7 NaN NaN
NaN NaN 8 NaN NaN NaN NaN 9
NaN NaN NaN NaN

<?xml version="1.0" encoding="UTF-8"?>
<vector result="10">
  <data vectorkey="0" type="Document">
    <result>
      <RGT_STD_DT value="20231005"/>
      <CASH_ALOC_RATIO value="220"/>
      <ISSUCO_CUSTNO value="997"/>
      <RGT_RACD value="103"/>
    </result>
  </data>
</vector>

Could you please hele me to get each value under "result"?

1

There are 1 answers

2
LMC On

Values are stored in attributes not on element's text.
For the given sample: read all attributes, transpose the frame, and give each column the element name

import pandas as pd
from io import StringIO

xml='''<?xml version="1.0" encoding="UTF-8"?>
<vector result="10">
  <data vectorkey="0" type="Document">
    <result>
      <RGT_STD_DT value="20231005"/>
      <CASH_ALOC_RATIO value="220"/>
      <ISSUCO_CUSTNO value="997"/>
      <RGT_RACD value="103"/>
    </result>
  </data>
</vector>'''

df = pd.read_xml(StringIO(xml), xpath='//*[local-name()="result"]/*')
df2 = df.transpose()
print(df2)
df2.columns = ["RGT_STD_DT" ,"CASH_ALOC_RATIO", "ISSUCO_CUSTNO", "RGT_RACD"] 
print(df2)

Transposed frame

              0    1    2    3
value  20231005  220  997  103

Final result with column names

       RGT_STD_DT  CASH_ALOC_RATIO  ISSUCO_CUSTNO  RGT_RACD
value    20231005              220            997       103

if more that 1 result element is present use

i=0
df2 = []
while i < len(df.index):
    tdf = df[i:i+4]
    tdf = tdf.transpose()
    tdf.columns = ["RGT_STD_DT" ,"CASH_ALOC_RATIO", "ISSUCO_CUSTNO", "RGT_RACD"]
    df2.append(tdf)
    i=i+4

print(pd.concat(df2))

Result

       RGT_STD_DT  CASH_ALOC_RATIO  ISSUCO_CUSTNO  RGT_RACD
value    20231005              220            997       103
value    20231105              339            768        97

Using Pandas 2.1.1

This version has much better handling of NaNs or NaTs

df = pd.read_xml(StringIO(xml), xpath='//*[local-name()="result"]/*', attrs_only=True)

i=0
df2 = []
while i < len(df.index):
    tdf = df[i:i+4]
    tdf = tdf.transpose()
    tdf.columns = ["RGT_STD_DT", "TH1_PAY_TERM_BEGIN_DT", "SHOTN_ISIN", "ISSUCO_CUSTNO"]
    tdf['RGT_STD_DT'] = pd.to_datetime(tdf['RGT_STD_DT'], format='%Y%m%d')
    tdf['TH1_PAY_TERM_BEGIN_DT'] = pd.to_datetime(tdf['TH1_PAY_TERM_BEGIN_DT'], format='%Y%m%d')
    df2.append(tdf)
    i=i+4
    

print(pd.concat(df2))

Result NaN now appears as NaT since dates are parsed. value="" would produce a NaT

      RGT_STD_DT TH1_PAY_TERM_BEGIN_DT  SHOTN_ISIN  ISSUCO_CUSTNO
value 2023-11-01                   NaT    183190.0        21505.0
value 2023-10-31                   NaT    448730.0        48052.0
value 2023-10-31                   NaT    453440.0        48543.0
value 2023-10-05            2023-10-25      9970.0          997.0
value 2023-10-04            2023-10-18     78520.0         9602.0