Parsing XML with Several Nodesets Using R

77 views Asked by At

I have an xml file with a lot of children and sub-children (I may be using the wrong terminology here). I've dug around StackOverflow looking for sample of something this complex, but my tries with the XML package and the xml2 package are not extracting any characters. I'm trying to parse out the node, "BusNm", from the xml, along with address data, and then load these variables into a dataframe.

Here's a sample of the file. I hope I loaded it correctly. What do you all think?

<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDFirmSECReport GenOn="2021-03-19">
<Firms>
<Firm>
<Info SECRgnCD="BRO" FirmCrdNb="133693" SECNb="801-73527" BusNm="H/2 CAPITAL PARTNERS" LegalNm="H/2 CREDIT MANAGER LP" UmbrRgstn="Y"/>
<MainAddr Strt1="680 WASHINGTON BOULEVARD" Strt2="SEVENTH FLOOR" City="STAMFORD" State="CT" Cntry="United States" PostlCd="06901-3700" PhNb="203-569-4000" FaxNb="203-569-4178"/>
<MailingAddr/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="2012-03-26"/>
<NoticeFiled>
<States RgltrCd="CT" St="FILED" Dt="2015-01-08"/>
<States RgltrCd="NY" St="FILED" Dt="2017-03-02"/>
</NoticeFiled>
<Filing Dt="2020-03-26" FormVrsn="10/2017"/>
<FormInfo>
<Part1A>
<Item1 Q1F5="2" Q1I="N" Q1M="N" Q1N="N" Q1O="N" Q1P="549300XR0J90YJAJ0H10">
<WebAddrs/>
</Item1>
<Item2A Q2A1="Y" Q2A2="N" Q2A4="N" Q2A5="N" Q2A6="N" Q2A7="N" Q2A8="N" Q2A9="N" Q2A10="N" Q2A11="N" Q2A12="N" Q2A13="N"/>
<Item2B/>
<Item3A OrgFormNm="Limited Partnership"/>
<Item3B Q3B="DECEMBER"/>
<Item3C StateCD="DE" CntryNm="United States"/>
<Item5A TtlEmp="29"/>
<Item5B Q5B1="14" Q5B2="0" Q5B3="0" Q5B4="0" Q5B5="0" Q5B6="0"/>
<Item5C Q5C1="0" Q5C2="30"/>
<Item5D Q5DF1="16" Q5DF3="6391360468"/>
<Item5E Q5E1="Y" Q5E2="N" Q5E3="N" Q5E4="N" Q5E5="N" Q5E6="Y" Q5E7="N"/>
<Item5F Q5F1="Y" Q5F2A="6391360468" Q5F2B="0" Q5F2C="6391360468" Q5F2D="16" Q5F2E="0" Q5F2F="16" Q5F3="2504971810"/>
<Item5G Q5G1="N" Q5G2="N" Q5G3="N" Q5G4="Y" Q5G5="N" Q5G6="N" Q5G7="N" Q5G8="N" Q5G9="N" Q5G10="N" Q5G11="N" Q5G12="N"/>
<Item5H/>
<Item5I Q5I1="N"/>
<Item5J Q5J1="N" Q5J2="N"/>
<Item6A Q6A1="N" Q6A2="N" Q6A3="Y" Q6A4="N" Q6A5="N" Q6A6="N" Q6A7="N" Q6A8="N" Q6A9="N" Q6A10="N" Q6A11="N" Q6A12="N" Q6A13="N" Q6A14="N"/>
<Item6B Q6B1="N" Q6B3="N"/>
<Item7A Q7A1="N" Q7A2="N" Q7A3="N" Q7A4="N" Q7A5="N" Q7A6="Y" Q7A7="N" Q7A8="N" Q7A9="N" Q7A10="N" Q7A11="N" Q7A12="N" Q7A13="N" Q7A14="N" Q7A15="N" Q7A16="Y"/>
<Item7B Q7B="Y"/>
<Item8A Q8A1="N" Q8A2="N" Q8A3="N"/>
<Item8B Q8B1="N" Q8B2="N" Q8B3="N"/>
<Item8C Q8C1="Y" Q8C2="Y" Q8C3="Y" Q8C4="Y"/>
<Item8D Q8D="N"/>
<Item8E Q8E="N"/>
<Item8F/>
<Item8G Q8G1="N"/>
<Item8H Q8H1="N" Q8H2="N"/>
<Item8I Q8I="N"/>
<Item9A Q9A1A="Y" Q9A1B="Y" Q9A2A="4896054396" Q9A2B="16"/>
<Item9B Q9B1A="Y" Q9B1B="Y" Q9B2A="4896054396" Q9B2B="16"/>
<Item9C Q9C1="N" Q9C2="Y" Q9C3="N" Q9C4="N"/>
<Item9D Q9D1="N" Q9D2="N"/>
<Item9E/>
<Item9F Q9F="6"/>
<Item10A Q10A="N"/>
<Item11 Q11="N"/>
<Item11A Q11A1="N" Q11A2="N"/>
<Item11B Q11B1="N" Q11B2="N"/>
<Item11C Q11C1="N" Q11C2="N" Q11C3="N" Q11C4="N" Q11C5="N"/>
<Item11D Q11D1="N" Q11D2="N" Q11D3="N" Q11D4="N" Q11D5="N"/>
<Item11E Q11E1="N" Q11E2="N" Q11E3="N" Q11E4="N"/>
<Item11F Q11F="N"/>
<Item11G Q11G="N"/>
<Item11H Q11H1A="N" Q11H1B="N" Q11H1C="N" Q11H2="N"/>
</Part1A>
</FormInfo>
</Firm>
1

There are 1 answers

1
Dave2e On BEST ANSWER

I am going to assume your file will have multiple businesses each within its own "Firm" node.

This is a straight forward problem of identifying the correct child nodes and extracting the desired information from the attributes.

Here is a solution using the dplyr and xml2 packages. I modified your sample input to simulate 2 different business in the xml file.

page <- read_xml('<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDFirmSECReport GenOn="2021-03-19">
<Firms>
<Firm>
<Info SECRgnCD="BRO" FirmCrdNb="133693" SECNb="801-73527" BusNm="H/2 CAPITAL PARTNERS" LegalNm="H/2 CREDIT MANAGER LP" UmbrRgstn="Y"/>
<MainAddr Strt1="680 WASHINGTON BOULEVARD" Strt2="SEVENTH FLOOR" City="STAMFORD" State="CT" Cntry="United States" PostlCd="06901-3700" PhNb="203-569-4000" FaxNb="203-569-4178"/>
<MailingAddr/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="2012-03-26"/>
<NoticeFiled>
<States RgltrCd="CT" St="FILED" Dt="2015-01-08"/>
<States RgltrCd="NY" St="FILED" Dt="2017-03-02"/>
</NoticeFiled>
<Filing Dt="2020-03-26" FormVrsn="10/2017"/>
</Firm>
<Firm>
<Info SECRgnCD="BRO" FirmCrdNb="133693" SECNb="801-73527" BusNm="Company2" LegalNm="H/2 CREDIT MANAGER LP" UmbrRgstn="Y"/>
<MainAddr Strt1="WASHINGTON Ave"  City="NY" State="NY" Cntry="United States" PostlCd="12601" PhNb="212-555-4000" FaxNb="212-555-4178"/>
<MailingAddr/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="2012-03-26"/>
<NoticeFiled>
<States RgltrCd="CT" St="FILED" Dt="2015-01-08"/>
<States RgltrCd="NY" St="FILED" Dt="2017-03-02"/>
</NoticeFiled>
<Filing Dt="2020-03-26" FormVrsn="10/2017"/>
</Firm>
 </Firms> </IAPDFirmSECReport>')


library(xml2)
library(dplyr)
#assuming there is 1 info node for every MainAddr node
info <- xml_find_all(page, ".//Info")
addr <- xml_find_all(page, ".//MainAddr")

#extract the information from the attributes
#get Name from info node
BusName <- xml_attr(info, "BusNm")

#retrieve info from all of the attributes from MainAddr node
# and bind into a datafame
addresses <- bind_rows(xml_attrs(addr))

answer <- cbind(BusName, addresses)
answer
               BusName                    Strt1         Strt2     City State         Cntry    PostlCd         PhNb        FaxNb
1 H/2 CAPITAL PARTNERS 680 WASHINGTON BOULEVARD SEVENTH FLOOR STAMFORD    CT United States 06901-3700 203-569-4000 203-569-4178
2             Company2           WASHINGTON Ave          <NA>       NY    NY United States      12601 212-555-4000 212-555-4178