Originating from this question, my research of R (and other) documentation indicates that the SAX approach should be a faster way to parse XML data. Sadly, I couldn't find many working examples that would help me understand how to get there.
Here's a dummy file with the information that I want parsed. The real thing would have substantially more <ITEM> nodes, plus other nodes all around the tree that I would like to exclude. Another peculiarity is that the <META> section has two <DESC> elements, and I need any one of them (not both).
<FILE>
  <HEADER>
    <FILEID>12347</FILEID>
  </HEADER>
  <META>
    <DESC>
      <TYPE>A</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
    <DESC>
      <TYPE>B</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
  </META>
  <BODY>
    <ITEM>
      <IVALUE>1000</IVALUE>
      <ICODE>CDF</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>1500</IVALUE>
      <ICODE>EGK</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>300</IVALUE>
      <ICODE>TSR</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
  </BODY>
</FILE>
For the example XML above, I'm looking to get:
> data.table(fileid=12347, code="ABC", value=100000, ivalue=c(1000,1500,300), icode=c("CDF","EGK","TSR"), itype="R")
#    fileid code  value ivalue icode itype
# 1:  12347  ABC 100000   1000   CDF     R
# 2:  12347  ABC 100000   1500   EGK     R
# 3:  12347  ABC 100000    300   TSR     R
Could anyone with SAX experience guide me in building a parser to suit my needs with xmlEventParse()?
Maybe something like this?
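One possible shape for such a parser, as a minimal sketch rather than a tested solution for large files: it assumes the XML and data.table packages, uses the generic startElement, text and endElement handlers of xmlEventParse(), and keeps its state in a closure. The names make_parser, xml_text and result are hypothetical and only there for illustration. Only the first <DESC> is kept, and every element outside <FILEID>, <DESC> and <ITEM> is ignored.

```r
library(XML)
library(data.table)

# Hypothetical inline copy of the dummy file; for a real file, pass the path
# to xmlEventParse() and drop asText = TRUE.
xml_text <- '<FILE>
<HEADER><FILEID>12347</FILEID></HEADER>
<META>
<DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
<DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
</META>
<BODY>
<ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM>
</BODY>
</FILE>'

# make_parser() is a hypothetical helper: it returns the SAX handlers plus a
# result() function that assembles the collected values into a data.table.
make_parser <- function() {
  buf       <- ""             # text collected for the element being read
  fileid    <- NA_character_
  desc      <- list()         # CODE and VALUE of the first <DESC> only
  desc_done <- FALSE          # TRUE once the first <DESC> has been closed
  in_desc   <- FALSE
  in_item   <- FALSE
  item      <- list()         # fields of the <ITEM> being read
  items     <- list()         # all completed <ITEM>s

  startElement <- function(name, attrs, ...) {
    buf <<- ""
    if (name == "DESC") in_desc <<- TRUE
    if (name == "ITEM") {
      in_item <<- TRUE
      item <<- list()
    }
  }

  # Text may arrive in several chunks, so append instead of overwriting.
  text <- function(content, ...) {
    buf <<- paste0(buf, content)
  }

  endElement <- function(name, ...) {
    value <- trimws(buf)
    buf <<- ""
    if (name == "FILEID") {
      fileid <<- value
    } else if (name == "DESC") {
      in_desc <<- FALSE
      desc_done <<- TRUE
    } else if (in_desc && !desc_done && name %in% c("CODE", "VALUE")) {
      desc[[name]] <<- value
    } else if (name == "ITEM") {
      in_item <<- FALSE
      items[[length(items) + 1L]] <<- item
    } else if (in_item) {
      item[[name]] <<- value
    }
  }

  result <- function() {
    body <- rbindlist(items)
    data.table(
      fileid = as.numeric(fileid),
      code   = desc$CODE,
      value  = as.numeric(desc$VALUE),
      ivalue = as.numeric(body$IVALUE),
      icode  = body$ICODE,
      itype  = body$ITYPE
    )
  }

  list(
    handlers = list(startElement = startElement, text = text,
                    endElement = endElement),
    result = result
  )
}

p <- make_parser()
invisible(xmlEventParse(xml_text, handlers = p$handlers, asText = TRUE))
res <- p$result()
print(res)
```

Because the parser never builds the whole tree, memory use should stay roughly proportional to the number of <ITEM> rows collected rather than to the size of the file. Whether it is actually faster than a DOM/XPath approach on the real data would need to be measured.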