Parsing XML the SAX way in R


Following on from this question, my reading of the R (and other) documentation suggests that a SAX approach would be a faster way to parse XML data. Sadly, I couldn't find many working examples to help me understand how to get there.

Here's a dummy file with the information that I want parsed. The real thing would have substantially more <ITEM> nodes, plus other nodes throughout the tree that I would like to exclude. Another peculiarity is that the <META> section has two <DESC> elements, and I need only one of them (either one, not both).

<FILE>
  <HEADER>
    <FILEID>12347</FILEID>
  </HEADER>
  <META>
    <DESC>
      <TYPE>A</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
    <DESC>
      <TYPE>B</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
  </META>
  <BODY>
    <ITEM>
      <IVALUE>1000</IVALUE>
      <ICODE>CDF</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>1500</IVALUE>
      <ICODE>EGK</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>300</IVALUE>
      <ICODE>TSR</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
  </BODY>
</FILE>

For the example XML above I'm looking to get

> data.table(fileid=12347, code="ABC", value=100000, ivalue=c(1000,1500,300), icode=c("CDF","EGK","TSR"), itype="R")
#    fileid code value ivalue icode itype
# 1:  12347  ABC 1e+05   1000   CDF     R
# 2:  12347  ABC 1e+05   1500   EGK     R
# 3:  12347  ABC 1e+05    300   TSR     R

Could anyone with SAX experience guide me in building a parser that suits my needs with xmlEventParse()?

There are 2 answers

user227710 On

Maybe something like this?

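library(rvest)
library(data.table)

# read_html() uses an HTML parser, which lower-cases tag names,
# hence the lower-case selectors below
test <- read_html("test.html")
data.table(do.call(cbind, lapply(
  c("fileid", "code", "value", "ivalue", "icode", "itype"),
  function(i) {
    test %>%
      html_nodes(i) %>%
      html_text()
  }
)))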

         V1  V2     V3   V4  V5 V6
    1: 12347 ABC 100000 1000 CDF  R
    2: 12347 ABC 100000 1500 EGK  R
    3: 12347 ABC 100000  300 TSR  R
eblondel On

The Simple API for XML (SAX) may parse XML data faster than other approaches, but in general it will not give you better results than, for example, XPath. Its advantage shows with bigger files: SAX does not load the complete tree into R, so it avoids potential memory problems.
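
For comparison, here is a minimal sketch of the tree-based XPath route mentioned above, using xmlParse() and xpathSApply() from the XML package. The inline xml_text string and the name xpath_result are only illustrative, and the sketch assumes every <ITEM> carries all three child elements (otherwise the columns would not line up). Unlike SAX, this builds the whole document tree in memory first.

```r
library(XML)

# Inline, compacted copy of the sample document so the sketch is self-contained
xml_text <- '<FILE>
  <HEADER><FILEID>12347</FILEID></HEADER>
  <META>
    <DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
    <DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
  </META>
  <BODY>
    <ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>
    <ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>
    <ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM>
  </BODY>
</FILE>'

# Parse the full tree into memory (this is the step SAX avoids)
doc <- xmlParse(xml_text, asText = TRUE)

# Single values: the file id and the first of the two DESC elements
fileid <- xpathSApply(doc, "/FILE/HEADER/FILEID", xmlValue)
code   <- xpathSApply(doc, "/FILE/META/DESC[1]/CODE", xmlValue)
value  <- xpathSApply(doc, "/FILE/META/DESC[1]/VALUE", xmlValue)

# One row per ITEM; the length-1 values above are recycled by data.frame()
xpath_result <- data.frame(
  fileid = fileid,
  code   = code,
  value  = as.numeric(value),
  ivalue = as.numeric(xpathSApply(doc, "/FILE/BODY/ITEM/IVALUE", xmlValue)),
  icode  = xpathSApply(doc, "/FILE/BODY/ITEM/ICODE", xmlValue),
  itype  = xpathSApply(doc, "/FILE/BODY/ITEM/ITYPE", xmlValue),
  stringsAsFactors = FALSE
)
```

For a file that fits comfortably in memory this is usually the shorter route; SAX starts to pay off when the tree itself becomes too large to hold.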

To use SAX, you can start from the code example below, which relies on the branches argument of xmlEventParse (one branch per kind of node you want to retrieve):

library(XML)

#a file to read with xmlEventParse
xmlDoc <- "example.xml"

#accumulators filled by the branch functions below
desc <- NULL
items <- NULL

#function to use with xmlEventParse
row.sax = function() {

    #SAX function for Meta 'DESC'
    DESC = function(node){
        children <- xmlChildren(node)
        children[which(names(children) == "text")] <- NULL
        desc <<- rbind(desc, sapply(children,xmlValue))
    }

    #SAX function for Body 'ITEM'
    ITEM = function(node){
        children <- xmlChildren(node)
        children[which(names(children) == "text")] <- NULL
        items <<- rbind(items, sapply(children,xmlValue))
    }

    branches <- list(DESC = DESC, ITEM = ITEM)
    return(branches)
}

#call the xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

#processing the result as data.frame
desc <- as.data.frame(desc, stringsAsFactors = FALSE)
#keep only the first DESC and repeat it once per ITEM
desc <- desc[rep(1, nrow(items)), ]

items <- as.data.frame(items, stringsAsFactors = FALSE)

result <- cbind(desc, items)
row.names(result) <- seq_len(nrow(result))
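
If you would rather not build any node objects at all, the same ITEM rows can probably be collected with plain event handlers (startElement, text, endElement) instead of branches. The following is only a sketch under assumptions: make_item_collector, xml_text and sax_items are hypothetical names introduced here, the XML is passed inline with asText = TRUE to keep the example self-contained, and only the <ITEM> fields are gathered.

```r
library(XML)

# Inline, compacted copy of the sample document (illustrative only)
xml_text <- '<FILE>
  <HEADER><FILEID>12347</FILEID></HEADER>
  <BODY>
    <ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>
    <ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>
    <ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM>
  </BODY>
</FILE>'

# Hypothetical helper: a closure that keeps the parsing state between events
make_item_collector <- function(wanted = c("IVALUE", "ICODE", "ITYPE")) {
  rows    <- list()  # completed ITEM records
  current <- NULL    # fields of the ITEM being read, NULL outside an ITEM
  field   <- NULL    # name of the wanted element we are inside, if any
  buffer  <- ""      # text collected for that element

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "ITEM") {
        current <<- list()
      } else if (!is.null(current) && name %in% wanted) {
        field  <<- name
        buffer <<- ""
      }
    },
    # text may arrive in several chunks, so append rather than overwrite
    text = function(content, ...) {
      if (!is.null(field)) buffer <<- paste0(buffer, content)
    },
    endElement = function(name, ...) {
      if (!is.null(field) && name == field) {
        current[[field]] <<- trimws(buffer)
        field <<- NULL
      } else if (name == "ITEM") {
        rows[[length(rows) + 1L]] <<- current
        current <<- NULL
      }
    }
  )

  list(
    handlers = handlers,
    # bind the collected records into one data.frame, one row per ITEM
    result = function() {
      do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
    }
  )
}

collector <- make_item_collector()
invisible(xmlEventParse(xml_text, handlers = collector$handlers, asText = TRUE))
sax_items <- collector$result()
```

All values come back as character, so convert IVALUE with as.numeric() if needed. The same pattern could be extended to FILEID and the first DESC by tracking a few more element names in the closure.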

Let me know if it works for you.