Read large XML into a data frame in R

I want to read my XML file into a data frame in R. My data file is 14 GB, so my initial attempt at reading it didn't work out:

library(XML)

f <- xmlParse("Final.xml")
df <- xmlToDataFrame(f)
r <- xmlRoot(f)

The problem is that it always runs out of memory.

I've also seen the question:

How to read large (~20 GB) xml file in R?

I tried the approach from Martin Morgan's answer there, which I didn't fully understand but attempted to apply to my dataset:

library(XML)

branchFunction <- function() {
  store <- new.env()
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//Sentiment")
    value <- xmlValue(ns[[1]])
    print(value)
    # if storing something ...
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(ROW = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
  file = "Inputfile.xml",
  handlers = NULL,
  branches = myfunctions
)

myfunctions$getStore()

I would have to do that for every column separately, and the structure I get from the output is not useful.
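
For reference, here is a minimal sketch (my illustration, not something the linked answer spells out) of how the commented-out store lines could be filled in, keying each value by a running row counter instead of printing it:

library(XML)

branchFunction <- function() {
  store <- new.env()
  i <- 0L
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//Sentiment")
    # store the value under a running row index instead of printing it
    i <<- i + 1L
    store[[as.character(i)]] <- xmlValue(ns[[1]])
  }
  # return the rows in insertion order (as.list() on an environment is unordered)
  getStore <- function() { mget(as.character(seq_len(i)), envir = store) }
  list(ROW = func, getStore = getStore)
}

Even so, this still extracts one field at a time, which is exactly the limitation described above; the answer below collects all children of each ROW instead.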

The structure of my data looks like this:

<ROWSET>
<ROW>
    <Field1>21706</Field1>
    <PostId>19203</PostId>
    <ThreadId>38</ThreadId>
    <UserId>1397</UserId>
    <TimeStamp>1407351854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>egergeg</ReplyTo>
    <Content>dsfg</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
    <Field1>217</Field1>
    <PostId>1903</PostId>
    <ThreadId>8</ThreadId>
    <UserId>197</UserId>
    <TimeStamp>1407351854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>sdrwer</ReplyTo>
    <Content>wer</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
    <Field1>21306</Field1>
    <PostId>19103</PostId>
    <ThreadId>78</ThreadId>
    <UserId>13497</UserId>
    <TimeStamp>1407321854</TimeStamp>
    <Upvotes>0</Upvotes>
    <Downvotes>0</Downvotes>
    <Flagged>f</Flagged>
    <Approved>t</Approved>
    <Deleted>f</Deleted>
    <Replies>0</Replies>
    <ReplyTo>tzjtj</ReplyTo>
    <Content>rtgr</Content>
    <Sentiment>Neutral</Sentiment>
</ROW>
</ROWSET>

1 Answer

Answered by eblondel:

In your case, since you are dealing with a big dataset, you should indeed use xmlEventParse, which relies on SAX, i.e. the Simple API for XML. The advantage over xmlParse is that the XML tree is never loaded into R as a whole, which would exhaust memory when the data is really big.

I don't have a big dataset at hand, so I cannot test under real conditions, but you can try this code snippet:

library(XML)

xmlDoc <- "Final.xml"
result <- NULL

# branch function to use with xmlEventParse
row.sax <- function() {
  ROW <- function(node) {
    children <- xmlChildren(node)
    # drop the whitespace-only text nodes between the elements
    children[which(names(children) == "text")] <- NULL
    # append this row's values to the result matrix
    result <<- rbind(result, sapply(children, xmlValue))
  }
  branches <- list(ROW = ROW)
  return(branches)
}

# call xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

# and here is your data.frame
result <- as.data.frame(result, stringsAsFactors = FALSE)

Let me know how it runs!
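
A closing caveat (an editor's note, not part of the original answer): growing result with rbind inside the ROW handler copies the accumulated matrix on every row, which is quadratic in the number of rows and will crawl on a 14 GB file. A sketch of one alternative, under the same assumptions as the snippet above, that stashes rows in an environment and binds once at the end:

library(XML)

rows <- new.env()
n <- 0L

row.sax <- function() {
  ROW <- function(node) {
    children <- xmlChildren(node)
    children[names(children) == "text"] <- NULL
    n <<- n + 1L
    # stash each row under its index; earlier rows are never copied
    rows[[as.character(n)]] <- sapply(children, xmlValue)
  }
  list(ROW = ROW)
}

xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

# bind all rows once at the end, preserving row order
# (assumes every ROW carries the same set of fields)
result <- as.data.frame(
  do.call(rbind, mget(as.character(seq_len(n)), envir = rows)),
  stringsAsFactors = FALSE
)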