I want to read my XML into a data frame in R. My initial data file is 14 GB, so my first attempt at reading the file didn't work out:
f=xmlParse("Final.xml")
df=xmlToDataFrame(f)
r=xmlRoot(f)
The problem is that it is always running out of memory....
I've also seen the question:
How to read large (~20 GB) xml file in R?
I tried the approach from Martin Morgan, which I didn't fully understand but attempted to apply to my dataset:
library(XML)
branchFunction <- function() {
  store <- new.env()
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//Sentiment")
    value <- xmlValue(ns[[1]])
    print(value)
    # if storing something ...
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(ROW = func, getStore = getStore)
}
myfunctions <- branchFunction()
xmlEventParse(
  file = "Inputfile.xml",
  handlers = NULL,
  branches = myfunctions
)
myfunctions$getStore()
I would have to do that for every column separately, and the structure I'm getting from the output is not useful.
The structure of my data looks like:
<ROWSET>
<ROW>
<Field1>21706</Field1>
<PostId>19203</PostId>
<ThreadId>38</ThreadId>
<UserId>1397</UserId>
<TimeStamp>1407351854</TimeStamp>
<Upvotes>0</Upvotes>
<Downvotes>0</Downvotes>
<Flagged>f</Flagged>
<Approved>t</Approved>
<Deleted>f</Deleted>
<Replies>0</Replies>
<ReplyTo>egergeg</ReplyTo>
<Content>dsfg</Content>
<Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
<Field1>217</Field1>
<PostId>1903</PostId>
<ThreadId>8</ThreadId>
<UserId>197</UserId>
<TimeStamp>1407351854</TimeStamp>
<Upvotes>0</Upvotes>
<Downvotes>0</Downvotes>
<Flagged>f</Flagged>
<Approved>t</Approved>
<Deleted>f</Deleted>
<Replies>0</Replies>
<ReplyTo>sdrwer</ReplyTo>
<Content>wer</Content>
<Sentiment>Neutral</Sentiment>
</ROW>
<ROW>
<Field1>21306</Field1>
<PostId>19103</PostId>
<ThreadId>78</ThreadId>
<UserId>13497</UserId>
<TimeStamp>1407321854</TimeStamp>
<Upvotes>0</Upvotes>
<Downvotes>0</Downvotes>
<Flagged>f</Flagged>
<Approved>t</Approved>
<Deleted>f</Deleted>
<Replies>0</Replies>
<ReplyTo>tzjtj</ReplyTo>
<Content>rtgr</Content>
<Sentiment>Neutral</Sentiment>
</ROW>
</ROWSET>
In your case, since you deal with big datasets, you should indeed use xmlEventParse, which relies on SAX, i.e. the Simple API for XML. The advantage of this over using xmlParse is that the XML tree is never loaded into R as a whole (which can exhaust memory when the data is really big). I don't have a big dataset at hand, so I cannot test in real conditions, but you can try this code snippet:
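Here is a sketch (untested at this scale), assuming the ROWSET/ROW layout you posted and the file name Inputfile.xml from your own attempt. Instead of handling one column at a time, each ROW branch is flattened into a named character vector and collected in an environment; at the end everything is stacked into a single data frame:

```r
library(XML)

branchFunction <- function() {
  store <- new.env()
  i <- 0L
  # called once per complete <ROW> subtree; the rest of the
  # document is never held in memory at the same time
  row_handler <- function(node, ...) {
    i <<- i + 1L
    # xmlChildren() gives the field nodes; their names (Field1,
    # PostId, ...) become the column names
    store[[as.character(i)]] <- vapply(xmlChildren(node), xmlValue, character(1))
  }
  getStore <- function() {
    rows <- as.list(store)
    # restore the original row order, then stack into a data frame
    rows <- rows[order(as.integer(names(rows)))]
    as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  }
  list(ROW = row_handler, getStore = getStore)
}

myfunctions <- branchFunction()
xmlEventParse(
  file = "Inputfile.xml",
  handlers = list(),
  branches = myfunctions
)
df <- myfunctions$getStore()
```

Note that this still accumulates all rows in RAM; only the XML tree itself is streamed. If even the resulting data frame is too big, you could instead append each batch of rows to a CSV file on disk inside the handler (e.g. with write.table(..., append = TRUE)) every few thousand rows.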
Let me know how it runs!