I have a large number of text files, each representing a message. I want to analyze them with the tm package in R, so I first need to get them into R. What is an efficient way to read all the words in the messages into R? Something like:
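txts <- Sys.glob("*.txt")
data <- data.frame()  # start empty so rbind() has something to append to
for (f in txts) {
  # scan() splits on whitespace, so each file yields one row per word
  tempData <- data.frame(word = scan(f, what = character(), quiet = TRUE))
  data <- rbind(data, tempData)  # grows (and copies) the data frame on every file
}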
simply takes forever and doesn't work very well. How do I read all the complete words in all the files and get them into R quickly?
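For reference, a minimal sketch of one possible alternative, not a definitive solution: read each file into its own character vector and combine everything once at the end, since calling rbind() inside a loop copies the growing object on every iteration. The helper name read_words and the throwaway demo files are hypothetical and only there to make the example self-contained. Passing quote = "" is an assumption about the data: it stops scan() from treating apostrophes (as in "don't") as quote characters.

```r
# Hypothetical helper: read every whitespace-separated word from a set of files.
read_words <- function(paths) {
  # One character vector per file; quote = "" keeps apostrophes as plain text
  per_file <- lapply(paths, scan, what = character(), quote = "", quiet = TRUE)
  # Combine once at the end instead of growing an object inside a loop
  unlist(per_file, use.names = FALSE)
}

# Self-contained demo using two throwaway files in a temporary directory
demo_dir <- tempfile("msgs")
dir.create(demo_dir)
writeLines(c("hello world", "don't panic"), file.path(demo_dir, "a.txt"))
writeLines("another message", file.path(demo_dir, "b.txt"))

words <- read_words(Sys.glob(file.path(demo_dir, "*.txt")))
print(words)
```

The result is a plain character vector of words; whether that or one string per file is the better input for tm depends on how the corpus is built afterwards.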
Bonus trickery: some of the files seem to have been generated oddly and have some words split up with one letter per line, like
h
e
l
l
o
Is there a way to either ignore words that are really short (already when reading them into R) or to make R string them all together?
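To make the two options concrete, here is a rough sketch that works on words after they have been read, not during reading. It assumes the words are already in a character vector; the helper name join_single_letters is hypothetical. It glues each run of consecutive one-character tokens into a single word, which also means that genuine one-letter words that happen to sit next to each other (such as "I" followed by "a") would be merged too.

```r
# Hypothetical helper: merge consecutive one-character tokens into one word.
join_single_letters <- function(words) {
  if (length(words) == 0) return(character())
  is_single <- nchar(words) == 1
  # rle() identifies runs of consecutive single-letter vs. longer tokens
  runs <- rle(is_single)
  run_id <- rep(seq_along(runs$lengths), runs$lengths)
  pieces <- split(words, run_id)
  out <- Map(function(tokens, single) {
    # Collapse a run of single letters; leave ordinary words untouched
    if (single) paste(tokens, collapse = "") else tokens
  }, pieces, runs$values)
  unlist(out, use.names = FALSE)
}

tokens <- c("say", "h", "e", "l", "l", "o", "to", "me")

joined <- join_single_letters(tokens)   # "say" "hello" "to" "me"
dropped <- tokens[nchar(tokens) > 1]    # "say" "to" "me"
```

Dropping short tokens is simpler but loses the split word entirely, whereas joining keeps it at the cost of the merging caveat above.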