How do you convert all the pdfs in a directory, into txt format, via R?

Question

885 views Asked by stochastiq At 21 December 2013 at 11:09

I'm trying to convert a set of PDF files in a directory on my computer into txt format so that R can read them and I can begin text mining. Do you know what is wrong with this code?

library(tm) #load text mining library
setwd('D:/Directory') #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("D:/Directory/NewsArticles"),readerControl=list(reader=readPlain))
exe <- "C:\\Program Files\\xpdfbin-win-3.03\\bin32\\pdftotext.exe"
system(paste("\"", exe, "\" \"", ae.corpus, "\"", sep = ""), wait = F)
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt); shell.exec(filetxt)    # strangely the first try always throws an error..

summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, tolower)
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
myStopwords <- c(stopwords('english'), "available", "via")
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords 


ae.tdm <- DocumentTermMatrix(ae.corpus, control = list(minWordLength = 3))
inspect(ae.tdm)
findFreqTerms(ae.tdm, lowfreq=2)
findAssocs(ae.tdm, "economic",.7)
d<- Dictionary (c("economic", "uncertainty", "policy"))
inspect(DocumentTermMatrix(ae.corpus, list(dictionary = d)))


**LkRR** · Answer 1 · 2014-03-06T12:48:30+00:00

Try this instead:

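dest <- ""           # directory containing the PDFs (the same path you would pass to setwd())
# anchor the pattern so only names ending in .pdf match
myfiles <- list.files(path = dest, pattern = "[.]pdf$", full.names = TRUE, ignore.case = TRUE)
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# with wait = FALSE the conversions run in the background, so let them finish before reading the .txt files
lapply(myfiles, function(i) system(paste('""',    # put the full path to pdftotext.exe (under Program Files) inside these quotes
                                     paste0('"', i, '"')), wait = FALSE) )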
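
A variant of the conversion step, as a minimal sketch rather than a drop-in replacement. It assumes `pdftotext` can be run from the path stored in `pdftotext_exe` (a placeholder, point it at your own install), and the helper names `txt_path_for` and `convert_pdfs` are made up for illustration. It uses `system2()`, which waits for each conversion to finish by default, and `shQuote()` so that paths containing spaces survive.

```r
# Hypothetical helper: derive the .txt path that sits next to a given PDF.
# The dot is escaped and the pattern anchored, so only the extension changes.
txt_path_for <- function(pdf_path) {
  sub("[.]pdf$", ".txt", pdf_path, ignore.case = TRUE)
}

# Hypothetical helper: convert every PDF in `dir` with pdftotext.
# `pdftotext_exe` is an assumption: replace it with the path to your copy.
convert_pdfs <- function(dir, pdftotext_exe = "pdftotext") {
  pdfs <- list.files(dir, pattern = "[.]pdf$", full.names = TRUE,
                     ignore.case = TRUE)
  for (pdf in pdfs) {
    # pdftotext takes the input PDF followed by the output text file
    status <- system2(pdftotext_exe,
                      args = c(shQuote(pdf), shQuote(txt_path_for(pdf))))
    if (status != 0) warning("pdftotext failed for ", pdf)
  }
  # Return the paths of the text files that should now exist
  invisible(txt_path_for(pdfs))
}
```

Calling `convert_pdfs(dest)` would then write one `.txt` file next to each PDF in `dest`.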

and then

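#combine files
files <- list.files(pattern = "[.]txt$")
files <- setdiff(files, "output.txt")   # skip the combined file if it exists from an earlier run
outFile <- file("output.txt", "w")
for (i in files) {
  x <- readLines(i)
  writeLines(x[2:(length(x)-1)], outFile)   # drops the first and last line of each file
}
close(outFile)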

#read data
txt<-read.table('output.txt',sep='\t', quote = "")
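
If you would rather keep one document per file instead of merging everything into `output.txt`, something along these lines may work. It is only a sketch in base R, and `read_txt_docs` is a made-up name, not a function from any package.

```r
# Hypothetical helper: read every .txt file in `dir` into a named character
# vector, one element per document, with lines joined by newlines.
read_txt_docs <- function(dir) {
  files <- list.files(dir, pattern = "[.]txt$", full.names = TRUE)
  docs <- vapply(files, function(f) {
    paste(readLines(f, warn = FALSE), collapse = "\n")
  }, character(1))
  names(docs) <- basename(files)   # name each document after its file
  docs
}
```

The resulting vector could then be handed to tm, for example with `Corpus(VectorSource(docs))`, so that each PDF stays a separate document for the term matrix.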

Hope that helps!
