I'm a beginner at R and having a bit of trouble using the tm package. I need to extract specific data from page 55 through 300 of this and thought that R might be a good way to do so. (If anyone has a better idea, please let me know!) I did some searching and after installing the tm package and the xpdf package, I've tried reading this and tried zx8754's solution with no luck. I suspect it has something to do with the readPDF command -- I get the following:
Error in readPDF(PdftotextOptions = "-layout") : unused argument (PdftotextOptions = "-layout")
I think it has to do with trying to use the tm package and the xpdf package together, and so I read Tony Breyal's solution (I can't post more than 2 links), setting pdfinfo and pdftotext as environment variables (I'm on Win 8) and restarting. I'm sure I'm missing something -- right now I have pdftotext.exe in my working directory in R. Can anyone help me configure this so that the tm package calls the xpdf files correctly and readPDF works as it should?
Again, I'm very new to this, so apologies if I'm way off. All help would be very much appreciated.
Thanks in advance,
Justin
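Before calling readPDF, it may help to confirm that R can actually find the xpdf executables mentioned in the question. The snippet below is a minimal sketch using base R's Sys.which; it only reports what is visible on the PATH of the current R session and does not change any configuration.

```r
# Look up the xpdf command-line tools on the PATH that this R session sees.
# Sys.which() returns an empty string for any tool it cannot find.
tools <- Sys.which(c("pdftotext", "pdfinfo"))
print(tools)

# Report which tools, if any, are not visible to R.
not_found <- names(tools)[tools == ""]
if (length(not_found) > 0) {
  message("Not found on PATH: ", paste(not_found, collapse = ", "))
} else {
  message("Both tools are visible to R.")
}
```

If either entry comes back empty, the PATH change probably did not reach the R session, and restarting R (or the machine) after editing the environment variables may be needed.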
To get you started, here is an example of a complete readPDF command for reading a PDF file.

readPDF threw an error when I tried to retrieve the PDF file directly from the link you provided, so I downloaded the PDF file to my working directory first.

The code above converted the PDF file to text and stored the result in doc. doc is actually a list, as can be seen with the following code:

The text of the PDF file is stored in doc$content, while doc$meta includes various metadata about the PDF file. Each row of doc$content is a line from the PDF file. Here are lines 300 through 310 of the PDF file:

Hopefully that will help you get started.