invalid input '📧' in 'utf8towcs' when using tm and pdftools


My work was going smoothly until I ran into a problem: some of my PDF files contain unusual symbols such as "📧".

I have reviewed the older discussion "R tm package invalid input in 'utf8towcs'", but none of the solutions there worked for me.

This is my code so far:

setwd("E:/OneDrive/Thesis/Received comments document/Consultation 50")
getwd()
library(tm)
library(NLP)
library(tidytext)
library(dplyr)
library(pdftools)
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
corp <- VCorpus(VectorSource(comments))
names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                        stopwords = TRUE,
                                                        tolower = TRUE,
                                                        stemming = TRUE,
                                                        removeNumbers = TRUE,
                                                        bounds = list(global = c(3, Inf))))

Results in: Error in .tolower(txt) : invalid input '📧' in 'utf8towcs'

inspect(Comments.tdm[1:32,])

ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 50.csv")

Any help is much appreciated. P.S. This code worked perfectly on other PDFs.

There is 1 answer

David van Oostveen

I took another look at the earlier discussion, and this solution finally worked for me:

myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))

Remember to follow Fransisco's instructions: "Chad's solution wasn't working for me. I had this embedded in a function and it was giving an error about iconv needing a vector as input. So, I decided to do the conversion before creating the corpus."
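To see what the sub argument does, here is a minimal sketch on a made-up string (not text from my PDFs). The iconv() call above leaves from and to at their defaults, so its result depends on the session's native encoding. The sketch names both encodings explicitly, which is an assumption on my part, so that the output is the same on every platform.

```r
# Illustrative string containing the e-mail emoji (U+1F4E7); not taken from the PDFs.
x <- enc2utf8("Contact \U0001F4E7 us")

# Replace each byte that cannot be represented in ASCII with its hex code.
as_bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
print(as_bytes)   # "Contact <f0><9f><93><a7> us"

# Alternatively, drop the non-convertible bytes entirely.
dropped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(dropped)

# The cleaned string is plain ASCII, so tolower() handles it without complaint.
print(tolower(as_bytes))
```

With sub = "byte" the hex codes stay in the text and may show up as junk fragments once punctuation is removed. Using sub = "" is probably the cleaner choice if the symbols carry no meaning for the analysis.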

My code now looks like this:

files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
# lapply() keeps one list element per PDF, so names(corp) <- files still lines up
comments <- lapply(comments, function(x) iconv(enc2utf8(x), sub = "byte"))
corp <- VCorpus(VectorSource(comments))
names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                        stopwords = TRUE,
                                                        tolower = TRUE,
                                                        stemming = TRUE,
                                                        removeNumbers = TRUE,
                                                        bounds = list(global = c(3, Inf)))) 

inspect(Comments.tdm[1:28,])

ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 44.csv")
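If you want to know which PDFs contain the offending symbols before cleaning them, something like the following might help. It is only a sketch: the comments list here is a hypothetical stand-in for the output of lapply(files, pdf_text), and the file names are made up. It relies on iconv() returning NA for any string it cannot convert when sub is not given.

```r
# Hypothetical stand-in for lapply(files, pdf_text): one character vector of pages per PDF.
comments <- list(
  "a.pdf" = c("Plain first page", "Plain second page"),
  "b.pdf" = enc2utf8(c("Reach us \U0001F4E7 here", "Nothing unusual"))
)

# Without `sub`, iconv() returns NA for strings it cannot convert,
# so NA marks the pages that contain non-ASCII characters.
has_non_ascii <- lapply(comments, function(pages) {
  is.na(iconv(pages, from = "UTF-8", to = "ASCII"))
})
print(has_non_ascii)

# Names of the files with at least one flagged page.
flagged <- names(Filter(any, has_non_ascii))
print(flagged)   # "b.pdf"
```

This flags every non-ASCII character, including legitimate accented letters, so treat the result as a list of pages to inspect rather than a list of errors.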