invalid input 'ðŸ“§' in 'utf8towcs' when using tm and pdftools

Question

Asked by David van Oostveen on 16 May 2017 at 19:55

My work was going along smoothly, but I ran into problems because some of my PDF files contain unusual symbols ("ðŸ“§").

I have reviewed the older discussion, "R tm package invalid input in 'utf8towcs'", but none of the solutions there worked.

This is my code so far:

setwd("E:/OneDrive/Thesis/Received comments document/Consultation 50")
getwd()
library(tm)
library(NLP)
library(tidytext)
library(dplyr)
library(pdftools)
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
corp <- Corpus(VectorSource(comments))
corp <- VCorpus(VectorSource(comments))
names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                        stopwords = TRUE,
                                                        tolower = TRUE,
                                                        stemming = TRUE,
                                                        removeNumbers = TRUE,
                                                        bounds = list(global = c(3, Inf))))

Results in: Error in .tolower(txt) : invalid input 'ðŸ“§' in 'utf8towcs'

inspect(Comments.tdm[1:32,])

ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 50.csv")

Any help is much appreciated. P.S. This code worked perfectly on other PDFs.

1 answer

**David van Oostveen** · Answer 1 · 2017-05-18T08:47:26+00:00

I took another look at the earlier discussion, and this solution finally worked for me:

myCleanedText <- sapply(myText, function(x) iconv(enc2utf8(x), sub = "byte"))
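
For anyone wondering what `sub = "byte"` does, here is a minimal sketch in base R. The sample strings are made up, and it names the encodings explicitly (UTF-8 in, ASCII out) instead of relying on the session locale as the one-liner above does, so treat it as a variant under those assumptions rather than the exact call from the earlier discussion. It also assumes the extracted text is UTF-8.

```r
# Build the problem symbol (U+1F4E7) from its raw UTF-8 bytes so the
# example does not depend on the encoding of this script or the locale.
emoji <- rawToChar(as.raw(c(0xf0, 0x9f, 0x93, 0xa7)))

# Hypothetical sample text standing in for pages extracted from a PDF
pages <- c("Contact us by post", paste("Contact us by", emoji, "today"))

# Convert UTF-8 -> ASCII. Bytes with no ASCII equivalent are replaced by
# <xx> hex escapes instead of turning the whole string into NA.
cleaned <- iconv(pages, from = "UTF-8", to = "ASCII", sub = "byte")

# The cleaned strings are plain ASCII, so lower-casing is safe
lowered <- tolower(cleaned)
print(lowered)
```

The cost is that the symbol stays in the text as hex escapes. Passing `sub = ""` instead drops the offending bytes entirely, which may be preferable if they would otherwise show up as terms.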

Remember to follow Fransisco's instructions: "Chad's solution wasn't working for me. I had this embedded in a function and it was giving an error about iconv needing a vector as input. So, I decided to do the conversion before creating the corpus."
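
The "convert before creating the corpus" point can be sketched without any PDFs. The list below is a made-up stand-in for the result of `lapply(files, pdf_text)` (one character vector of pages per file), and `clean_pages` is a hypothetical helper name, not part of tm or pdftools. The idea is to clean each list element as a character vector and then collapse the pages, so the corpus source gets one plain string per file.

```r
# UTF-8 bytes of the problem symbol (U+1F4E7), built from raw values
emoji <- rawToChar(as.raw(c(0xf0, 0x9f, 0x93, 0xa7)))

# Hypothetical stand-in for lapply(files, pdf_text)
comments <- list(
  c("Page one of file A", "Page two of file A"),
  paste("Mail", emoji, "us")
)

# iconv() operates on a character vector, so apply it per list element.
# Assumption: the text is UTF-8; ASCII is chosen as the target here.
clean_pages <- function(pages) {
  iconv(pages, from = "UTF-8", to = "ASCII", sub = "byte")
}
comments_clean <- lapply(comments, clean_pages)

# Collapse the pages of each file into a single document string
docs <- vapply(comments_clean, paste, character(1), collapse = "\n")
str(docs)
```

`docs` is then an ordinary character vector with one element per file, which is the kind of input `VectorSource()` expects.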

My code now looks like this:

files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
comments <- sapply(comments, function(x) iconv(enc2utf8(x), sub = "byte"))
corp <- Corpus(VectorSource(comments))

corp <- VCorpus(VectorSource(comments))
names(corp) <- files
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                        stopwords = TRUE,
                                                        tolower = TRUE,
                                                        stemming = TRUE,
                                                        removeNumbers = TRUE,
                                                        bounds = list(global = c(3, Inf))))

inspect(Comments.tdm[1:28,])

ap_td <- tidy(Comments.tdm)
write.csv(ap_td, file = "Terms 44.csv")
