How to apply a custom function to a quanteda corpus


I'm trying to migrate a script from tm to quanteda. The quanteda documentation describes a philosophy of applying changes "downstream" so that the original corpus is left unchanged. OK.

I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So, I have a csv file with 2 columns, the first column is the misspelt term and the second column is the correct version of that term.

Using tm package previously I did this:

# Write a custom function to pass to tm_map
# "spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) stri_replace_all_regex(str = x, pattern = paste0("\\b", lut[,1], "\\b"), replacement = lut[,2], vectorize_all = FALSE))
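For anyone unsure what the replacement step does on its own, here is a minimal sketch on a plain character vector, outside of any corpus object. The lookup table and sample texts are made up for illustration and stand in for the real csv.

```r
library(stringi)

# Hypothetical lookup table standing in for the two-column csv:
# column 1 = misspelt term, column 2 = corrected term.
spellingdoc <- data.frame(
  wrong = c("teh", "recieve"),
  right = c("the", "receive"),
  stringsAsFactors = FALSE
)

# Hypothetical sample texts.
texts_in <- c("teh cat will recieve teh parcel", "nothing to fix here")

# With vectorize_all = FALSE, every pattern/replacement pair is applied,
# in order, to every element of the input vector.
fixed <- stri_replace_all_regex(
  str = texts_in,
  pattern = paste0("\\b", spellingdoc[, 1], "\\b"),
  replacement = spellingdoc[, 2],
  vectorize_all = FALSE
)

print(fixed)
```

The `\\b` word boundaries keep a short misspelling from being replaced inside a longer word.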

Then within my tm corpus transformations I did this:

mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))

What is the equivalent way to apply this custom function to my quanteda corpus?


There are 2 answers

Ken Benoit (best answer)

Impossible to know if that will work from your example, which leaves some parts out, but generally:

If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.

So in your case, assuming that mycorpus is a tm corpus, you could do this:

library("quanteda")
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
    stringi::stri_replace_all_regex(str = x, 
                                    pattern = paste0("\\b", lut[,1], "\\b"), 
                                    replacement = lut[,2], 
                                    vectorize_all = FALSE)
}

# Convert the tm corpus to a quanteda corpus, then replace its texts
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
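
If you would rather leave the original corpus untouched, in line with the "downstream" philosophy mentioned in the question, one possible variation is to build a corrected copy instead of replacing the texts in place. This is only a sketch: the lookup table, the two sample documents, and the name `fix_spelling` are made up for illustration, and it assumes a quanteda version in which `as.character()` works on a corpus.

```r
library(quanteda)
library(stringi)

# Hypothetical lookup table and corpus, for illustration only.
spellingdoc <- data.frame(wrong = "teh", right = "the",
                          stringsAsFactors = FALSE)
original <- corpus(c(doc1 = "teh first text", doc2 = "teh second text"))

# Hypothetical helper: same replacement logic as above.
fix_spelling <- function(x, lut) {
  stri_replace_all_regex(x,
                         pattern = paste0("\\b", lut[, 1], "\\b"),
                         replacement = lut[, 2],
                         vectorize_all = FALSE)
}

# Extract the texts, correct them, and build a new corpus.
# The document names are passed explicitly because the stringi
# call is not relied on to keep the names of its input.
corrected <- corpus(fix_spelling(as.character(original), spellingdoc),
                    docnames = docnames(original))
```

Here `original` still holds the uncorrected texts, and `corrected` can be passed on to `tokens()` or `dfm()`. Any docvars on the original corpus would need to be copied over separately.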

Doug Fir

I think I found an indirect answer over here.

texts(myCorpus) <- myFunction(myCorpus)