I'm trying to migrate a script from using tm to quanteda. Reading the quanteda documentation there is a philosophy about applying changes "downstream" so that the original corpus is unchanged. OK.
I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So, I have a csv file with 2 columns, the first column is the misspelt term and the second column is the correct version of that term.
Using tm package previously I did this:
# Write a custom function to pass to tm_map
# "Spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) stri_replace_all_regex(str = x, pattern = paste0("\\b", lut[,1], "\\b"), replacement = lut[,2], vectorize_all = FALSE))
Then within my tm corpus transformations I did this:
mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))
What is the equivilent way to apply this custom function to my quanteda corpus?
Impossible to know if that will work from your example, which leaves some parts out, but generally:
If you want to access texts in a quanteda corpus, you can use
texts()
, and to replace those texts,texts()<-
.So in your case, assuming that
mycorpus
is a tm corpus, you could do this: