I analyze brand names in text to compute KPIs such as ad recognition. However, brands that contain special characters are broken apart by my code so far.
library(qdap)
library(stringr)
test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")
wfm(test)
This is the output:
all
a 1
amazon 1
c 1
h 1
m 1
sportscheck 1
zalando 2
Is there a package or method to achieve that H&M becomes h&m, and not "h" and "m" as if it were two brands?
Edit: The wfm function has a ... argument, which should allow me to use the strip function.
wfm(test, ... = strip(test, char.keep = "&"))
Unfortunately, this does not work.
I would suggest something like this. The udpipe package has a function document_term_frequencies where you can specify the split, and it turns the data into a data.frame with the frequency counts. If there is no id column to specify, it will generate one. The resulting object of document_term_frequencies is a data.table.
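A minimal sketch of that approach, assuming udpipe is installed and that splitting on whitespace only is what you want. The whitespace regex, the tolower() call, and the variable names terms and counts are my own choices for illustration, not something the package prescribes.

```r
library(udpipe)

test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")

# Split on whitespace only, so "&" is not treated as a separator.
# Lower-casing first mimics what wfm() does by default.
terms <- document_term_frequencies(x = tolower(test), split = "[[:space:]]+")

# Each input element gets its own generated doc_id, so sum the
# frequencies per term to get overall counts.
counts <- aggregate(freq ~ term, data = terms, FUN = sum)
counts
```

With this split, "h&m" and "c&a" should each stay a single term, and "zalando" should be counted twice. Since the result is a data.table, the aggregation could also be written in data.table syntax if you prefer.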