textstat_keyness in quanteda compares the relative frequency of words or lemmas in two (sub)corpora. I want to compare parts of speech instead of words, and then plot the result. I've been able to use textstat_keyness for words, no problem, using the following:
# compare subcorpusA v subcorpusB terms using grouping
genre <- ifelse(docvars(corpusAB, "genre") == "group", "group", "group2")
dfmat3 <- dfm(corpusAB, groups = genre)
head(tstat1 <- textstat_keyness(dfmat3, measure = "lr", sort = TRUE, correction = "williams"), 20)
tail(tstat1, 20)
head(dfmat3)
textplot_keyness(tstat1,
                 show_reference = TRUE,
                 show_legend = TRUE,
                 n = 40,
                 min_count = 5,
                 margin = 0.05,
                 color = c("darkblue", "gray"),
                 labelcolor = "gray30",
                 labelsize = 2,
                 font = NULL)
I've also tokenized the corpus using tokens(), and I've parsed using spacy_parse. But I can't figure out how to connect the two. Is there a way to tell Quanteda to run textstat_keyness on POS instead of words?
For this you will need to tag the POS, and then treat the POS as a token. This is easy with the spacyr package, which integrates nicely with quanteda.
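As a minimal sketch of the tagging step: this assumes spacyr is installed with a working spaCy backend and the `en_core_web_sm` English model, and the two short texts are made up purely for illustration.

```r
library("spacyr")

# Assumption: spaCy and the "en_core_web_sm" model are already installed
spacy_initialize(model = "en_core_web_sm")

# Two made-up example documents (hypothetical, for illustration only)
txt <- c(doc1 = "The committee met in the old hall.",
         doc2 = "We will build great things together.")

# Tag each token with its part of speech; skip lemmas and entities,
# which are not needed for a POS comparison
sp <- spacy_parse(txt, pos = TRUE, lemma = FALSE, entity = FALSE)

# One row per token, with doc_id, token and pos columns among others
head(sp)
```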
The parsed result is just a data.frame with one row per token, so we can overwrite the token column with the pos column.
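The swap itself could look like the sketch below. To keep it runnable without spaCy, it uses a small hand-built data frame as a hypothetical stand-in for `spacy_parse()` output, and rebuilds the tokens with `split()` plus `as.tokens()` on a list. On real `spacy_parse()` output, calling `as.tokens(sp)` directly should also work, since quanteda provides a method for parsed spacyr objects.

```r
library("quanteda")

# Hypothetical stand-in for spacy_parse() output (same key columns)
sp <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token  = c("The", "committee", "met", "We", "build"),
  pos    = c("DET", "NOUN", "VERB", "PRON", "VERB"),
  stringsAsFactors = FALSE
)

# Overwrite each word with its POS tag
sp$token <- sp$pos

# Rebuild a tokens object: one character vector of tags per document
toks_pos <- as.tokens(split(sp$token, sp$doc_id))
toks_pos
```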
Now, computing keyness on the POS is straightforward.
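An end-to-end sketch, under a few assumptions: quanteda version 3 or later (where `textstat_keyness()` and `textplot_keyness()` live in the quanteda.textstats and quanteda.textplots packages), a working spaCy install with `en_core_web_sm`, and the built-in `data_corpus_inaugural` corpus as example data. The object names are illustrative.

```r
library("quanteda")
library("quanteda.textstats")  # textstat_keyness() in quanteda >= 3
library("quanteda.textplots")  # textplot_keyness() in quanteda >= 3
library("spacyr")

# Assumption: spaCy and the "en_core_web_sm" model are already installed
spacy_initialize(model = "en_core_web_sm")

# Obama's and Trump's inaugural addresses from quanteda's built-in corpus
corp <- corpus_subset(data_corpus_inaugural, President %in% c("Obama", "Trump"))

# Tag the texts, then swap each word for its POS tag
sp <- spacy_parse(as.character(corp), pos = TRUE, lemma = FALSE, entity = FALSE)
sp$token <- sp$pos
toks_pos <- as.tokens(sp)

# Recover the president from document names such as "2009-Obama"
president <- sub("^.*-", "", docnames(toks_pos))

# Build a dfm of POS tags (keep them upper case) and pool by president
dfmat_pos <- dfm_group(dfm(toks_pos, tolower = FALSE), groups = president)

# Keyness of each POS tag, with Obama as the target group
tstat_pos <- textstat_keyness(dfmat_pos, target = "Obama", measure = "lr")
head(tstat_pos, 10)

# There are only a handful of POS tags, so plot a few per side
textplot_keyness(tstat_pos, n = 5)
```

For the corpus in the question, the same steps would apply with `corpusAB` in place of `corp` and the `genre` vector as the grouping variable.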
Patterns? Obama (the target) used more "adpositions" (prepositions and postpositions), while Trump used more spaces, that is, tokens that spaCy tags as SPACE. (Space Force - go figure.)