How can I perform LDA (latent Dirichlet allocation) on Noun Phrases in R instead of words?


I want to generate topics from my text at the level of phrases, rather than at the level of words using LDA (latent Dirichlet allocation). How can I do that in R?

LDA treats each document as a bag of words and produces topics made up of individual words. For example, for the text "Arsenal won FA cup in two consecutive years in 2014 and 2015. They are the kings of North London.", it could yield the topic [Arsenal - 50%, FA - 20%, cup - 10%, london - 10%, king - 10%].

I want it to return the topic at the level of phrases instead, e.g., [Arsenal, fa cup, north london].


There is 1 answer

Nick Kennedy

I'm not aware of any way to extract the phrases automatically within R. However, you can preprocess the input text so that the words of each phrase are joined with underscores (or another character), which makes each phrase a single token. For example, you could do the following:

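example <- "Arsenal won FA cup in two consecutive years in 2014 and 2015. They are the kings of North London."

# Phrases that should be kept together as single tokens
phrases <- c("FA cup", "North London")
# Join the words of each phrase with underscores, e.g. "FA_cup"
phrasesJoined <- gsub(" ", "_", phrases, fixed = TRUE)
# Replace every occurrence of each phrase in the text with its joined form
for (i in seq_along(phrases)) {
  example <- gsub(phrases[i], phrasesJoined[i], example, fixed = TRUE)
}
# Build the documents and vocabulary from the modified text
lda::lexicalize(example)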
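
To check that the phrases really survive as single tokens, the same idea can be wrapped in a small function and inspected with base R alone. This is only a sketch: `join_phrases` is a hypothetical helper name, not part of the lda package, and the tokenization below (lowercase, punctuation stripped, split on spaces) is an assumption made for illustration rather than what `lda::lexicalize` does internally. Longer phrases are replaced first, on the assumption that a longer phrase should win over a shorter one it contains.

```r
# Hypothetical helper (not from any package): join each known phrase into a
# single token by replacing its internal spaces with `sep`.
join_phrases <- function(text, phrases, sep = "_") {
  # Replace longer phrases first so that, e.g., "FA cup final" would take
  # precedence over "FA cup" if both were listed.
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    text <- gsub(p, gsub(" ", sep, p, fixed = TRUE), text, fixed = TRUE)
  }
  text
}

example <- "Arsenal won FA cup in two consecutive years in 2014 and 2015. They are the kings of North London."
joined <- join_phrases(example, c("FA cup", "North London"))

# Illustrative tokenization: drop punctuation except "_", lowercase, and
# split on runs of spaces.
tokens <- strsplit(tolower(gsub("[^[:alnum:]_ ]", "", joined)), " +")[[1]]
print(tokens)
```

The printed tokens should include fa_cup and north_london as single entries, while cup and london no longer appear on their own. The matching is case-sensitive and exact, so a phrase written as "FA Cup" in the text would not be joined unless it is listed in that form.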