Defining synonyms within a corpus of Documents using R


I have a corpus of documents on a very specific topic (e.g. sports/athletics). Within that corpus, I would like to define synonyms myself. The reason is that, given two words, the synonyms() function in the wordnet package sometimes does not recognise them as synonyms, even though within the text they can be interpreted as such (for example, "fit" and "strong").

My idea is to use word associations with bigrams and trigrams, and to define a synonym when words frequently appear in the same phrase and have similar semantic content. For example, using the crude dataset from the tm package, I would do something like:

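library(tm)     # corpus handling, TermDocumentMatrix(), findAssocs()
library(RWeka)  # NGramTokenizer(), Weka_control()

data(crude)
options(mc.cores = 1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# stripWhitespace and removeSparseTerms are tm functions, not documented
# control options, so they are left out of the control list here
crudetdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
                                      removeNumbers = TRUE,
                                      stopwords = TRUE,
                                      tokenize = BigramTokenizer))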

ListAssoc <- lapply(crudetdm$dimnames$Terms, function(x) findAssocs(crudetdm, x, 0.9))

However, this returns (as expected) bigrams associated with bigrams, whereas I want to find the individual words associated with each bigram in crudetdm$dimnames$Terms (the same exercise with trigrams would be the next step). For example, using bigrams and the crude dataset, the ideal result would be a data.frame like:

Bigram              Associated Words
oil companies       policies, marketing, prices, measures, market, revenue...

I would then go through the table myself and manually select the words that I believe can be considered synonyms in my dataset (my dataset is not that big). I can think of some workarounds, such as building multiple data.frames of bigrams and trigrams and matching the words they have in common. However, I am sure there is a more elegant and efficient way of doing this in R.
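
For reference, one such workaround might look roughly like the sketch below. It is only a rough illustration under several assumptions: it puts single words and bigrams into the same term-document matrix, so that findAssocs() can correlate them directly, and then keeps only the single-word associations. The names UniBigramTokenizer, assoc_words and result are made up for this sketch, the tokenizer follows the ngrams(words(x), 2) pattern rather than RWeka, and the cleaning steps are moved into tm_map() calls. With only 20 documents in crude, a 0.9 correlation limit will also pick up many coincidental matches.

```r
library(tm)  # tm attaches NLP, which provides words() and ngrams()

data("crude")

# Clean the text before tokenizing, so bigrams are built from cleaned words
corp <- tm_map(crude, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

# Hypothetical helper: emit single words and bigrams for one document
UniBigramTokenizer <- function(x) {
  w <- words(x)
  bigrams <- vapply(ngrams(w, 2L), paste, character(1), collapse = " ")
  c(w, bigrams)
}

tdm <- TermDocumentMatrix(corp, control = list(tokenize = UniBigramTokenizer))

# Bigrams are the terms that contain a space
bigram_terms <- grep(" ", Terms(tdm), value = TRUE, fixed = TRUE)

# Correlate every bigram with the other terms, then keep single words only
assoc <- findAssocs(tdm, bigram_terms, 0.9)
assoc_words <- lapply(assoc, function(a) a[!grepl(" ", names(a), fixed = TRUE)])

# One row per bigram, with its associated words collapsed into one string
result <- data.frame(
  Bigram = names(assoc_words),
  AssociatedWords = vapply(assoc_words,
                           function(a) paste(names(a), collapse = ", "),
                           character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Something like result[result$Bigram == "oil companies", ] would then show the words to review by hand, provided that bigram survives the cleaning steps. The same tokenizer could presumably be extended with ngrams(w, 3L) for trigrams.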

Overall, my question is: given a series of bigrams and trigrams, how can I find the individual words that are associated with them?
