I have a corpus of documents on a very specific topic (e.g. sports/athletics). Within that corpus, I would like to define synonyms myself. The reason is that, given two words, the synonyms() function within the WordNet package sometimes does not recognise them as synonyms, even though within the text they can be interpreted as such (for example, "fit" and "strong").
My idea is to use word associations with bigrams and trigrams, and to define a synonym when words appear frequently in the same phrase and have similar semantic content. For example, using the crude dataset within the tm package I would do something like:
library(tm)
library(RWeka)  # provides NGramTokenizer and Weka_control
data(crude)
options(mc.cores = 1)

# Tokenizer that emits only two-word phrases
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
crudetdm <- TermDocumentMatrix(crude, control = list(stripWhitespace = TRUE,
                                                     removePunctuation = TRUE,
                                                     removeNumbers = TRUE,
                                                     stopwords = TRUE,
                                                     removeSparseTerms = TRUE,
                                                     tokenize = BigramTokenizer))

# For every bigram, list the other bigrams correlated with it at 0.9 or above
ListAssoc <- lapply(crudetdm$dimnames$Terms, function(x) findAssocs(crudetdm, x, 0.9))
However, this returns (as expected) bigrams associated with bigrams, whereas my idea is to find individual words associated with the bigrams in crudetdm$dimnames$Terms (the same exercise with trigrams would be the next step). For example, using bigrams and the crude dataset, the ideal scenario would be to end up with a data.frame like:
Bigram          Associated Words
oil companies   policies, marketing, prices, measures, market, revenue...
I would then go through the table myself and manually select the words that I believe can be considered synonyms in my dataset (my dataset is not that big). I can think of some workarounds, such as defining multiple data.frames of bigrams and trigrams and matching common words. However, I am sure there is a more elegant and efficient way of doing this in R.
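To make the goal concrete, here is a rough base-R sketch of the kind of output I am after. It is only an illustration under my own assumptions: the four toy documents are made up (they are not from crude), the helper names (make_bigrams, count_matrix, words_for) are hypothetical, and it swaps in plain Pearson correlation between bigram counts and single-word counts across documents rather than anything tm provides out of the box.

```r
# Toy documents (made up for illustration; not taken from the crude dataset)
docs <- c(
  d1 = "oil companies raise prices as market tightens",
  d2 = "oil companies cut prices after market falls",
  d3 = "the minister discussed new trade policies",
  d4 = "analysts expect oil companies to defend prices"
)

# Split each document into lower-case word tokens
tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), function(w) w[nzchar(w)])
names(tokens) <- names(docs)

# Hypothetical helper: adjacent word pairs of one token vector
make_bigrams <- function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}
bigrams <- lapply(tokens, make_bigrams)

# Hypothetical helper: terms x documents count matrix (same orientation as a TDM)
count_matrix <- function(term_list) {
  vocab <- sort(unique(unlist(term_list)))
  counts <- vapply(term_list,
                   function(x) as.numeric(table(factor(x, levels = vocab))),
                   numeric(length(vocab)))
  matrix(counts, nrow = length(vocab),
         dimnames = list(vocab, names(term_list)))
}
uni_tdm <- count_matrix(tokens)
bi_tdm  <- count_matrix(bigrams)

# Correlate every bigram with every single word across documents
assoc <- cor(t(bi_tdm), t(uni_tdm))  # rows = bigrams, columns = words

# Hypothetical helper: words correlated with a bigram at or above a threshold,
# excluding the words that make up the bigram itself
words_for <- function(bigram, threshold = 0.9) {
  r <- assoc[bigram, ]
  parts <- strsplit(bigram, " ")[[1]]
  keep <- !is.na(r) & r >= threshold & !(names(r) %in% parts)
  names(sort(r[keep], decreasing = TRUE))
}

# One row per bigram, with its associated single words collapsed into a string
result <- data.frame(
  Bigram = rownames(bi_tdm),
  AssociatedWords = vapply(rownames(bi_tdm),
                           function(b) paste(words_for(b), collapse = ", "),
                           character(1)),
  row.names = NULL
)
```

Calling words_for("oil companies") on this toy data picks out the single words whose per-document counts move together with the bigram, which is the shape of table I would then filter by hand. I do not know whether this scales well, or how best to do the same thing directly on tm objects.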
Overall, my question is: given a series of bigrams and trigrams, how can I find the individual words that are associated with them?