I'm trying to plot the TF-IDF of bigrams from a collection of documents over time, to detect trends in word importance. The texts come from a SQL Server dataset with two columns: one is the event text that I want to tokenize and mine, the other indicates the time period the text belongs to (1/2010, 2/2010, and so forth). I could query SQL Server several times and create a separate corpus for each period, but that is not efficient. I'd rather run my query once and get everything back as one dataset and one unified corpus.
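For the single-query approach, one way to associate each month with its own set of texts is base R's `split()`. This is a minimal sketch assuming the query result is a data frame (hypothetically named `events` here, with columns `event_text` and `period` standing in for your real column names):

```r
# Hypothetical example data standing in for the SQL result:
# one row per event, with its text and its period.
events <- data.frame(
  event_text = c("server outage reported", "server outage resolved",
                 "disk failure detected", "disk failure fixed"),
  period     = c("1/2010", "1/2010", "2/2010", "2/2010"),
  stringsAsFactors = FALSE
)

# split() returns a named list: one character vector of texts per period,
# with the period string itself as the list name (a natural hash-table key).
list_text <- split(events$event_text, events$period)

names(list_text)               # "1/2010" "2/2010"
length(list_text[["1/2010"]])  # 2
```

Note that `split()` orders the list elements by sorted factor levels, so with labels like "10/2010" you may want to reorder the list into true chronological order afterwards.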
I have some pseudocode in mind, but I'm not sure it's the right approach:
while months remain
    get the subset of the unified corpus for the current month
    convert the subset to a document-term matrix
    calculate tf-idf
    save the tf-idf values to a list (hash table) keyed by (not sure yet)
until last month
plot the tf-idf for a given bigram across the months
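The loop above maps naturally onto a single `lapply` over a named list, and the month string itself can serve as the key you were unsure about. A minimal sketch, assuming `list_text` is a named list of character vectors (one element per month, e.g. `list_text[["1/2010"]]`) and ignoring the bigram tokenizer for the moment:

```r
library(tm)

# One tf-idf matrix per month; the month names carry over as list names.
list_tfxidf <- lapply(list_text, function(texts) {
  corp <- Corpus(VectorSource(texts))  # month's subset -> corpus
  dtm  <- TermDocumentMatrix(corp)     # corpus -> term-document matrix
  weightTfIdf(dtm)                     # apply tf-idf weighting
})

# names(list_tfxidf) are the months, so list_tfxidf[["1/2010"]]
# is the tf-idf matrix for January 2010.
```

Because the result is a named list, no separate key bookkeeping is needed: the period label retrieved from SQL is the key.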
I have the code below so far and haven't figured out how to proceed. How do I subset the unified corpus into individual corpora based on the time period? Or how do I associate the month/year with each corpus? And assuming the logic above is the right way to solve my problem, once I get a list of tf-idf matrices back, how do I plot the tf-idf for a given bigram?
Thank you
library(tm)
library(RWeka)

# One corpus per period; list_text is the list of per-month text vectors.
list_corpora <- lapply(list_text, function(texts) Corpus(VectorSource(texts)))

# Clean-up: lower-case, strip punctuation/numbers/whitespace, drop stopwords.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
list_corpora <- lapply(list_corpora, function(corp) tm_map(corp, FUN = tm_reduce, tmFuns = funcs))

# Bigram tokenizer for the term-document matrices.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
list_dtms <- lapply(list_corpora, function(corp) TermDocumentMatrix(corp, control = list(tokenize = BigramTokenizer)))

# tf-idf is computed on the term-document matrices, not the corpora.
list_tfxidf <- lapply(list_dtms, weightTfIdf)
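For the plotting question, here is a sketch assuming `list_tfxidf` is a named list of tf-idf weighted term-document matrices keyed by month, and using a hypothetical bigram of interest. Since a month can contain several documents, the per-document tf-idf values are averaged here (one of several reasonable aggregation choices):

```r
bigram <- "server outage"  # hypothetical bigram of interest

# For each month, pull the bigram's row and average its tf-idf over the
# documents in that month (0 if the bigram never occurs that month).
scores <- sapply(list_tfxidf, function(m) {
  if (bigram %in% rownames(m)) mean(as.matrix(m[bigram, ])) else 0
})

# Plot the trend; names(scores) carry the month labels from the list keys.
plot(seq_along(scores), scores, type = "b", xaxt = "n",
     xlab = "Month", ylab = "tf-idf", main = bigram)
axis(1, at = seq_along(scores), labels = names(scores))
```

If you want several bigrams on one chart, building a data frame of (month, bigram, score) rows and handing it to ggplot2 with `geom_line()` is a common alternative.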