Plot the tf-idf value of a bigram over time


I'm trying to plot the tf-idf of a bigram from a collection of documents over a period of time, in order to detect trends in word importance. The texts come from a dataset in SQL Server. It has two columns: one is the event text, which I want to tokenize and mine; the other indicates the time period that the text belongs to (1/2010, 2/2010 and so forth). I could query SQL Server several times and create a separate corpus for each period, but that is not efficient. I'd rather run my query once and get everything back in one dataset and one unified corpus.

I have some pseudocode in mind, but I'm not sure it's the right approach.

Loop

    Get the subset of the unified corpus for a given month
    Convert the subset to a DTM
    Calculate tf-idf
    Save the tf-idf values to a list (hash table) with a key of (I am not sure yet)

Until last month

Plot the tf-idf for a given bigram over the months

Here is what I have so far, and I have no idea how to proceed. How do I subset the unified corpus into individual corpora based on the time period? Or how do I associate the month-year with the corpus? And assuming this logic is the right way to solve my problem, once I get a list of tf-idf matrices back, how do I plot the tf-idf for a given bigram?

Thank you

library(tm)
library(RWeka)

# list_text holds one character vector of event texts per period.
# VCorpus is used so that the custom tokenizer below is honoured.
list_corpora <- lapply(list_text, function(txt) VCorpus(VectorSource(txt)))

skipWords <- function(x) removeWords(x, stopwords("english"))
# tm_reduce applies this list from right to left, so lowercasing runs first.
funcs <- list(stripWhitespace, skipWords, removeNumbers, removePunctuation, content_transformer(tolower))
list_corpora <- lapply(list_corpora, function(corp) tm_map(corp, FUN = tm_reduce, tmFuns = funcs))

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Despite the name, these are term-document matrices (terms in rows), one per period.
list_dtms <- lapply(list_corpora, function(corp) TermDocumentMatrix(corp, control = list(tokenize = BigramTokenizer)))

# weightTfIdf takes the matrices, not the corpora.
list_tfxidf <- lapply(list_dtms, weightTfIdf)
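
To make the idea concrete, here is a minimal, untested sketch of how the single result set might be split by period and how one bigram's tf-idf could be plotted. It is only one possible approach. The `events` data frame, its column names, the sample texts and the `bigram_score` helper are all made up for illustration. It assumes `tm` and `RWeka` (which needs Java) are installed. It also assumes the mean over a period's documents is an acceptable summary of the bigram's tf-idf for that period.

```r
library(tm)     # assumed installed: VCorpus, TermDocumentMatrix, weightTfIdf
library(RWeka)  # assumed installed (requires Java): NGramTokenizer

# Hypothetical stand-in for the single SQL result set: one row per event text.
events <- data.frame(
  period = c("1/2010", "1/2010", "2/2010", "2/2010"),
  text   = c("server disk failure reported", "network outage reported",
             "server disk failure again", "printer jam reported"),
  stringsAsFactors = FALSE
)

# Turn "m/yyyy" into a sortable "yyyy-mm" key so periods stay in time order.
month_key <- format(as.Date(paste0("1/", events$period), format = "%d/%m/%Y"), "%Y-%m")

# One query, one split: a named list of character vectors, one per period.
list_text <- split(events$text, month_key)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Hypothetical helper: mean tf-idf of one bigram across the documents of a
# single period, or 0 when the bigram does not occur in that period.
bigram_score <- function(txt, bigram) {
  tdm <- TermDocumentMatrix(VCorpus(VectorSource(txt)),
                            control = list(tokenize = BigramTokenizer))
  m <- as.matrix(weightTfIdf(tdm))
  if (bigram %in% rownames(m)) mean(m[bigram, ]) else 0
}

# Named numeric vector: one score per period, in chronological order.
scores <- sapply(list_text, bigram_score, bigram = "disk failure")

# Base-graphics line plot with the period keys as x-axis labels.
plot(seq_along(scores), scores, type = "b", xaxt = "n",
     xlab = "Period", ylab = "Mean tf-idf of bigram")
axis(1, at = seq_along(scores), labels = names(scores))
```

The period key is the list name here, which would answer the "key of (I am not sure yet)" part of the pseudocode. One caveat: the idf is computed within each period separately, so a period with a single document, or a bigram that occurs in every document of a period, scores 0.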