Incorporating new articles in tfidf vector for online clustering

143 views Asked by At

I am building an Online news clustering system using Lucene and Mahout libraries in java. I intend to use vector space model and tfidf weights for Kmeans(or fuzzy/streamKmeans). My plan is : Cluster initial articles,assign new article to the cluster whose centroid is closest based on a small distance threshold. The leftover documents that aren’t associated with any old clusters form new data(new topics). Separately cluster them among themselves and add these temporary cluster centroids to the previous centroids. Less frequently, execute the full batch clustering to recluster the entire set of documents. The problem arises in comparing a new article to a centroid to assign it to an old cluster. The centroid dimension is number of distinct words in initial data. But the dimension of new article is different. I am following the book Mahout in Action. Is there any approach or some sort of feature extraction to handle this. The following similar links still remain unanswered: https://stats.stackexchange.com/questions/41409/bag-of-words-in-an-online-configuration-for-classification-clustering https://stats.stackexchange.com/questions/123830/vector-space-model-for-online-news-clustering Thanks in advance

1

There are 1 answers

0
Has QUIT--Anony-Mousse On

Increase the dimensionality as desired, using 0 as new values.

From a theoretical point of view, consider the vector space as infinite dimensional.