I'm running an experiment that involves text documents, and I need to calculate the (cosine) similarity matrix between all of them (to use in another calculation). For that I use sklearn's TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [doc1, doc2, doc3, doc4]
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tfidf = vect.fit_transform(corpus)            # L2-normalized tf vectors (norm="l2" is the default)
similarities = tfidf * tfidf.T                # dot products of normalized vectors = cosine similarities
pairwise_similarity_matrix = similarities.A   # dense n_docs x n_docs matrix
The problem is that with each iteration of my experiment I discover new documents that need to be added to the similarity matrix, and given the number of documents I'm working with (tens of thousands and more), recomputing it is very time consuming.
I wish to find a way to calculate only the similarities between the new batch of documents and the existing ones, without recomputing everything over the entire data set.
Note that I'm using a term-frequency (tf) representation, without using inverse-document-frequency (idf), so in theory I don't need to re-calculate the whole matrix each time.
OK, I got it. The idea is, as I said, to calculate the similarities only between the new batch of documents and the existing ones, whose similarities to each other are unchanged. The problem is keeping the TfidfVectorizer's vocabulary updated with the newly seen terms.
The solution has 2 steps:
1. Extend the vectorizer's vocabulary with the terms that appear only in the new batch, and compute the tf vectors of the new documents in that combined vocabulary space.
2. Compute only the similarities between the new batch and the existing documents (and among the new documents themselves), and append them to the existing similarity matrix.
Here's the whole script. First, we have the original corpus and the objects and matrices trained and calculated on it:
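Something along these lines (the documents here are just placeholders, and similarities_matrix is the name I use for the dense matrix we keep growing):

from sklearn.feature_extraction.text import TfidfVectorizer

# placeholder documents standing in for doc1 .. doc4
corpus = [
    "machine learning with text documents",
    "cosine similarity between tf vectors",
    "adding new documents to an experiment",
    "sklearn TfidfVectorizer without idf",
]

vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tfidf = vect.fit_transform(corpus)            # L2-normalized tf vectors
similarities_matrix = (tfidf * tfidf.T).A     # dense pairwise cosine similarities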
Now, given new documents:
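A sketch of the update step, assuming new_docs_corpus is a placeholder for the new batch: extend the existing vocabulary with the unseen terms, transform only the new documents, pad the old tf matrix with zero columns, and compute just the missing blocks:

from scipy import sparse
import numpy as np

# placeholder batch of newly discovered documents
new_docs_corpus = [
    "new batch of text documents",
    "updating the vocabulary with unseen terms",
]

# step 1: extend the existing vocabulary with terms that appear only in the new batch
new_terms_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
new_terms_vect.fit(new_docs_corpus)
combined_vocabulary = dict(vect.vocabulary_)              # existing term -> column index
for term in new_terms_vect.vocabulary_:
    if term not in combined_vocabulary:
        combined_vocabulary[term] = len(combined_vocabulary)

# tf vectors of the new batch only, in the combined vocabulary space
fixed_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False,
                             vocabulary=combined_vocabulary)
new_tf = fixed_vect.fit_transform(new_docs_corpus)

# pad the old tf matrix with zero columns for the newly added terms;
# the old rows (and their norms) are unchanged, so old-vs-old similarities stay valid
n_added = len(combined_vocabulary) - tfidf.shape[1]
old_tf = sparse.hstack(
    [tfidf, sparse.csr_matrix((tfidf.shape[0], n_added))]
).tocsr()

# step 2: compute only the missing blocks and grow the similarity matrix
new_vs_old = (new_tf * old_tf.T).A     # shape (n_new, n_old)
new_vs_new = (new_tf * new_tf.T).A     # shape (n_new, n_new)
similarities_matrix = np.block([
    [similarities_matrix, new_vs_old.T],
    [new_vs_old,          new_vs_new],
])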
We can check this by comparing the calculated similarities_matrix with the one we get when we train a TfidfVectorizer on the joint corpus corpus + new_docs_corpus. As discussed in the comments, we can do all of this only because we are not using the idf (inverse-document-frequency) element, which would change the representation of the existing documents whenever new ones are added.
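A possible way to run that check, continuing from the snippets above:

# sanity check: retraining on the joint corpus should give the same similarities
# (only because idf is disabled; column order may differ, but cosine values do not)
check_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
check_tf = check_vect.fit_transform(corpus + new_docs_corpus)
check_matrix = (check_tf * check_tf.T).A

print(np.allclose(similarities_matrix, check_matrix))   # expected: True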