There are many different ways in which tf and idf can be calculated. I want to know which formula is used by gensim in its LSA model. I have been going through its source code, lsimodel.py, but it is not obvious to me where the document-term matrix is created (probably because of memory optimizations).
In one LSA paper, I read that each cell of the document-term matrix is the log-frequency of that word in that document, divided by the entropy of that word:
tf(w, d) = log(1 + frequency(w, d))
idf(w, D) = 1 / (-Σ_D p(w) log p(w))
However, this seems to be a very unusual formulation of tf-idf. A more familiar form of tf-idf is:
tf(w, d) = frequency(w, d)
idf(w, D) = log(|D| / |{d ∈ D: w ∈ d}|)
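To make the difference between the two formulations concrete, here is a minimal pure-Python sketch of both. It is not gensim code. The function names and the toy corpus are made up for illustration, and I am assuming that p in the entropy term is the share of a word's occurrences that fall in each document, which the formula above does not spell out. Both functions also assume the word occurs somewhere in the corpus.

```python
import math
from collections import Counter


def term_counts(docs):
    """Count word occurrences per document."""
    return [Counter(doc) for doc in docs]


def log_entropy(word, doc_index, counts):
    """First formulation: log(1 + freq) divided by the word's entropy over documents."""
    total = sum(c[word] for c in counts)
    # Assumption: p is the fraction of the word's occurrences found in each document.
    probs = [c[word] / total for c in counts if c[word] > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    if entropy == 0:
        # The word occurs in a single document, so the quotient is undefined as written.
        return math.inf
    return math.log(1 + counts[doc_index][word]) / entropy


def tf_idf(word, doc_index, counts):
    """Second formulation: raw frequency times log(|D| / document frequency)."""
    doc_freq = sum(1 for c in counts if c[word] > 0)
    return counts[doc_index][word] * math.log(len(counts) / doc_freq)


# Hypothetical toy corpus, already tokenized.
docs = [["cat", "sat", "cat"], ["cat", "ran"], ["dog", "ran"]]
counts = term_counts(docs)

print(log_entropy("cat", 0, counts))
print(tf_idf("cat", 0, counts))
```

Note that the first formulation breaks down for a word that appears in only one document (zero entropy), which is one reason it looks unusual to me as written.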
I also notice that there is a question on how the TfidfModel itself is implemented in gensim. However, I didn't see lsimodel.py importing TfidfModel, and therefore can only assume that lsimodel.py has its own implementation of tf-idf.
As I understand it, lsimodel.py does not perform the tf-idf encoding step. You may find some details in gensim's API documentation: there is a dedicated tf-idf model, which can be used to encode a text that can later be fed into the LSA model. From the tfidfmodel.py source code, it appears that the latter of the two tf-idf definitions you listed is followed.