As I understand TF-IDF, the IDF value of the word "art" should be log_e(3/1) + 1 ≈ 2.098612, because there are 3 documents in the data set and the word "art" appears in only one of them. But when I print the IDF values from vectorizer.idf_, the value for "art" is ~1.693147 instead of the 2.098612 I calculated.
And one more question: I calculated the TF value of the word "art" in document 1 as 0.125, because it appears once out of a total of 8 words. Is my calculation the same as what TfidfVectorizer does?
This is my data and code:
from sklearn.feature_extraction.text import TfidfVectorizer
data = ['Souvenir shop|Architecture and art|Culture and history',
'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
'Diverse cuisine|Resort|Beautiful scenery']
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)
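The 8-word count in document 1 can be checked against the vectorizer's own tokenization — a quick sketch using build_analyzer(), which returns the lowercasing-plus-tokenization callable that TfidfVectorizer applies internally:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = 'Souvenir shop|Architecture and art|Culture and history'

# build_analyzer() exposes the preprocessing + tokenization step that
# fit_transform applies to each document (lowercasing, default token pattern)
analyzer = TfidfVectorizer().build_analyzer()
tokens = analyzer(doc)

print(tokens)
print(len(tokens))  # 8 tokens, and 'art' appears once among them
```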
By default, TfidfVectorizer has a parameter smooth_idf set to True. The effect of this is that it adds one to both the numerator and denominator of the fraction inside the logarithm, so you get log_e((1 + 3) / (1 + 1)) + 1 ≈ 1.693147 rather than log_e(3/1) + 1 ≈ 2.098612. If you turn off smooth_idf, you get your expected value.

Here is the formula with smooth_idf turned on:

idf(t) = log_e((1 + n) / (1 + df(t))) + 1

where n is the total number of documents and df(t) is the number of documents containing the term t. Here is the part of the code responsible for this calculation. (Source.) Documentation
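You can verify the effect of the smooth_idf flag directly on the data from the question — a small sketch comparing the default against smooth_idf=False:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']

# Default smoothing: idf = log_e((1 + n) / (1 + df)) + 1
smooth = TfidfVectorizer()  # smooth_idf=True is the default
smooth.fit(data)
print(smooth.idf_[smooth.vocabulary_['art']])  # log_e(4/2) + 1 ≈ 1.693147

# No smoothing: idf = log_e(n / df) + 1
unsmoothed = TfidfVectorizer(smooth_idf=False)
unsmoothed.fit(data)
print(unsmoothed.idf_[unsmoothed.vocabulary_['art']])  # log_e(3/1) + 1 ≈ 2.098612
```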
No — in this step, TF is just the raw count of times the term appears in the document; it is not normalized by document length. There is a normalization step (L2 normalization of each row, by default), but it happens after multiplying TF by IDF.
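To see this order of operations, you can rebuild the matrix by hand — raw counts, multiplied by IDF, then L2-normalized per row — and compare with TfidfVectorizer's output. A sketch assuming the defaults (smooth_idf=True, norm='l2'); CountVectorizer with default settings produces the same alphabetically sorted vocabulary, so the columns line up:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
tfidf_matrix = tfidf.fit_transform(data).toarray()

counts = CountVectorizer().fit_transform(data).toarray()  # raw term counts (TF)
weighted = counts * tfidf.idf_                            # TF * IDF, per column
# L2-normalize each row *after* the multiplication
normed = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(normed, tfidf_matrix))  # True
```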