How does TfidfVectorizer calculate the TF-IDF number for each word?


As I understand TF-IDF, the IDF value of the word "art" should be log_e(3/1) + 1 ≈ 2.098612, because there are 3 documents in the data set and "art" appears in one of them. But when I print the IDF values from vectorizer.idf_, the value for "art" is 1.693147, not 2.098612.

And one more question: I calculated the TF value of "art" in the first document as 0.125, because it appears once among a total of 8 words. Does TfidfVectorizer calculate TF the same way?

This is my data and code:

from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']
vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(data)

1 Answer

Nick ODell (accepted answer)

As I understand TF-IDF, the IDF value of the word "art" should be log_e(3/1) + 1 ≈ 2.098612, because there are 3 documents in the data set and "art" appears in one of them. But when I print the IDF values from vectorizer.idf_, the value for "art" is 1.693147, not 2.098612.

By default, TfidfVectorizer has a parameter smooth_idf set to True. The effect of this is that it adds one to both the numerator and denominator of the fraction inside the logarithm. If you turn off smooth_idf, you get your expected value.

Here is the formula with smooth_idf turned on:

idf("art") = ln((3 + 1)/(1 + 1)) + 1 ≈ 1.6931
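As a quick check, both variants can be evaluated directly in plain Python; the values match the two numbers quoted in the question:

```python
import math

n_docs = 3   # documents in the corpus
df_art = 1   # documents containing "art"

# smooth_idf=True (the default): add 1 to both numerator and denominator
idf_smooth = math.log((n_docs + 1) / (df_art + 1)) + 1
print(idf_smooth)  # 1.6931471805599454

# smooth_idf=False: the formula the question expected
idf_raw = math.log(n_docs / df_art) + 1
print(idf_raw)     # 2.09861228866811
```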

Here is the part of the scikit-learn source responsible for this calculation:

# perform idf smoothing if required
df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)

# log+1 instead of log makes sure terms with zero idf don't get suppressed entirely.
idf = np.log(n_samples / df) + 1

(Source.)

See also the TfidfVectorizer documentation.
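You can also confirm this with the vectorizer itself: fitting it twice with smooth_idf toggled reproduces both numbers (a small sketch, assuming scikit-learn is installed; data is the corpus from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']

for smooth in (True, False):
    vec = TfidfVectorizer(smooth_idf=smooth)
    vec.fit(data)
    idf_art = vec.idf_[vec.vocabulary_['art']]
    print(f"smooth_idf={smooth}: idf('art') = {idf_art:.6f}")
# smooth_idf=True:  idf('art') = 1.693147
# smooth_idf=False: idf('art') = 2.098612
```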

And one more question: I calculated the TF value for the word "art" in data 1, which is equal to 0.125 because it appears once in a total of 8 words. Is my calculation correct like the TfidfVectorizer function?

No. In TfidfVectorizer, the TF is just the raw count of the term in the document; it is not divided by the document length in this step. There is a normalization step (L2 normalization of each row, by default), but it is applied after TF is multiplied by IDF.
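To make the full pipeline concrete, here is a plain-Python sketch of what TfidfVectorizer does by default for the first document: raw counts as TF, smoothed IDF, then L2 normalization of the resulting row. The tokenization is an approximation of the default analyzer (lowercase, words of 2+ characters, so '|' acts as a separator):

```python
import math
import re

docs = ['Souvenir shop|Architecture and art|Culture and history',
        'Souvenir shop|Resort|Diverse cuisine|Fishing|Shop games|Beautiful scenery',
        'Diverse cuisine|Resort|Beautiful scenery']

def tokenize(doc):
    # Roughly the default analyzer: lowercase, tokens of 2+ word characters
    return re.findall(r'\w\w+', doc.lower())

tokenized = [tokenize(d) for d in docs]
n = len(docs)
vocab = sorted({t for doc in tokenized for t in doc})

# Smoothed IDF: ln((n + 1) / (df + 1)) + 1
df = {t: sum(t in doc for doc in tokenized) for t in vocab}
idf = {t: math.log((n + 1) / (df[t] + 1)) + 1 for t in vocab}

# TF for document 0 is just the raw count -- no division by the 8-word length
doc0 = tokenized[0]
tf = {t: doc0.count(t) for t in set(doc0)}

# Multiply TF by IDF, then L2-normalize the row
raw = {t: tf[t] * idf[t] for t in tf}
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf = {t: w / norm for t, w in raw.items()}

print(tfidf['art'])  # about 0.33, not 0.125 * idf
```

The final value for "art" comes out around 0.33 because the raw weight 1 * 1.6931 is divided by the L2 norm of the whole row, not by the word count of the document.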