How to extract calculations using tf-idf


I used TfidfVectorizer to extract TF-IDF values, but I don't know how it calculates the results. When I calculate them manually, I get a different answer, so I want to extract the values that the function calculates in order to learn how it works.

from sklearn.feature_extraction.text import TfidfVectorizer

data = ['Souvenir shop|Architecture and art|Culture and history', 'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery', 'Diverse cuisine|Resort|Beautiful scenery']

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data)

There is 1 answer

petezurich (best answer)

Have a look at the Attributes section of the scikit-learn documentation for TfidfVectorizer.

Try this:

print(vectorizer.vocabulary_)

Output

{'souvenir': 14,
 'shop': 13,
 'architecture': 1,
 'and': 0,
 'art': 2,
 'culture': 5,
 'history': 10,
 'resort': 11,
 'diverse': 6,
 'cuisine': 4,
 'fishing': 7,
 'folk': 8,
 'games': 9,
 'beautiful': 3,
 'scenery': 12}
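
Note that the numbers in vocabulary_ are column indices, not counts or weights: each term maps to its column in the TF-IDF matrix. As a small sketch, the snippet below uses a literal copy of the dictionary printed above (rather than a fitted vectorizer) and lists the terms in column order; the variable names are only illustrative.

```python
# Literal copy of the vocabulary_ output shown above (term -> column index).
vocabulary = {
    'souvenir': 14, 'shop': 13, 'architecture': 1, 'and': 0, 'art': 2,
    'culture': 5, 'history': 10, 'resort': 11, 'diverse': 6, 'cuisine': 4,
    'fishing': 7, 'folk': 8, 'games': 9, 'beautiful': 3, 'scenery': 12,
}

# Sort the terms by their column index to see which term each column holds.
terms_by_column = sorted(vocabulary, key=vocabulary.get)

for index, term in enumerate(terms_by_column):
    print(index, term)
```

The resulting order is alphabetical, which is the same order used by idf_ and by get_feature_names_out().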

You can get the IDF weights, one per term and ordered by the column indices above, with print(vectorizer.idf_)

Output

array([1.69314718, 1.69314718, 1.69314718, 1.28768207, 1.28768207,
       1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.69314718,
       1.69314718, 1.28768207, 1.28768207, 1.28768207, 1.28768207])
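
To see where these numbers come from, here is a minimal sketch that recomputes them by hand with the standard library only. It assumes the default settings of TfidfVectorizer (lowercasing, tokens of two or more word characters, smooth_idf=True), under which the weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n the number of documents and df(t) the number of documents containing term t. The variable names are illustrative.

```python
import math
import re

data = [
    'Souvenir shop|Architecture and art|Culture and history',
    'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery',
    'Diverse cuisine|Resort|Beautiful scenery',
]

# Assumed to mirror the default tokenization: lowercase the text, then keep
# tokens made of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in data]
n_docs = len(docs)

# Alphabetical vocabulary, the same order as the idf_ array.
vocab = sorted({token for doc in docs for token in doc})

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed IDF (assumes smooth_idf=True, the default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

for term in vocab:
    print(f"{term:<13}{idf[term]:.8f}")
```

Terms that occur in one document get ln(4/2) + 1 = 1.69314718 and terms that occur in two get ln(4/3) + 1 = 1.28768207, matching the array above. If your manual calculation used the textbook form log(n / df) without the added ones, that is a likely source of the difference.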

For your case you can do this (with pandas):

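import pandas as pd

# Label each IDF weight with the term it belongs to.
df_idf = pd.DataFrame(
    vectorizer.idf_, index=vectorizer.get_feature_names_out(), columns=["idf_weights"]
)

display(df_idf)  # display() works in a notebook; use print(df_idf) in a plain script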

Output

             idf_weights
and          1.693147
architecture 1.693147
art          1.693147
beautiful    1.287682
cuisine      1.287682
culture      1.693147
diverse      1.287682
fishing      1.693147
folk         1.693147
games        1.693147
history      1.693147
resort       1.287682
scenery      1.287682
shop         1.287682
souvenir     1.287682
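
The IDF weights are only half of the calculation. As a sketch, again assuming the default settings (raw term counts as TF, smooth_idf=True, norm='l2'), each entry of the final matrix is the term count multiplied by its IDF weight, and each document row is then divided by its Euclidean length. The helper tfidf_row below is a hypothetical name used for illustration, not part of scikit-learn.

```python
import math
import re
from collections import Counter

data = [
    'Souvenir shop|Architecture and art|Culture and history',
    'Souvenir shop|Resort|Diverse cuisine|Fishing|Folk games|Beautiful scenery',
    'Diverse cuisine|Resort|Beautiful scenery',
]

# Same assumed tokenization and smoothed IDF as the scikit-learn defaults.
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in data]
n_docs = len(docs)
vocab = sorted({token for doc in docs for token in doc})
idf = {
    term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1
    for term in vocab
}


def tfidf_row(tokens):
    """Hypothetical helper: TF-IDF weights for one document, L2-normalized."""
    counts = Counter(tokens)
    # Raw weight = term count in this document * IDF of the term.
    raw = {term: count * idf[term] for term, count in counts.items()}
    # Divide by the Euclidean length so the row has norm 1 (norm='l2').
    length = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / length for term, weight in raw.items()}


rows = [tfidf_row(doc) for doc in docs]

# Print the non-zero weights of the third document.
for term, weight in sorted(rows[2].items()):
    print(f"{term:<13}{weight:.6f}")
```

In the third document all five terms occur once and share the same IDF, so after normalization each weight is 1/sqrt(5), about 0.447214. You can compare the rows against tfidf_matrix.toarray() from the question to check that your manual calculation matches.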