sklearn TfidfVectorizer doesn't eliminate common words


I analyse a corpus of lines:

corpus = ['rabbit rabbit fish fish fish fish fish',
          'turtle rabbit fish fish fish fish fish',
          'raccoon raccoon raccoon fish fish fish fish fish']

To calculate TF-IDF I run the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.2
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)

The result is:

(screenshot of the resulting tf-idf DataFrame)

Why does the result have such a big value for "fish"? It is a common word, and according to TF-IDF it should be zero, since every document contains it.

1 Answer

James_SO

If you establish "fish" as a commonly used word, but then use "fish" five times in a seven-word sentence, TF-IDF is still going to give you a high score for fish. That's because TF-IDF is Term Frequency multiplied by Inverse Document Frequency.
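As a side note, scikit-learn's defaults never push a term's idf all the way to zero: with smooth_idf=True (the default), idf = ln((1 + n_docs) / (1 + df)) + 1, so a word that appears in every single document still gets an idf of 1. A quick check, assuming those defaults:

import numpy as np

# sklearn's default smoothed idf: ln((1 + n_docs) / (1 + df)) + 1
n_docs, df_fish = 3, 3      # "fish" appears in all three documents
idf_fish = np.log((1 + n_docs) / (1 + df_fish)) + 1
print(idf_fish)             # 1.0 - scaled down relative to rarer words, but not zero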

Just change up your last sentence like this and rerun:

corpus = ['rabbit rabbit fish fish fish fish fish',
          'turtle rabbit fish fish fish fish fish',
          'raccoon raccoon turtle raccoon fish']

Now you'll see what TF-IDF is doing for you - in the last sentence it gives fish a lower value than turtle even though they are both used once, because fish is a more commonly used word in the corpus.

    fish        rabbit      raccoon     turtle
0   0.889003    0.457901    0.000000    0.000000
1   0.939620    0.241986    0.000000    0.241986
2   0.187453    0.000000    0.952154    0.241379
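You can also see the per-term idf values the fitted vectorizer is using by inspecting its idf_ attribute (in newer scikit-learn versions get_feature_names() has become get_feature_names_out()); a quick sketch, assuming the default settings:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['rabbit rabbit fish fish fish fish fish',
          'turtle rabbit fish fish fish fish fish',
          'raccoon raccoon turtle raccoon fish']

vectorizer = TfidfVectorizer()      # defaults: smooth_idf=True, norm='l2'
vectorizer.fit(corpus)
# idf_ holds the idf value the vectorizer computed for each term
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))
# fish ~ 1.0, rabbit/turtle ~ 1.29, raccoon ~ 1.69

"fish" appears in every document, so it gets the smallest possible idf of 1.0, while raccoon, which only shows up in one document, gets the largest.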

This is a bit messy, but here's the process using basic math:

Tokenizing

from nltk.tokenize import TreebankWordTokenizer
from collections import Counter, OrderedDict
import copy
import pandas as pd
corpus = ['rabbit rabbit fish fish fish fish fish',
          'turtle rabbit fish fish fish fish fish',
          'raccoon raccoon turtle raccoon fish']

oneBigText = " ".join(corpus)
#Tokenize everything as one big text (these corpus-wide counts aren't used again below)
thisTokenizer = TreebankWordTokenizer()
theseTokens = thisTokenizer.tokenize(oneBigText.lower())
tokenCounts = Counter(theseTokens)

theseTokens = []
for text in corpus:
    theseTokens += [sorted(thisTokenizer.tokenize(text.lower()))]
allTokens = sum(theseTokens, [])   
#tokenCounts = Counter(theseTokens)
thisLex = sorted(set(allTokens))    
#initialize the vector to be used
zeroVector = OrderedDict((token,0) for token in thisLex)

Now calculate TF-IDF - remember the +1 for Laplace smoothing

try:
    document_tfidf_vectors = []
    for doc in corpus:
        thisVector = copy.copy(zeroVector)
        #tokenize the doc and count the tokens
        theseTokens = thisTokenizer.tokenize(doc.lower())
        tokenCounts = Counter(theseTokens)
        #for each term in the vocab of the doc
        for k,v in tokenCounts.items():
            docsContainingKey = 0
            #go through all docs
            for _doc in corpus:
                if k in _doc:  #substring check on the raw doc string - fine for this toy corpus
                    docsContainingKey += 1
            #how frequent the term is, relative to the size of the lexicon
            tf = v/len(thisLex)
            if docsContainingKey:
                #inverse document frequency - number of docs vs. docs containing the key
                idf = len(corpus) / (1+docsContainingKey)
            else:
                idf = 0
            thisVector[k] = tf * idf
        document_tfidf_vectors.append(thisVector)
except Exception as e:
    print("Clearly I screwed up something:",e)
    
print(pd.DataFrame.from_dict(document_tfidf_vectors))

     fish  rabbit  raccoon  turtle
0  0.9375    0.50    0.000    0.00
1  0.9375    0.25    0.000    0.25
2  0.1875    0.00    1.125    0.25


This doesn't arrive at exactly the same numbers, but maybe I missed a bit of the secret sauce that TfidfVectorizer has - there are lots of adjustments that can be made to the basic calculation.
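For reference, the adjustments that account for the difference are mostly three defaults: TfidfVectorizer uses the raw count as the term frequency (not the count divided by the lexicon size), a smoothed idf of ln((1 + n_docs) / (1 + df)) + 1, and an L2 normalization of every row. A rough sketch that mirrors those defaults:

import numpy as np
import pandas as pd
from collections import Counter

corpus = ['rabbit rabbit fish fish fish fish fish',
          'turtle rabbit fish fish fish fish fish',
          'raccoon raccoon turtle raccoon fish']

docs = [doc.split() for doc in corpus]
vocab = sorted(set(w for doc in docs for w in doc))
n_docs = len(docs)

# document frequency and smoothed idf for each term in the vocabulary
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: np.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

rows = []
for doc in docs:
    counts = Counter(doc)
    raw = np.array([counts[t] * idf[t] for t in vocab])  # tf is the raw count
    rows.append(raw / np.linalg.norm(raw))               # L2-normalize the row
print(pd.DataFrame(rows, columns=vocab).round(6))

That should line up with the TfidfVectorizer table above (0.889003 for fish in the first document, and so on).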