I am analysing a corpus of lines:
corpus = ['rabbit rabbit fish fish fish fish fish',
          'turtle rabbit fish fish fish fish fish',
          'raccoon raccoon raccoon fish fish fish fish fish']
For the TF-IDF calculation I run the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() on older sklearn
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
The result is:
Why does the result have such a big value for "fish"? It is a common word, and according to TF-IDF it should be zero, since every document contains this word.
If you establish "fish" as a commonly used word, but then use "fish" five times in a seven-word sentence, TF-IDF is still going to give you a high score for fish. That's because TF-IDF is Term Frequency multiplied by Inverse Document Frequency. And note that the IDF never actually reaches zero here: TfidfVectorizer's default smooth_idf=True computes idf = ln((1 + n) / (1 + df)) + 1, so a word that appears in every document gets an idf of 1, not 0.
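To put rough numbers on it, here is a back-of-the-envelope check for "fish" (using the textbook tf = count / length; sklearn's defaults differ slightly, but the point stands):

import math

n_docs = 3
df_fish = 3                    # "fish" appears in every document
tf_fish = 5 / 7                # five occurrences in a seven-word sentence

# sklearn's default smoothed idf: ln((1 + n) / (1 + df)) + 1
idf_fish = math.log((1 + n_docs) / (1 + df_fish)) + 1   # = 1.0, not 0
print(tf_fish * idf_fish)                               # ~0.714, still a big score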
Just change up your last sentence like this and rerun:
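(The exact replacement isn't critical - any last sentence that uses "fish" once alongside a rarer word shows the effect. For example:)

corpus = ['rabbit rabbit fish fish fish fish fish',
          'turtle rabbit fish fish fish fish fish',
          'raccoon raccoon raccoon raccoon fish turtle']
vectors = vectorizer.fit_transform(corpus)
df = pd.DataFrame(vectors.todense().tolist(), columns=vectorizer.get_feature_names_out())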
Now you'll see what TF-IDF is doing for you - in the last sentence it gives "fish" a lower value than "turtle" even though both are used once, because "fish" is a more commonly used word in the corpus.
This is a bit messy, but here's the process using basic math.

Tokenizing:
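A minimal sketch, assuming simple whitespace splitting:

docs = [sentence.split() for sentence in corpus]
vocab = sorted(set(word for doc in docs for word in doc))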
Now calculate TF-IDF - remember the +1 for Laplace smoothing:
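A by-hand sketch, with tf = count / document length and idf = ln((1 + N) / (1 + df)) + 1 (the two +1s inside the log are the smoothing, plus the final +1 that sklearn also adds):

import math

n_docs = len(docs)
for doc in docs:
    scores = {}
    for word in vocab:
        tf = doc.count(word) / len(doc)               # term frequency
        df = sum(1 for d in docs if word in d)        # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
        scores[word] = tf * idf
    print(scores)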
This doesn't arrive at exactly the same numbers, but maybe I missed a bit of the secret sauce that TfidfVectorizer has - there are lots of adjustments that can be made to the basic calculation. In particular, TfidfVectorizer uses raw counts for the term frequency and then L2-normalizes each row (norm='l2' by default), which the by-hand version above doesn't do.