I have read many blogs but was not satisfied with the answers. Suppose I train a tf-idf model on a few documents, for example:
" John like horror movie."
" Ryan watches dramatic movies"
------------so on ----------
I use this code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_counts.todense())
# Gives the count of each word in each document
But it doesn't tell me which word each column stands for. How do I get the words as headers in the X_train_counts
output, and similarly in X_train_tfidf?
So the X_train_tfidf output will be a matrix of tf-idf scores, one row per document and one column per word:
Horror watch movie drama
doc1 score1 -- -----------
doc2 ------------------------
Is this correct?
What does fit do, and what does transform do?
In the sklearn documentation it is mentioned that we use the:
fit(..) method to fit our estimator to the data and secondly the transform(..) method to transform our count-matrix to a tf-idf representation.
What does "fit our estimator to the data" mean?
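As far as I understand it, fit learns something from the data and stores it on the object, while transform only applies what was learned. A small sketch of that split, again using the two example sentences as assumed training data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["John like horror movie.", "Ryan watches dramatic movies"]

count_vect = CountVectorizer()
count_vect.fit(docs)                  # learns the word -> column index mapping
vocab = count_vect.vocabulary_        # the learned state lives in this attribute
counts = count_vect.transform(docs)   # applies the mapping, learns nothing new

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(counts)         # learns one idf weight per column
idf = tfidf_transformer.idf_
tfidf = tfidf_transformer.transform(counts)  # rescales the counts with those weights

# fit_transform gives the same result as fit followed by transform on the same data.
same_counts = CountVectorizer().fit_transform(docs)

print(vocab)
print(idf)
```

So "fitting the estimator to the data" here means computing the vocabulary (for CountVectorizer) and the idf weights (for TfidfTransformer) from the training documents.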
Now suppose a new test document comes in:
" Ron likes thriller movies"
How do I convert this document to tf-idf? Or can't it be converted at all?
How should I handle the word thriller, which does not appear in any training document?
Taking the two training texts as input:
Now, to test it on a new comment, we need to use the transform function (not fit_transform); words that are out of vocabulary are ignored while vectorizing it.
If you want to use a specific vocabulary, then prepare a list of the words you want to use, keep appending new words to this list, and pass the list to CountVectorizer through its vocabulary parameter.