When working with text data, I understand the need to encode text labels into some numeric representation (e.g., by using LabelEncoder, OneHotEncoder, etc.). However, my question is whether you need to perform this step explicitly when you're using a feature extraction class (e.g., TfidfVectorizer, CountVectorizer, etc.), or whether these will encode the labels under the hood for you.
If you do need to encode the labels separately yourself, are you able to perform this step in a Pipeline, such as the one below?
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),  # feature extraction on the text
    ('sgd', SGDClassifier())       # final classifier
])
Or do you need to encode the labels beforehand, since the pipeline expects to fit() and transform() the data (not the labels)?
Have a look at the scikit-learn glossary entry for the term transform: transforming the input usually means transforming only X. In fact, almost all transformers only transform the features. This holds true for TfidfVectorizer and CountVectorizer as well. If ever in doubt, you can always check the return type of the transforming function (like the fit_transform method of CountVectorizer).
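For instance, a quick check (a minimal sketch; the two toy documents below are made up) shows that CountVectorizer.fit_transform accepts only the raw documents and returns a sparse document-term matrix, with no labels involved:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog barked"]  # made-up toy documents
vectorizer = CountVectorizer()

# fit_transform consumes only the documents (X) and returns a sparse
# document-term matrix; the labels never enter the picture.
X = vectorizer.fit_transform(corpus)
print(type(X))   # a scipy CSR sparse matrix
print(X.shape)   # (2, 5): 2 documents, 5 vocabulary terms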
The same goes when you assemble several transformers in a Pipeline. As its user guide explains, the pipeline sequentially applies the transformers to the data and then fits the final estimator on the transformed data; the labels are never transformed along the way.
So in conclusion, you typically handle the labels separately and before you fit the estimator/pipeline.
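To make that concrete, here is a minimal sketch (the documents and string labels are made up for illustration): encode y separately with LabelEncoder, then fit the pipeline on the raw text and the encoded labels:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder

# Made-up training data for illustration.
docs = ["good movie", "terrible plot", "great acting", "boring film"]
labels = ["pos", "neg", "pos", "neg"]

# Encode the labels separately, before fitting the pipeline.
le = LabelEncoder()
y = le.fit_transform(labels)  # e.g. array([1, 0, 1, 0])

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),  # transforms the documents (X) only
    ('sgd', SGDClassifier())       # receives the untouched y
])
pipeline.fit(docs, y)

# Map predictions back to the original string labels.
pred = pipeline.predict(["awful movie"])
print(le.inverse_transform(pred))

Worth noting: scikit-learn classifiers can usually fit on string labels directly, so the explicit LabelEncoder step mainly keeps the encoding visible and reversible.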