When working with text data, I understand the need to encode text labels into some numeric representation (e.g., by using LabelEncoder, OneHotEncoder, etc.).
However, my question is whether you need to perform this step explicitly when you're using a feature extraction class (e.g., TfidfVectorizer, CountVectorizer, etc.), or whether these will encode the labels under the hood for you?
If you do need to encode the labels separately yourself, are you able to perform this step within a Pipeline such as the one below?
pipeline = Pipeline(steps=[
('tfidf', TfidfVectorizer()),
('sgd', SGDClassifier())
])
Or do you need to encode the labels beforehand, since the pipeline expects to fit() and transform() the data (not the labels)?
Have a look at the scikit-learn glossary entry for the term transform: it notes that a transformer transforms the input, usually only X. In fact, almost all transformers only transform the features. This holds true for TfidfVectorizer and CountVectorizer as well. If ever in doubt, you can always check the return type of the transforming function (like the fit_transform method of CountVectorizer): it is a feature matrix, with no label information involved.
The same goes when you assemble several transformers in a pipeline. As its user guide explains, calling fit on the pipeline fits and transforms the data with each transformer in turn before fitting the final estimator; the target y is passed along to each step's fit method but is never transformed.
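For instance, here is a quick check (using a made-up two-document corpus) showing that CountVectorizer.fit_transform returns a sparse document-term matrix of features only:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]  # toy corpus for illustration
X = CountVectorizer().fit_transform(docs)

# The result is a scipy.sparse CSR matrix of shape (n_samples, n_features);
# no labels are involved anywhere in the transformation.
print(type(X))
print(X.shape)  # (2, 5)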
So, in conclusion, you typically handle the labels separately, encoding them before you fit the estimator/pipeline.
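To make that concrete, here is a minimal sketch (the documents and labels are made up for illustration): the labels are encoded with LabelEncoder up front, and the pipeline only ever fits and transforms the raw documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

docs = ["free prize inside", "meeting at noon"]  # toy data
labels = ["spam", "ham"]

# Encode the string labels beforehand; the pipeline never touches them.
le = LabelEncoder()
y = le.fit_transform(labels)  # array([1, 0])

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('sgd', SGDClassifier())
])
pipeline.fit(docs, y)  # y is passed straight through to SGDClassifier

# Map integer predictions back to the original string labels.
pred = pipeline.predict(["free meeting"])
print(le.inverse_transform(pred))

(As an aside, scikit-learn classifiers such as SGDClassifier can also fit string targets directly; encoding y explicitly mainly keeps the class-to-integer mapping under your control.)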