I want to use GridSearchCV for parameter tuning. Is it also possible to check with GridSearchCV whether CountVectorizer or TfidfVectorizer works best? My idea:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    # use_idf and norm belong to the 'vect' step (TfidfVectorizer)
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2', None),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
My reasoning: CountVectorizer is the same as TfidfVectorizer with use_idf=False and norm=None. If GridSearchCV returns those parameters as the best result, then CountVectorizer is the best option. Is that correct?
Thank you in advance :)
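The equivalence asked about above can be checked directly on a small example. This is only a sanity check under stated assumptions: the three-sentence corpus `docs` is made up for illustration, and both vectorizers are otherwise left at their defaults (in particular `sublinear_tf=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented purely for this check.
docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]

# Raw term counts.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF with the IDF weighting and the normalization both switched off.
plain_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs)

# Compare values only; the two matrices differ in dtype (int vs. float).
same_matrix = np.array_equal(counts.toarray(), plain_tf.toarray())
print(same_matrix)
```

If the two matrices match, a best result of `use_idf=False` with `norm=None` can reasonably be read as "plain counts work best" for this grid.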
Once a step is included in the Pipeline under a given name, you can reference that name in the parameter grid, both to tune the step's parameters and to swap in other estimators (vectorizers, in this case). You can also pass a list of grids for a single pipeline:
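A minimal sketch of such a list of grids, assuming a tiny toy corpus and labels invented here so the example runs on its own. The names `docs`, `labels` and `shared` are illustrative, and the parameter values are deliberately few to keep the search small; they are not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy two-class corpus, invented for illustration only.
docs = [
    "the cat sat on the mat", "dogs chase the cat", "cats and dogs are pets",
    "my dog likes the park", "the kitten naps all day", "puppies play in the yard",
    "stocks fell sharply today", "the market rallied on earnings",
    "investors sold their shares", "bond yields rose again",
    "the bank raised interest rates", "traders expect more volatility",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),  # placeholder; each grid below replaces this step
    ("clf", SGDClassifier(random_state=0)),
])

# Parameters that both vectorizers accept.
shared = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-5, 1e-4],
}

# One grid per vectorizer, so TF-IDF-only options are never paired
# with CountVectorizer (which does not have them).
param_grid = [
    {"vect": [CountVectorizer()], **shared},
    {
        "vect": [TfidfVectorizer()],
        "vect__use_idf": [True, False],
        "vect__norm": ["l1", "l2"],
        **shared,
    },
]

grid_search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=1)
grid_search.fit(docs, labels)

# The winning vectorizer is stored as an estimator object under the step name.
best_vectorizer = type(grid_search.best_params_["vect"]).__name__
print(best_vectorizer, grid_search.best_score_)
```

Splitting the search into two dictionaries keeps each vectorizer with only the parameters it supports, and `best_params_["vect"]` then tells you which vectorizer won.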