gridsearchcv with tfidf and count vectorizer

6k views Asked by At

I want to use GridSearchCV for parameter tuning. Is it also possible to check with GridSearchCV whether CountVectorizer or TfidfVectorizer works best? My idea:

pipeline = Pipeline([
           ('vect', TfidfVectorizer()),
           ('clf', SGDClassifier()),
])
parameters = {
'vect__max_df': (0.5, 0.75, 1.0),
'vect__max_features': (None, 5000, 10000, 50000),
'vect__ngram_range': ((1, 1), (1, 2), (1,3),  
'tfidf__use_idf': (True, False),
'tfidf__norm': ('l1', 'l2', None),
'clf__max_iter': (20,),
'clf__alpha': (0.00001, 0.000001),
'clf__penalty': ('l2', 'elasticnet'),
'clf__max_iter': (10, 50, 80),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)

My idea: CountVectorizer is the same as TfidfVectorizer with use_idf=False and normalize=None. If GridSearchCV gives this as the best result those parameters, then CountVectorizer is the best option. Is that correct?

Thank you in advance :)

1

There are 1 answers

4
yatu On BEST ANSWER

Once you've included a given step with its corresponding name in the Pipeline, you can access it from the parameter grid and add other parameters, or vectorizers in this case, in the grid. You can also have a list of grids in a single pipeline:

from sklearn.feature_extraction.text import CountVectorizer

pipeline = Pipeline([
           ('vect', TfidfVectorizer()),
           ('clf', SGDClassifier()),
])
parameters = [{
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2), (1,3),)  
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2', None),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80)
},{
    'vect': (CountVectorizer(),)
    # count_vect_params...
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80)
}]

grid_search = GridSearchCV(pipeline, parameters)