I am working on a basic text classification problem. I want to use a stacking classifier, along with some fine-tuning of the parameters of my base classifiers, to get high-accuracy results.
My dataset has 8000 rows and 2 columns (text and class). The piece of code below seems to be stuck, and as a beginner I am not well versed enough in the field to spot the problem.
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import NuSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, log_loss, classification_report, confusion_matrix
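# NOTE: the snippet as posted never defines X_train / X_test; the sketch below is
# a hypothetical setup (the file name 'data.csv' and the column names are
# assumptions) so the rest of the code can run. The raw text has to be
# vectorized (e.g. TF-IDF) before it is fed to the classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('data.csv')  # assumed file with 'text' and 'class' columns
X_text, y = df['text'], df['class']
X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)  # fit the vocabulary on the training split only
X_test = vectorizer.transform(X_test_text)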
# Define parameter grids for classifiers
param_grid_nusvc = {
'nu': [0.1, 0.3, 0.5, 0.7, 0.9],
'kernel': ['linear', 'rbf'],
}
param_grid_logreg = {
'C': [0.1, 1, 10],
'penalty': ['l1', 'l2'],
}
# Perform grid search for the base classifiers
nusvc_grid_search = GridSearchCV(NuSVC(probability=True), param_grid_nusvc, cv=2, scoring='accuracy')
# liblinear supports both the 'l1' and 'l2' penalties in the grid; the default lbfgs solver would fail on 'l1'
logreg_grid_search = GridSearchCV(LogisticRegression(solver='liblinear', max_iter=1000), param_grid_logreg, cv=2, scoring='accuracy')
nusvc_grid_search.fit(X_train, y_train)
logreg_grid_search.fit(X_train, y_train)
# Get best parameters
best_params_nusvc = nusvc_grid_search.best_params_
best_params_logreg = logreg_grid_search.best_params_
# Set up base classifiers with best parameters
best_nusvc = NuSVC(probability=True, **best_params_nusvc)
best_logreg = LogisticRegression(solver='liblinear', max_iter=1000, **best_params_logreg)
# Setting up stacking classifier
sc = StackingClassifier(
estimators=[
('NuSVC', best_nusvc),
('LDA', LinearDiscriminantAnalysis())
],
final_estimator=best_logreg
)
sc.fit(X_train, y_train)
# Evaluate the stacked classifier on the held-out test set
print('****Results****')
test_predictions = sc.predict(X_test)
acc = accuracy_score(y_test, test_predictions)
print("Accuracy: {:.4%}".format(acc))
test_predictions_proba = sc.predict_proba(X_test)
ll = log_loss(y_test, test_predictions_proba)
print("Log Loss: {}".format(ll))
# Print classification report (optional)
print('\nClassification Report:')
print(classification_report(y_test, test_predictions))
# Print confusion matrix (optional)
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, test_predictions))
Some of the changes above were made based on advice from ChatGPT on how to fine-tune using grid search. The code seems to be stuck (it has been running for about 20 minutes). Without the grid search it ran in around 2-3 minutes.
Your NuSVC grid has 5×2 parameter combinations, each fitted over 2 folds, so the search alone performs about 20 fits and should take roughly 20× as long as a single fit. You can set verbose=4 in the searches to better track what's happening, and consider parallelizing (n_jobs=-1, for example).
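For example (a minimal sketch reusing the grids from the question; the verbose level and n_jobs value are just illustrative), the two searches could be configured like this to log progress for each fit and spread the work across all CPU cores:
nusvc_grid_search = GridSearchCV(
    NuSVC(probability=True), param_grid_nusvc,
    cv=2, scoring='accuracy',
    n_jobs=-1,   # run the candidate/fold fits in parallel on all cores
    verbose=4)   # print each candidate's parameters, fit time and score
logreg_grid_search = GridSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000), param_grid_logreg,
    cv=2, scoring='accuracy',
    n_jobs=-1, verbose=4)
With the progress output you can tell whether the search is genuinely stuck or just slowly working through the 20 NuSVC fits (probability=True makes each fit noticeably more expensive).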