**Split into Train and Test Datasets**

```python
X_train, X_test, y_train, y_test = train_test_split(
    X_pre, y, random_state=0, stratify=y, train_size=training_fraction)
```

**Apply SMOTE or Some Other Balancing Algorithm**

```python
X_imputed_train_df, y_train = balancing_algorithm.fit_resample(X_imputed_train_df, y_train)
```

**Apply Sequential Feature Selection & Hyperparameter Tuning**

```python
lr1 = LogisticRegression(random_state=42, max_iter=250)
lr2 = LogisticRegression(random_state=42, max_iter=250)

sss = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=42)

sfsLR = SFS(estimator=lr1,
            k_features='best',
            forward=boolean_sfs,
            floating=False,
            scoring='f1',
            cv=sss)

pipe_lr = Pipeline([('lr2', lr2)])

# Note: 'lbfgs' does not support the 'l1' penalty, so the incompatible
# combinations are kept in separate grid entries.
lr_param_grid = [{'lr2__penalty': ['l1', 'l2'],
                  'lr2__C': param_range_fl,
                  'lr2__solver': ['liblinear']},
                 {'lr2__penalty': ['l2'],
                  'lr2__C': param_range_fl,
                  'lr2__solver': ['lbfgs']}]

lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,
                              scoring='f1',
                              cv=sss)

grid_dict = {0: 'Logistic Regression'}
grids = [lr_grid_search]
SFSList = [sfsLR]

j = 0
for pipe, sfs in zip(grids, SFSList):
    # Fit the SFS to the amplified training data (the one with ADASYN/SMOTE
    # samples). This also updates the SFSList items in place.
    sfs = sfs.fit(X_ADASYN3, labels6)
    # Get the selected feature indices
    selected_feature_indices = list(sfs.k_feature_idx_)
    # Create a DataFrame with only the selected features
    sfsFinal = X_ADASYN3.iloc[:, selected_feature_indices]
    # Keep the original feature names for the selected columns
    sfsFinal.columns = X_ADASYN3.columns[selected_feature_indices]
    # Fit each grid search on the data restricted to the best features
    pipe.fit(sfsFinal, labels6)
    print("I just finished calculating: " + grid_dict[j])
    j += 1
```
Thank you very much.
I am training this model, and it seems like I am applying SMOTE to the entire training dataset up front, but I should be applying it only to the training split of each fold during cross-validation, both for the Sequential Feature Selection and for the GridSearchCV. This is a must, correct? How can I modify my code to do so?