sklearn & pytorch: Train test split for neural net training in pipeline for a grid search


I am working on a fairly large dataset, which we decided to split with GroupKFold (the dataset contains measurements that must not be split apart, so they are kept together while the data is divided into k folds).
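To illustrate what the grouped folding does, here is a minimal sketch on toy data. The arrays and the four equally sized groups are made up for the example; they are not the real dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (assumption): 12 samples that belong to 4 measurement groups.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

cv = GroupKFold(n_splits=4)
folds = []
for train_idx, test_idx in cv.split(X, y, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    folds.append((train_groups, test_groups))
    # A group never appears on both sides of a split.
    print(sorted(train_groups), "->", sorted(test_groups))
```

Each split keeps whole groups together, so a measurement is either entirely in the training part or entirely in the held-out part.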

We then run a hyperparameter search over sklearn models on the GroupKFold splits, using either RandomizedSearchCV or BayesSearchCV. To use neural nets in this pipeline, we decided to wrap PyTorch in the sklearn estimator interface. For that we used from sklearn.base import BaseEstimator, ClassifierMixin.

Then we set up a pipeline:

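import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GroupKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from skopt import BayesSearchCV

class Neural_Net_Interface(ClassifierMixin, BaseEstimator):
    def __init__(self, X_test, y_test, Max_num_epochs, Early_Stopping, and so on...):
        self.....

    def fit(self, X_train, y_train):
        ...

    def predict(self, X):
        ...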

pipeline_nn = Pipeline([('std', StandardScaler()),
                        ('splitter', train_test_split(X, y, test_size=0.2, random_state=69)),
                        ('nn', Neural_Net_Interface(X_test=X,
                                                    y_test=y,
                                                    Max_num_epochs=3,
                                                    Early_Stopping=True,
                                                    ... (20 more parameters))])

cv_object = GroupKFold(n_splits=np.max(group_vector) + 1)

model_grid_cv = BayesSearchCV(estimator=pipeline_nn,
                              search_spaces=search_space,
                              scoring=my_scorer,
                              optimizer_kwargs={'base_estimator': 'NN', 'n_initial_points': 20},
                              cv=cv_object,
                              n_jobs=N_JOBS,
                              verbose=100,
                              n_iter=N_ITER,
                              n_points=N_POINTS,
                              iid=False,
                              random_state=69)

model_grid_cv.fit(X, y, groups=groups)

And here comes the problem: as you can see above, Neural_Net_Interface (the sklearn classifier) expects a test X and y as input. This is because we need to evaluate the NN accuracy after each training epoch. I can't train/test split the dataset once at the beginning, as this would defeat the purpose of the k-fold. So what I am trying to do is define the pipeline so that the output of the train test split is passed to the neural net interface. This is not working.
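A likely reason the 'splitter' step fails, shown as a small sketch: train_test_split is called immediately when the pipeline is built and returns a plain list of arrays, whereas Pipeline expects every intermediate step to be a transformer with fit and transform. The toy data and the LogisticRegression final step are stand-ins chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption), only to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.array([0, 1] * 20)

# train_test_split runs right here and returns a list of four arrays.
split_output = train_test_split(X, y, test_size=0.2, random_state=0)
looks_like_transformer = (hasattr(split_output, "fit")
                          and hasattr(split_output, "transform"))

error = None
try:
    pipe = Pipeline([("std", StandardScaler()),
                     ("splitter", split_output),   # a list, not a transformer
                     ("clf", LogisticRegression())])
    pipe.fit(X, y)
except TypeError as exc:
    # Pipeline rejects intermediate steps that lack fit/transform.
    error = exc

print(type(split_output).__name__, len(split_output), looks_like_transformer)
print(error)
```

So the split cannot live in the pipeline as a step. It has to happen either inside the estimator's fit or in the cross-validation object.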

My real question is:

- GroupKFold folds 4 groups 4 times, taking 3 parts for training and 1 part for score estimation. How can I adjust the pipeline so that the 1 held-out part of the k-fold is passed to Neural_Net_Interface and used for the NN evaluation? Do I need to change Neural_Net_Interface so that it does not take a test set?
- Or is that not possible, so that I need to train/test split the data inside the grid search and always pass one part to Neural_Net_Interface? How do I get that working?
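One possible direction for the first option, as a minimal sketch rather than a definitive answer: the estimator takes no test set in __init__ and instead holds out its own validation part inside fit, from the training folds it receives. The class name InnerSplitClassifier, its parameters, and the toy data are hypothetical. SGDClassifier.partial_fit stands in for one PyTorch training epoch so the sketch runs without torch. Note that this inner split is a plain stratified split and does not keep groups together, and the early stopping here does not restore the best epoch's weights.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GroupKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class InnerSplitClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper that creates its validation set inside fit()."""

    def __init__(self, max_num_epochs=20, patience=3, val_size=0.2, random_state=0):
        self.max_num_epochs = max_num_epochs
        self.patience = patience
        self.val_size = val_size
        self.random_state = random_state

    def fit(self, X, y):
        # X, y are only the training folds handed over by the outer CV.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=self.val_size,
            random_state=self.random_state, stratify=y)
        self.classes_ = np.unique(y)
        # Stand-in for the PyTorch model (assumption).
        self.model_ = SGDClassifier(random_state=self.random_state)
        self.val_history_ = []
        best_acc, epochs_without_gain = -np.inf, 0
        for _ in range(self.max_num_epochs):
            # One pass over the training part plays the role of one epoch.
            self.model_.partial_fit(X_tr, y_tr, classes=self.classes_)
            acc = self.model_.score(X_val, y_val)
            self.val_history_.append(acc)
            if acc > best_acc:
                best_acc, epochs_without_gain = acc, 0
            else:
                epochs_without_gain += 1
                if epochs_without_gain >= self.patience:
                    break  # early stopping on the inner validation part
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Toy data (assumption): 200 samples in 4 groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
groups = np.repeat(np.arange(4), 50)

pipeline_nn = Pipeline([("std", StandardScaler()),
                        ("nn", InnerSplitClassifier())])
scores = cross_val_score(pipeline_nn, X, y, groups=groups,
                         cv=GroupKFold(n_splits=4))
print(scores)
```

With this layout the held-out fold of the outer GroupKFold is used only for scoring, and the scaler is fitted on the training folds only. The same pipeline could then be handed to the search object in place of the one with X_test and y_test.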

I hope I described my question clearly enough. Thanks in advance for your help!

Best regards

There are 0 answers