Custom classifier won't accept data from train_test_split in sklearn

I am attempting to write a custom classifier for use in a sklearn GridSearchCV pipeline.

I've stripped the class back to the bare minimum; it currently looks like this:

from sklearn.base import BaseEstimator, ClassifierMixin
import pandas as pd

class DifferentialMethylation(BaseEstimator, ClassifierMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return self

In my main code, I have this:

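    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline

    X_train, X_test, y_train, y_test = train_test_split(df, cancerType, test_size=0.2, random_state=42)

    differentialMethylation = DifferentialMethylation()
    # Name must match the 'featureSelection' variable used in the pipeline below
    featureSelection = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
    randomForest = RandomForestClassifier(random_state=42)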

    # Create the pipeline with feature selection and model refinement
    pipeline = Pipeline([
        ('differentialMethylation', differentialMethylation),
        ('featureSelection', featureSelection),
        ('modelRefinement', randomForest)
    ])

    search = GridSearchCV(pipeline,
                        param_grid=parameterGrid,
                        scoring='accuracy',
                        cv=5,
                        verbose=0,
                        n_jobs=-1,
                        pre_dispatch='2*n_jobs')  

    search.fit(X_train, y_train)

If I remove the custom classifier from the pipeline, so that the pipeline looks like this:

    pipeline = Pipeline([
        ('featureSelection', featureSelection),
        ('modelRefinement', randomForest)
    ])

it runs happily. If I add that line back in, I get:

ValueError: Expected 2D array, got scalar array instead:
array=DifferentialMethylation().
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

X_train is a two-dimensional data frame: X_train.shape is (679, 369) and y_train.shape is (679,). As best I can tell, the stripped-back classifier's .fit() method should act as a pass-through, leaving the data unchanged. So I have no idea why the output of train_test_split is accepted by the RFE in the featureSelection step but not by the differentialMethylation step.

Unless there's some obscure piece of lore in the sklearn documentation about transforming input data for custom classifiers that I've missed.

Thoughts as to what's going on would be appreciated.

There is 1 answer

Ben

In the documentation, in the very obvious section:

Developer API for set_output

I see that the transform method returns X, not self.

So it wasn't that the custom classifier wouldn't accept the dataframe. When the pipeline fed the classifier's output through as input to the next stage, transform had returned self instead of the data, so the next step raised:

ValueError: Expected 2D array, got scalar array instead:
array=DifferentialMethylation().

In retrospect, I can see how that error makes sense, but it was not an easy path to understanding.
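
For anyone hitting the same error, here is a minimal sketch of a pass-through step whose transform returns X. The class name PassThroughTransformer and the synthetic data are made up for illustration, and swapping ClassifierMixin for TransformerMixin is my assumption (it seems the better fit for an intermediate pipeline step, and it supplies fit_transform); this is not the final DifferentialMethylation implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class PassThroughTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DifferentialMethylation: leaves X unchanged."""

    def fit(self, X, y=None):
        # Nothing to learn yet; fit returns self so calls can be chained.
        return self

    def transform(self, X):
        # Return the data (not self) so the next step receives a 2D array.
        return X


# Synthetic data, invented purely for this example.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 5))
y = rng.integers(0, 2, size=60)

pipeline = Pipeline([
    ("passThrough", PassThroughTransformer()),
    ("modelRefinement", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

With transform returning X, the downstream step sees the original (n_samples, n_features) array and the "Expected 2D array, got scalar array" error should no longer appear.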