Sklearn ValueError: feature shape during training is different from feature shape during validation


I'm trying to use sklearn to build a custom Pipeline for a school project that uses ML to analyze text. I've added some logging to my custom transformers and am running into an issue that has kept me stuck for over a week:

ValueError: X has 7930 features, but SelectKBest is expecting 25050 features as input.
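
The error itself is easy to reproduce in isolation: it appears whenever the number of columns at transform time differs from what SelectKBest saw at fit time. Here is a standalone toy example with random data (not my actual features) that raises the same message:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Fit on a matrix with 25050 columns, like my training features
selector = SelectKBest(score_func=chi2, k=10000)
selector.fit(rng.integers(0, 5, size=(334, 25050)), rng.integers(0, 2, size=334))

# Transform a matrix with only 7930 columns, like my validation features
selector.transform(rng.integers(0, 5, size=(72, 7930)))
# ValueError: X has 7930 features, but SelectKBest is expecting 25050 features as input.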

Essentially, my process is the following (a sketch of the resulting pipeline follows the list):

  1. Gather features before preprocessing (when I need the words and punctuation to be present and unchanged for some feature extraction)

  2. Apply preprocessing

  3. Gather features after preprocessing
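
Concretely, the pipeline that comes out of my config is assembled roughly like this (reconstructed from the pipeline config that gets logged further down; Preprocessing and the feature extractors are my own classes, and the classifier is appended at the end):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest, chi2

# Preprocessing, FeatureExtractorBeforePreprocessing and
# FeatureExtractorAfterPreprocessing are custom classes from my project
pipeline = Pipeline([
    ("featureExtractionUnion", FeatureUnion([
        ("featureExtractionBeforePreprocessing", FeatureExtractorBeforePreprocessing()),
        ("afterPreprocessingPipeline", Pipeline([
            ("preprocessing", Preprocessing()),
            ("featureExtractionAfterPreprocessing", FeatureExtractorAfterPreprocessing(textWordCounter=True)),
        ])),
    ])),
    ("featureSelection", SelectKBest(score_func=chi2, k=10000)),
])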

Example transformers :

class FeatureExtractorBeforePreprocessing(BaseEstimator, TransformerMixin):
    """
    Extracts all features combined before preprocessing.
    """
    def __init__(self, stopWords=False, errorDetector=False, punctuationFrequency=False, sentenceLength=False):
        """
        Initialize the FeatureExtractorBeforePreprocessing.
        
        Args:
            stopWords (bool): Whether to include stop words as a feature.
            errorDetector (bool): Whether to include error detection as a feature.
            punctuationFrequency (bool): Whether to include punctuation frequency as a feature.
            sentenceLength (bool): Whether to include sentence length as a feature.
        """
        self.stopWords = stopWords
        self.errorDetector = errorDetector
        self.punctuationFrequency = punctuationFrequency
        self.sentenceLength = sentenceLength
        self.feature_union = []
        self.combined_transformers = None

        # Add transformers based on selected options
        if self.stopWords:
            self.feature_union.append(("stopWords", StopWords()))
        if self.errorDetector:
            self.feature_union.append(("errorDetector", ErrorDetector()))
        if self.punctuationFrequency:
            self.feature_union.append(("punctuationFrequency", PunctuationFrequency()))
        if self.sentenceLength:
            self.feature_union.append(("sentenceLength", SentenceLength()))
        if self.feature_union:
            self.combined_transformers = FeatureUnion(self.feature_union)
        
    def fit(self, X, y=None):
        # Fit the combined transformers if they exist
        if self.combined_transformers:
            self.combined_transformers.fit(X)
        return self

    def transform(self, X):
        if self.feature_union:
            logger.info("Extracting features before preprocessing..")
            combined_features = self.combined_transformers.transform(X)
            logger.info(f"Shape of combined features before preprocessing: {combined_features.shape}")
            return combined_features
        else:
            X_array = np.empty((len(X), 0))
            logger.info("Nothing to extract before preprocessing..")
            logger.info(f"Shape of combined features before preprocessing: {X_array.shape}")
            return X_array

class FeatureExtractorAfterPreprocessing(BaseEstimator, TransformerMixin):
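    """
    Extracts all features combined after preprocessing.
    """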
   
    def __init__(self, config=config, textWordCounter=False, wordLength=False, vocabularySize=False):
        self.config = config
        self.textWordCounter = textWordCounter
        self.wordLength = wordLength
        self.vocabularySize = vocabularySize
        self.feature_union = []
        
        if self.textWordCounter:
            self.feature_union.append(("textWordCounter", TextWordCounter(self.config.getboolean("TextWordCounter","freqDist"), self.config.getboolean("TextWordCounter", "bigrams"))))
        
        if self.vocabularySize:
            self.feature_union.append(("vocabularySize", VocabularySize()))
        
        self.combined_transformers = FeatureUnion(self.feature_union)
        
    def fit(self, X, y=None):
        if self.combined_transformers:
            self.combined_transformers.fit(X)
        return self

    def transform(self, X):
        if self.feature_union:
            logger.info("Extracting features after preprocessing..")
            combined_features = self.combined_transformers.transform(X)
            logger.info(f"Shape of combined features after preprocessing: {combined_features.shape}")
            return combined_features
        else:
            X_array = np.empty((len(X), 0))
            logger.info("Nothing to extract before preprocessing..")
            logger.info(f"Shape of combined features after preprocessing: {X_array.shape}")
            return X_array

Here is the logging output:

2024-03-21 20:31:55,868 [INFO] Creating custom pipeline...
2024-03-21 20:31:55,868 [INFO] pipeline config: {'stopWords': False, 'errorDetector': False, 'punctuationFrequency': False, 'sentenceLength': False, 'textWordCounter': True, 'wordLength': False, 'vocabularySize': False, 'featureSelector': SelectKBest(k=10000, score_func=<function chi2 at 0x000001ACCFDC4040>)}
2024-03-21 20:31:55,868 [INFO] Creating pipeline...
2024-03-21 20:31:55,870 [INFO] Pipeline config: [('featureExtractionUnion', FeatureUnion(transformer_list=[('featureExtractionBeforePreprocessing',
                                FeatureExtractorBeforePreprocessing()),
                               ('afterPreprocessingPipeline',
                                Pipeline(steps=[('preprocessing',
                                                 Preprocessing()),
                                                ('featureExtractionAfterPreprocessing',
                                                 FeatureExtractorAfterPreprocessing(textWordCounter=True))]))])), ('featureSelection', SelectKBest(k=10000, score_func=<function chi2 at 0x000001ACCFDC4040>))]
2024-03-21 20:31:55,870 [INFO] Pipeline created.
2024-03-21 20:31:55,870 [INFO] Selected classifier: svm
2024-03-21 20:31:55,870 [INFO] Custom pipeline created.
2024-03-21 20:31:55,870 [INFO] UserConfigPipeline initialized.
2024-03-21 20:31:55,873 [INFO] Training custom model...
2024-03-21 20:31:55,873 [INFO] Nothing to extract before preprocessing..
2024-03-21 20:31:55,873 [INFO] Shape of combined features before preprocessing: (334, 0)
2024-03-21 20:31:55,873 [INFO] Preprocessing..
2024-03-21 20:31:55,874 [INFO] No preprocessing applied..
2024-03-21 20:31:55,874 [INFO] Returned list X of 334 texts.
2024-03-21 20:31:55,874 [INFO] Extracting features after preprocessing..
2024-03-21 20:31:55,960 [INFO] Extracting freqDict features..
2024-03-21 20:31:55,990 [INFO] Shape of freqDict features: (334, 25050)
2024-03-21 20:31:55,991 [INFO] Shape of combined features after preprocessing: (334, 25050)
2024-03-21 20:31:56,166 [INFO] Successfully trained custom model...
2024-03-21 20:31:56,166 [INFO] Validating model...
2024-03-21 20:31:56,166 [INFO] Predicting evaluation set...
2024-03-21 20:31:56,166 [INFO] Nothing to extract before preprocessing..
2024-03-21 20:31:56,166 [INFO] Shape of combined features before preprocessing: (72, 0)
2024-03-21 20:31:56,166 [INFO] Preprocessing..
2024-03-21 20:31:56,166 [INFO] No preprocessing applied..
2024-03-21 20:31:56,166 [INFO] Returned list X of 72 texts.
2024-03-21 20:31:56,166 [INFO] Extracting features after preprocessing..
2024-03-21 20:31:56,185 [INFO] Extracting freqDict features..
2024-03-21 20:31:56,195 [INFO] Shape of freqDict features: (72, 7930)
2024-03-21 20:31:56,196 [INFO] Shape of combined features after preprocessing: (72, 7930)


Now, as I understand it, it should be perfectly normal for the validation set to produce fewer features than the training set, since it is a smaller portion of the whole dataset (I split the data 70% training, 15% validation, 15% testing). I apply the same pipeline to both sets, fitting it during training and then calling .predict during validation. Does anyone have a hint on what could be causing this issue?
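
To make the flow concrete, here is a self-contained toy version of what I think I am doing, with standard sklearn components standing in for my custom transformers; this version runs without the error:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

X_train = ["good movie", "bad movie", "great film", "terrible film"]
y_train = [1, 0, 1, 0]
X_val = ["a surprisingly good film"]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),                        # learns its vocabulary during fit
    ("featureSelection", SelectKBest(score_func=chi2, k=3)),
    ("classifier", LinearSVC()),
])
pipe.fit(X_train, y_train)       # fit on the training split only
print(pipe.predict(X_val))       # no shape error: transform reuses the fitted vocabulary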

As I understood it, this should be handled out of the box: at prediction time the transformers should take into account that they are getting values that are not necessarily the same features they saw during training, and simply set the missing ones to 0 or some "absent" value. Am I missing something?
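
For instance, that is how CountVectorizer behaves once it is fitted (I'm using it here purely as a stand-in for my TextWordCounter): transforming new texts keeps the number of columns fixed, and unseen words are simply ignored.

from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the cat sat on the mat", "dogs and cats are friends"]
val_texts = ["a completely different sentence about birds"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)   # learns a 10-word vocabulary from the training texts
X_val = vec.transform(val_texts)           # reuses that vocabulary, unseen words are dropped

print(X_train.shape)   # (2, 10)
print(X_val.shape)     # (1, 10)  <- same number of columns, even with no word overlap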

Thanks for your time and hopefully someone can lend me a hand with this :)
