How to impute columns with categorical datatype in scikit-learn


I have a dataset whose features include both numeric and object dtypes, and some of the object-dtype features have missing values. Following the instructions in another post, I created a modified version of Imputer that handles missing values for both numeric and categorical data, but when I apply it to my dataset it raises an AttributeError. I believe I am making a silly mistake in the definition of the imputer's fit method, and I would appreciate your insight. Here is my code and the error:

import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer

#load the data
path='~/Desktop/ML/Hands_on/housing_train.csv'
path=os.path.expanduser(path)
data=pd.read_csv(path)

#select the names of columns that have dtype=object and contain missing data
object_data=data.select_dtypes(include=['object'])
object_data_null=[]
for col in object_data.columns:
    if object_data[col].isnull().any():
        object_data_null.append(col)

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

imputer=GeneralImputer(strategy='most_frequent', axis=1)

for i in object_data_null:
    imputer.fit(data[i])
    data[i]=imputer.transform(data[i])


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-989e78355872> in <module>()
     38 object_data_null
     39 for i in object_data_null:
---> 40     imputer.fit(data[i])
     41     data[i]=imputer.transform(data[i])
     42 

<ipython-input-29-989e78355872> in fit(self, X, y)
     23         if self.strategy == 'most_frequent':
     24             self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
---> 25             self.statistics_ = self.fills.values
     26             return self
     27         else:

AttributeError: 'str' object has no attribute 'values'

Answer by Vivek Kumar (best answer)

For a 1-sized object, the squeeze() method returns a scalar, as mentioned in the documentation.

In most cases (and for all columns here), a column has a single mode, so mode() yields a single value and squeeze() returns just the string rather than a Series.
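The behaviour described above can be checked on a small example. This is only a sketch: the column name and values are made up for illustration and are not taken from the housing data in the question.

```python
import pandas as pd

# Hypothetical object-dtype column with one missing value
col = pd.Series(["a", "b", "a", None], name="color")

# mode() ignores the missing value; with a single mode the result is 1x1
modes = pd.DataFrame(col).mode(axis=0)
fill = modes.squeeze()

print(type(fill).__name__)      # a plain Python string, not a Series
print(hasattr(fill, "values"))  # strings have no .values attribute

# With a tie there are two modes, so squeeze() returns a Series instead
tie = pd.Series(["a", "b"], name="color")
tie_fill = pd.DataFrame(tie).mode(axis=0).squeeze()
print(type(tie_fill).__name__)
```

The tie case is the reason for the "in most cases" qualifier: when a column has several modes, squeeze() does not collapse the result to a scalar.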

A plain string has no .values attribute, so there is no need to call .values on it. Change your fit() method to remove that:

def fit(self, X, y=None):
    if self.strategy == 'most_frequent':
        self.fills = pd.DataFrame(X).mode(axis=0).squeeze()

        # Removed .values from the line below
        self.statistics_ = self.fills
        return self
    else:
        # Unchanged from the question: defer to the parent class
        return Imputer.fit(self, X, y=y)
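For comparison, the same "fill each object column with its most frequent value" idea can be sketched with pandas alone. This is not the asker's GeneralImputer, just an illustration of the fill step. The DataFrame and its column names are hypothetical, and taking `.iloc[0]` (the first mode when there is a tie) is an assumption that differs from the squeeze() call above.

```python
import pandas as pd

# Hypothetical stand-in for the object-dtype columns of the housing data
df = pd.DataFrame({
    "garage": ["attached", None, "attached", "detached"],
    "fence": [None, "wood", "wood", None],
})

# Same selection logic as in the question: columns that contain missing values
cols_with_null = [c for c in df.columns if df[c].isnull().any()]

for name in cols_with_null:
    # mode() returns a Series; .iloc[0] always gives a scalar, even on ties
    fill = df[name].mode().iloc[0]
    df[name] = df[name].fillna(fill)

print(df)
```

Using `.iloc[0]` avoids the scalar-versus-Series ambiguity of squeeze(), at the cost of silently picking one value when several are equally frequent.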