How do sklearn Imputers behave when all the values in a column are missing from the input?


I have a dataset with a large number of columns. I have written my application so that if any value in a given column is missing, it is filled in by an imputer using the mean strategy.

However, I am a bit concerned about the case where every value in a column is missing: how would the imputer behave, and what would be the right approach in such a case?


There is 1 answer

KevinD On

If in a given column, all data is missing, then the Imputer will discard that column.

Here is an example with 5 samples and 2 columns, where one sample has a missing value in the second column:

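import numpy as np
from sklearn.preprocessing import Imputer  # legacy class, removed in scikit-learn 0.22

X = np.array([[1,1],[1,2],[1,1],[1,2],[1,np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))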

This prints out

[[ 1.   1. ]
 [ 1.   2. ]
 [ 1.   1. ]
 [ 1.   2. ]
 [ 1.   1.5]]

However, if all data in the second column is missing:

X = np.array([[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan],[1,np.nan]])
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imputer.fit_transform(X))

We obtain:

[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]

This default behaviour could be the right approach in that case, because this column (i.e. this feature) cannot be used anyway.
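If you would rather not rely on the imputer silently changing the number of columns, one option is to detect the all-missing columns yourself before imputing. The following is only a minimal sketch, not part of the original answer: it assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer` (the replacement for the legacy `Imputer`), and the variable names `empty_cols` and `X_imputed` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same toy data as above: the second column has no observed values.
X = np.array([[1, np.nan]] * 5, dtype=float)

# Flag the columns in which every entry is NaN, before any imputation.
empty_cols = np.isnan(X).all(axis=0)
print("all-missing columns:", np.flatnonzero(empty_cols))

# Drop those columns explicitly, then mean-impute the remaining ones.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X[:, ~empty_cols])
print(X_imputed.shape)
```

Doing the check up front makes it explicit which features were removed, so the same column selection can be applied to any later input. I believe newer scikit-learn releases also offer a `keep_empty_features` option on `SimpleImputer` to keep such columns instead of dropping them, but check the documentation for your installed version before relying on it.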