Random projection in Python Pandas using a dataframe containing NaN values

1.1k views Asked by At

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality sensitive hashing using random projections to reduce the dimension to 25 components, specifically with thesklearn.random_projection.GaussianRandomProjection class. However, when I run:

tx = random_projection.GaussianRandomProjection(n_components = 25) data25 = tx.fit_transform(data)

I get Input contains NaN, infinity or a value too large for dtype('float64'). Is there a work-around to this? I tried changing all the NaN values to a value that is never present in my dataset, such as -1. How valid would my output be in this case? I'm not an expert behind the theory of locality sensitive hashing/random projections so any insight would be helpful as well. Thanks.

1

There are 1 answers

0
Andreus On

NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.

You don't want to just substitute a random value like -1. If you are inclined to do that, use one of the Imputer classes. Otherwise, you are likely to very substantially change the distances between points. You likely want to preserve distances as much as possible if you are using random projection:

The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.

However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentlemen has specialized in studying.

If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model like a Restricted Boltzman Machine and use that to fill in missing values:

rbm = sklearn.neural_network.BernoulliRBM().fit( data_with_no_nans )
mean_imputed_data = sklearn.preprocessing.Imputer().fit_transform( all_data )
rbm_imputation = rbm.gibbs( mean_imputed_data )
nan_mask = np.isnan( all_data )
all_data[ nan_mask ] = rbm_imputation[ nan_mask ]

Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest neighbors model on all the variables except that column using all complete rows. Then, for a row missing that column, find the k nearest neighbors and use the average value among them. (This gets very costly, especially if you have rows with more than one missing value, as you will have to train a model for every combination of missing columns).