Filter NaN values in Tensorflow dataset


Is there an easy way to filter all entries containing a nan value from a tensorflow.data.Dataset instance? Like the dropna method in Pandas?


Short example:

import numpy as np
import tensorflow as tf

X = tf.data.Dataset.from_tensor_slices([[1,2,3], [0,0,0], [np.nan,np.nan,np.nan], [3,4,5], [np.nan,3,4]])
y = tf.data.Dataset.from_tensor_slices([np.nan, 0, 1, 2, 3])
ds = tf.data.Dataset.zip((X,y))
ds = foo(ds)  # foo(x) = ?
for x in iter(ds): print(str(x))

What can I use for foo(x) to get the following output:

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>)
(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([3., 4., 5.], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=2.0>)

If you want to try for yourself, here is a Google Colab notebook.


There are 3 answers

Nicolas Gervais On BEST ANSWER

I had a slightly different approach from the existing answer: rather than summing up the NaN flags, I'm using tf.reduce_any:

filter_nan = lambda x, y: not tf.reduce_any(tf.math.is_nan(x)) and not tf.math.is_nan(y)

ds = tf.data.Dataset.zip((X,y)).filter(filter_nan)

list(ds.as_numpy_iterator())
[(array([0., 0., 0.], dtype=float32), 0.0),
 (array([3., 4., 5.], dtype=float32), 2.0)]
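If the Python not/and in the lambda ever trips up graph tracing, the same predicate can be spelled out with explicit TensorFlow ops. This is a self-contained sketch of that variant, reusing the datasets from the question:

```python
import numpy as np
import tensorflow as tf

def filter_nan(x, y):
    # Keep the element only if neither the feature row nor the label contains NaN
    x_ok = tf.logical_not(tf.reduce_any(tf.math.is_nan(x)))
    y_ok = tf.logical_not(tf.math.is_nan(y))
    return tf.logical_and(x_ok, y_ok)

X = tf.data.Dataset.from_tensor_slices(
    [[1, 2, 3], [0, 0, 0], [np.nan, np.nan, np.nan], [3, 4, 5], [np.nan, 3, 4]])
y = tf.data.Dataset.from_tensor_slices([np.nan, 0, 1, 2, 3])
ds = tf.data.Dataset.zip((X, y)).filter(filter_nan)
```

This keeps only the ([0, 0, 0], 0) and ([3, 4, 5], 2) pairs, matching the output requested in the question.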
TayTay On

How about:

def any_nan(t):
    return tf.reduce_sum(
        tf.cast(
            tf.math.is_nan(t),
            tf.int32,
        )
    ) > tf.constant(0)


>>> ds_filtered = ds.filter(lambda x, y: not any_nan(x) and not any_nan(y))
>>> for x in iter(ds_filtered): print(str(x))
(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=0.0>)
(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([3., 4., 5.], dtype=float32)>, <tf.Tensor: shape=(), dtype=float32, numpy=2.0>)
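As a quick sanity check, any_nan can also be called directly on tensors in eager mode (the input values here are just illustrative):

```python
import tensorflow as tf

def any_nan(t):
    # Count NaN entries; True if the count is positive
    return tf.reduce_sum(
        tf.cast(tf.math.is_nan(t), tf.int32)
    ) > tf.constant(0)

has_nan = any_nan(tf.constant([1.0, float("nan")]))  # NaN present
no_nan = any_nan(tf.constant([1.0, 2.0]))            # no NaN
```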
Mathemilda On

There is a trade-off between the number of columns and model performance, and it is usually better to keep as many feature columns as you can. I recommend first dropping all rows with NaN values in the target variable, and then removing any columns that still contain NaN values among the remaining features. If the missing entries in a feature variable coincide with missing target values, this order lets you keep more columns than simply removing everything with missing entries. This advice is general: you may also choose to keep feature columns that are missing only a few entries and instead remove the rows containing those entries.

Here I assume that your tensor holds the features and the target variable as columns, and that y_indx is the column index of the target. You can use a suitable boolean mask as well.

The following function drops rows with NaN values in the target column from tensor X. It also returns a boolean mask marking the kept rows; you can omit it from the return statement if you do not want to keep it. Note that tensors do not support NumPy-style boolean indexing, so tf.boolean_mask is used instead:

def drop_na_rows(X, y_indx):
    # True for rows whose target entry is not NaN
    not_nan = tf.math.logical_not(tf.math.is_nan(X[:, y_indx]))
    return tf.boolean_mask(X, not_nan, axis=0), not_nan
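A minimal self-contained sketch of how this row filter can be used (restating the function with tf.boolean_mask, since plain boolean indexing is not supported on tensors; the sample matrix is hypothetical, with the target in the last column):

```python
import numpy as np
import tensorflow as tf

def drop_na_rows(X, y_indx):
    # True for rows whose target entry is not NaN
    not_nan = tf.math.logical_not(tf.math.is_nan(X[:, y_indx]))
    return tf.boolean_mask(X, not_nan, axis=0), not_nan

# Three rows; the middle row has a NaN target and should be dropped
X = tf.constant([[1.0, 2.0, 0.0],
                 [3.0, 4.0, np.nan],
                 [5.0, 6.0, 1.0]])
X_clean, mask = drop_na_rows(X, y_indx=-1)
```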

The function below drops columns with missing values. For the cols_to_keep parameter you can pass a boolean mask (or indices suitable for tf.boolean_mask); if you do not provide anything, the function computes a boolean mask of the columns that contain no NaN values and returns it alongside the filtered tensor. You can omit the mask from the return statement, too.

def drop_na_cols(X, cols_to_keep=None):
    # If no mask is given, keep only columns that contain no NaN at all
    if cols_to_keep is None:
        cols_to_keep = tf.math.logical_not(
            tf.reduce_any(tf.math.is_nan(X), axis=0))
    return tf.boolean_mask(X, cols_to_keep, axis=1), cols_to_keep
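And a self-contained usage sketch for the column filter (restating it with tf.math.is_nan and tf.boolean_mask, since summing a tensor does not detect NaN columns and tensors do not support fancy column indexing; the sample matrix is hypothetical):

```python
import numpy as np
import tensorflow as tf

def drop_na_cols(X, cols_to_keep=None):
    # If no mask is given, keep only columns that contain no NaN at all
    if cols_to_keep is None:
        cols_to_keep = tf.math.logical_not(
            tf.reduce_any(tf.math.is_nan(X), axis=0))
    return tf.boolean_mask(X, cols_to_keep, axis=1), cols_to_keep

# The middle column contains a NaN and should be dropped
X = tf.constant([[1.0, np.nan, 3.0],
                 [4.0, 5.0, 6.0]])
X_clean, keep = drop_na_cols(X)
```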