I have a data frame with 60 numerical columns (dtype=float64) and 1 categorical column (dtype=object). I resolved the NaN, missing, and infinite values before passing the data to the TFDF random forest model. However, I get the following error: ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float). So I tried converting each column of the data frame to a tensor individually, which worked fine.

# Extract one column from the DataFrame
column_values = valid_data['column_name'].values

# Convert the column values to a TensorFlow tensor
tensor_column = tf.convert_to_tensor(column_values)

However, when I try to convert the entire data frame to a tensor, or pass the entire data frame to the TFDF random forest model, the same "Unsupported object type" error occurs: data = tf.convert_to_tensor(valid_data)

The data is very similar to the penguins dataset at this link: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv

Any suggestions would be appreciated. Thank you

There is 1 answer

rstz On

TF-DF author here. It’s not easy to solve this issue without seeing the data, so I’m going to make a few guesses.

  • Your categorical column may contain both float values and string values. This is not supported by TF-DF. You can work around this and just cast the entire column to string: df["categorical_column"] = df["categorical_column"].astype(str)
  • Maybe one of your float columns somehow got assigned dtype object and it cannot be resolved automatically by pandas. Very often, this can be solved by setting df["float_column"] = df["float_column"].astype(np.float32). If you’re dealing with pure numpy arrays (as in the second part of your question), x = np.asarray(x).astype('float32') should do the trick, see the answers to this question for details.
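Both guesses above can be checked before changing anything. Here is a minimal sketch, assuming only pandas and NumPy; the helper name summarize_object_columns and the toy penguins-like frame are hypothetical and not part of TF-DF. It lists the Python types found in each object-dtype column, so a column that mixes str and float (NaN counts as a float) stands out:

```python
import numpy as np
import pandas as pd


def summarize_object_columns(df: pd.DataFrame) -> dict:
    """Map each object-dtype column to the set of Python type names it holds."""
    report = {}
    for name in df.columns:
        if df[name].dtype == object:
            report[name] = {type(value).__name__ for value in df[name]}
    return report


# Hypothetical toy data: one clean float column, one categorical column with
# a missing value, and one float column that ended up with dtype object.
df = pd.DataFrame({
    "bill_length_mm": [39.1, 40.3, np.nan],
    "species": pd.Series(["Adelie", np.nan, "Gentoo"], dtype=object),
    "flipper_length_mm": pd.Series([181.0, "186.0", 195.0], dtype=object),
})

report = summarize_object_columns(df)
for column, type_names in report.items():
    print(column, sorted(type_names))
```

Any column reported with both float and str is a candidate for one of the two casts suggested above.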

Note that TF-DF does support heterogeneous datasets out of the box. NaN / missing values are handled natively by the model (see these hyperparameters for details: 1, 2 control how this is done), and categorical columns are also handled natively (details here). Importantly, one-hot encoding categorical columns will likely hurt model quality and is not recommended.
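As a side note on why single columns convert while the whole frame does not: a likely explanation, though I have not checked it against your data, is that pandas has to pick one NumPy dtype for the entire frame, and a mix of float and object columns becomes an object array. The sketch below uses only pandas and NumPy, and the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one float column and one categorical (object) column.
df = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, 3250.0],
    "island": pd.Series(["Torgersen", "Biscoe", "Dream"], dtype=object),
})

whole_frame = df.to_numpy()                                   # mixed columns
single_column = df["body_mass_g"].to_numpy()                  # one float column
numeric_only = df.select_dtypes(include="number").to_numpy()  # floats only

print(whole_frame.dtype)    # object
print(single_column.dtype)  # float64
print(numeric_only.dtype)   # float64
```

An object array is what tf.convert_to_tensor rejects, which is why the column-wise conversion path is the one to rely on for mixed data.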

The only clean-up that’s necessary is getting rid of infinite values and making sure categorical columns do not contain float values:

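import numpy as np
import tensorflow_decision_forests as tfdf

def replace_inf(df):
    # Swap +inf / -inf for large finite values.
    df = df.replace([np.inf], 1e30)
    return df.replace([-np.inf], -1e30)

result_df = replace_inf(df.copy())
result_df["categorical_column"] = result_df["categorical_column"].astype(str)
# Pass the cleaned frame (result_df), not an undefined x.
tfdf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(result_df, label="mylabel")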

If you think there’s a bug in tfdf.keras.pd_dataframe_to_tf_dataset, please report it on the project’s GitHub repository.