I have a DataFrame with 60 numerical columns (dtype=float64) and 1 categorical column (dtype=object). I resolved NaN, missing, and infinite values before feeding the data to the TF-DF random forest model. However, I get the following error: ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float). So I tried converting individual columns of the DataFrame to tensors, which worked fine:
import tensorflow as tf

# Extract one column from the DataFrame
column_values = valid_data['column_name'].values
# Convert the column values to a TensorFlow tensor
tensor_column = tf.convert_to_tensor(column_values)
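Looping over all of the columns in the same way also works (a rough sketch of the check I ran):
# Convert every column individually; each one is accepted on its own
for name in valid_data.columns:
    tensor_column = tf.convert_to_tensor(valid_data[name].values)
    print(name, tensor_column.dtype)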
However, when I try to convert the entire DataFrame to a tensor, or pass the entire DataFrame to the TF-DF random forest model, the same unsupported-type error occurs:
data = tf.convert_to_tensor(valid_data)
The data is very similar to the dataset at the link below: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv
Any suggestions would be appreciated. Thank you
TF-DF author here. It’s not easy to solve this issue without seeing the data, so I’m going to make a few guesses.
df["categorical_column"] = df["categorical_column"].astype(str)
df["float_column"] = df["float_column"].astype(np.float32)
. If you’re dealing with pure numpy arrays (as in the second part of your question),x = np.asarray(x).astype('float32')
should do the trick, see the answers to this question for details.Note that TF-DF does support heterogeneous datasets out of the box. NaN / missing values are also handled natively by the model (see theses hyperparameters for details 1, 2 to control how this is done) and categorical columns are also handled natively (details here. Importantly, one-hot encoding categorical columns will likely hurt model quality and is not recommended.
The only clean-up that’s necessary is getting rid of infinite values and make sure categorical columns do not contain float values:
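For a penguins-style DataFrame, the whole preparation can be sketched end to end as follows (the file path and column names such as "species" are illustrative; tfdf.keras.pd_dataframe_to_tf_dataset does the DataFrame-to-dataset conversion for you, so tf.convert_to_tensor is not needed at all):
import numpy as np
import pandas as pd
import tensorflow_decision_forests as tfdf
# Load a penguins-like DataFrame (path is illustrative)
df = pd.read_csv("penguins.csv")
# Replace infinite values with NaN; missing values are handled natively by TF-DF
df = df.replace([np.inf, -np.inf], np.nan)
# Make sure categorical columns hold plain strings and numerical columns are float32
df["species"] = df["species"].astype(str)
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].astype(np.float32)
# Convert the whole DataFrame to a tf.data.Dataset and train a random forest
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="species")
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)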
If you think there's a bug with tfdf.keras.pd_dataframe_to_tf_dataset, please report it on the project's GitHub repository.