Background
My sentiment analysis research draws on a variety of datasets. Recently I encountered one dataset that I cannot train on successfully. I mostly work with open data in .CSV file format, so Pandas and NumPy are heavily used.
During my research, one of the approaches I am trying is to integrate automated machine learning (AutoML). The library I chose is Auto-Keras, mainly its TextClassifier() wrapper function.
Main Problem
I have verified in the official documentation that TextClassifier() takes data as a NumPy array. However, when I load the data into a Pandas DataFrame and use .to_numpy() on the columns that I need for training, the following error keeps showing:
ValueError Traceback (most recent call last)
<ipython-input-13-1444bf2a605c> in <module>()
16 clf = ak.TextClassifier(overwrite=True, max_trials=2)
17
---> 18 clf.fit(x_train, y_train, epochs=3, callbacks=cbs)
19
20
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
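One way to narrow down an error like this is to check whether every element of the array passed to fit() is really a Python string. The sketch below is only an illustration: find_non_string_entries and x_demo are hypothetical names, and the sample array stands in for the real review column. It relies on the fact that pandas reads an empty CSV cell as a float NaN, which then sits inside an object-dtype array next to the strings.

```python
import numpy as np


def find_non_string_entries(values):
    """Return (index, value) pairs for entries that are not Python strings."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, str)]


# Hypothetical stand-in for the review column: two strings and one missing
# value stored as a float NaN inside an object-dtype array.
x_demo = np.array(["great film", float("nan"), "too long"], dtype=object)

bad_entries = find_non_string_entries(x_demo)
print(bad_entries)
```

Running the same helper on the real X array would list the positions of any entries that are floats rather than text.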
Error-related code sections
This is the section where I drop the unneeded Pandas DataFrame columns using .drop() and convert the needed columns to NumPy arrays using the to_numpy() function that Pandas provides.
df_src = pd.read_csv(get_data)
df_src = df_src.drop(columns=["Name", "Cast", "Plot", "Direction",
"Soundtrack", "Acting", "Cinematography"])
df_src = df_src.reset_index(drop=True)
X = df_src["Review"].to_numpy()
Y = df_src["Overall Sentiment"].to_numpy()
print(X, "\n")
print("\n", Y)
This is the main part that raises the error, where I perform StratifiedKFold() and, at the same time, use TextClassifier() to train and test the model.
fold = 0
for train, test in skf.split(X, Y):
    fold += 1
    print(f"Fold #{fold}\n")
    x_train = X[train]
    y_train = Y[train]
    x_test = X[test]
    y_test = Y[test]
    cbs = [tf.keras.callbacks.EarlyStopping(patience=3)]
    clf = ak.TextClassifier(overwrite=True, max_trials=2)
    # The line where it indicated the error.
    clf.fit(x_train, y_train, epochs=3, callbacks=cbs)
    pred = clf.predict(x_test)  # result data type is in lists of `string`
    ceval = clf.evaluate(x_test, y_test)
    metrics_test = metrics.classification_report(y_test, np.array(list(pred), dtype=int))
    print(metrics_test, "\n")
    print(f"Fold #{fold} finished\n")
Supplementary
I am sharing the full code related to the error through Google Colab, which you can help me diagnose here.
Edit notes
I have tried potential solutions, such as:
x_train = np.asarray(x_train).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
or
x_train = tf.data.Dataset.from_tensor_slices((x_train,))
y_train = tf.data.Dataset.from_tensor_slices((y_train,))
However, the problem remains.
One of the strings is equal to nan. Just remove this entry and the corresponding label.
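A minimal sketch of that removal, assuming the missing review shows up as NaN in the "Review" column after pd.read_csv(). The small df_demo frame is a hypothetical stand-in for the real data; the column names are taken from the question. Dropping the row at the DataFrame level, before calling to_numpy(), removes the text and its label together.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; column names follow the question.
df_demo = pd.DataFrame({
    "Review": ["great film", np.nan, "too long"],
    "Overall Sentiment": [1, 0, 0],
})

# Drop rows whose review text is missing, so each text stays aligned
# with its label, then renumber the index.
df_clean = df_demo.dropna(subset=["Review"]).reset_index(drop=True)

X = df_clean["Review"].to_numpy()
Y = df_clean["Overall Sentiment"].to_numpy()
print(X, Y)
```

In the question's code, the same dropna(subset=["Review"]) call would go on df_src before X and Y are extracted.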