I am converting code written with Pandas to Koalas, but I'm running into an error when using np.where():
import pandas as pd
import numpy as np
import databricks.koalas as ks
data = {'credit': [123.23, 23423.56, 0, 0], 'debit': [0, 0, 234.21, 95.32]}
df = ks.DataFrame(data)
df['flag'] = np.where(
df['credit'] != '',
df['debit'] != '',
This returns the error:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
I run out of memory if I try to convert the Koalas dataframe with to_numpy() or toPandas() to keep the code as is. This code has a lot of nested np.where() statements in it, as well as many other uses of numpy I would very much like not to rewrite.
I'm unclear whether there's a simple way to keep these np.where() calls (or any other numpy statement) in the code when using a Koalas dataframe.
I am aware there is a way to mimic np.where() using df.assign(flag=()), but I am unclear how to use that method to mimic a nested condition. My attempts are below:
# works but does not include the second condition
df = df.assign(flag=df.debit.apply(lambda x: "D" if x != "" else ""))
# Does not work and returns an error
df = test_df.assign(flag=df.debit.apply(
    lambda x: "D" if x != "" else (
        lambda x: "C" if x != "" else "")))
Error: PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
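To make the intended logic concrete, here is a minimal plain-Python sketch of the nested condition, independent of Koalas. The helper name flag_row, the "C"/"D" labels, the order of the checks, and the sample rows are all assumptions inferred from my attempts above, not anything provided by Koalas. It also shows that the inner lambda in my second attempt is returned as a function object instead of being called. I have not confirmed whether that is what triggers the PicklingError.

```python
def flag_row(credit, debit):
    """Return "D", "C" or "" for one row (hypothetical helper, not a Koalas API)."""
    # Assumption: debit is checked first, then credit, mirroring the attempts above.
    return "D" if debit != "" else ("C" if credit != "" else "")


# Hypothetical rows as (credit, debit), using "" as the "empty" marker.
rows = [(123.23, ""), ("", 234.21), ("", "")]
flags = [flag_row(credit, debit) for credit, debit in rows]

# The second attempt nests a lambda instead of a value, so its else branch
# yields a function object rather than a string.
nested = lambda x: "D" if x != "" else (lambda x: "C" if x != "" else "")
else_branch_is_function = callable(nested(""))
```

What I am after is a way to express the same chained condition across two Koalas columns without collecting the data to the driver.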