PandasNotImplementedError : Using nested np.where() in a Koalas DataFrame returns error


I am converting code written with pandas to Koalas, but I get an error when I use numpy's where():

import pandas as pd
import numpy as np
import databricks.koalas as ks

data = {'credit': [123.23, 23423.56, 0, 0], 'debit': [0, 0, 234.21, 95.32]}

df = ks.DataFrame(data)

df['flag'] = np.where(
    df['credit'] != '',
    'C',
    np.where(
        df['debit'] != '',
        'D',
        ''
    )
)

This returns the error:

PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

I run out of memory if I convert the Koalas dataframe with to_numpy() or toPandas() to keep the code as is. The code has a lot of nested np.where() statements in it, as well as many other uses of numpy that I would very much like not to rewrite.

Is there a simple way to keep these np.where() calls (or any other numpy statement) in the code when using a Koalas dataframe?

I am aware there is a way to mimic np.where() using df.assign(flag=()), but I don't see how to use that method to mimic a nested condition. My attempt is below:

# Works, but does not include the second condition
df = df.assign(flag=df.debit.apply(lambda x: "D" if x != "" else ""))

# Does not work and returns an error
df = df.assign(flag= df.debit.apply(
  lambda x: "D" if x != "" else (
    df.credit.apply(
      lambda x: "C" if x != "" else ""))))

Error: PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects

There is 1 answer

G.G
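# Put the nested condition in one function and apply it row by row.
# The columns are numeric, so the checks compare against 0 rather than ''.
def function1(ss: ks.Series):
    if ss.credit != 0:
        return 'C'      # outer condition: a credit amount is present
    elif ss.debit != 0:
        return 'D'      # inner condition: a debit amount is present
    else:
        return ''       # neither amount is present

# axis=1 passes each row to function1
df.apply(function1, axis=1)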

Output:

0    C
1    C
2    D
3    D
dtype: object
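
A column-wise variant of the same logic is also possible with chained Series.mask() calls, which avoids a Python call per row. The sketch below is written against plain pandas so that it runs anywhere; Koalas documents a Series.mask() method as well, but whether this carries over unchanged to a Koalas DataFrame is an assumption that has not been checked here. The helper name flag_rows is made up for illustration.

```python
import pandas as pd

# Plain pandas stands in for Koalas here (assumption: not run on Koalas).
data = {'credit': [123.23, 23423.56, 0, 0], 'debit': [0, 0, 234.21, 95.32]}
df = pd.DataFrame(data)

def flag_rows(frame):
    """Hypothetical helper: 'C' if credit is non-zero, else 'D' if debit is non-zero, else ''."""
    out = frame.copy()
    # Start from the default value of the innermost else branch.
    out['flag'] = ''
    # Apply the inner condition first...
    out['flag'] = out['flag'].mask(out['debit'] != 0, 'D')
    # ...then the outer condition, so that 'C' takes precedence over 'D'.
    out['flag'] = out['flag'].mask(out['credit'] != 0, 'C')
    return out

df = flag_rows(df)
print(df['flag'].tolist())  # ['C', 'C', 'D', 'D']
```

The conditions are applied in reverse order of precedence: each mask() overwrites only the rows where its condition holds, so the last one applied corresponds to the outermost np.where().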