How do I apply a string function to a column using Dask?


I have a large dataset and was recently introduced to Dask. I am trying to tokenise the text in each row. This is easy to do in pandas, as shown below, but when I try the same thing with Dask (see the second block of code below) I get this error:

AttributeError: 'DataFrame' object has no attribute 'lower'

import multiprocessing

import pandas as pd
import dask
import dask.dataframe as dd

def to_lower(text):
    return text.lower()

df_2016 = pd.read_csv("2016_Cleaned_DroppedDup.csv")
df_2016['token2'] = df_2016['token2'].apply(lambda x: pr.to_lower(x))

With Dask:

df_2016 = dd.from_pandas(df_2016, npartitions = 4 * multiprocessing.cpu_count())
df_2016 = df.2016.map_partitions.(lambda df: df.apply(lambda x: pr.to_lower(x))).compute(scheduler = 'processes')

Answer by jtorca (best answer)

In future, I would recommend providing code that creates a dataframe, so that no one has to guess what your data looks like, though this case was simple enough. There are also syntax errors in the code you provided: for example, df.2016.map_partitions should be df_2016.map_partitions. It is also not clear what the pr object is in your code.

Given these errors, I rewrote the example the way I would operate on strings in Dask and pandas, using the .str accessor, as a minimal working example similar to your setting. There is very little difference in syntax between pandas and Dask for this.

EDIT: Added a user-supplied function (to_lower) to give an example using .apply in Dask.

import pandas as pd
import dask.dataframe as dd

def to_lower(text):
    return text.lower()

# using pandas
df_2016 = pd.DataFrame({'token2':['HI']*100 + ['YOU']*100})
df_2016['token2_low'] = df_2016['token2'].str.lower()
df_2016['token2_low_apply'] = df_2016['token2'].apply(to_lower)
df_2016
    token2 token2_low token2_low_apply
0       HI         hi               hi
1       HI         hi               hi
2       HI         hi               hi
3       HI         hi               hi
4       HI         hi               hi
..     ...        ...              ...
195    YOU        you              you
196    YOU        you              you
197    YOU        you              you
198    YOU        you              you
199    YOU        you              you

[200 rows x 3 columns]

# using dask
ddf_2016 = dd.from_pandas(df_2016[['token2']], npartitions=10)
ddf_2016['token2_low'] = ddf_2016['token2'].str.lower()
ddf_2016['token2_low_apply'] = ddf_2016['token2'].apply(to_lower, meta=('token2', 'object'))

ddf_2016.compute()
    token2 token2_low token2_low_apply
0       HI         hi               hi
1       HI         hi               hi
2       HI         hi               hi
3       HI         hi               hi
4       HI         hi               hi
..     ...        ...              ...
195    YOU        you              you
196    YOU        you              you
197    YOU        you              you
198    YOU        you              you
199    YOU        you              you

[200 rows x 3 columns]