I have a large dataset and was recently introduced to Dask. I am trying to tokenise the text in each row. This is easy to do in pandas, as shown below, but when I try the same thing with Dask (see the second block of code below) I get the following error:
AttributeError: 'DataFrame' object has no attribute 'lower'
import pandas as pd
import dask
import dask.dataframe as dd
def to_lower(text):
    return text.lower()
df_2016 = pd.read_csv("2016_Cleaned_DroppedDup.csv")
df_2016['token2'] = df_2016['token2'].apply(lambda x: pr.to_lower(x))
With Dask:
df_2016 = dd.from_pandas(df_2016, npartitions = 4 * multiprocessing.cpu_count())
df_2016 = df.2016.map_partitions.(lambda df: df.apply(lambda x: pr.to_lower(x))).compute(scheduler = 'processes')
I would recommend that in future you provide code that creates a dataframe, so no one has to guess what your data actually looks like, but I think this case was simple enough. Also, I think there were syntax errors in the code you did provide, e.g., df.2016.map_partitions should be df_2016.map_partitions. It is also not clear what the pr object is in your code.
Given these errors, I just rewrote what I would do to operate on strings in dask and pandas using the .str accessor, in a minimal working example similar to your setting. There is very little difference in syntax between pandas and dask for this.
EDIT: Added a user-supplied function (to_lower) to give an example using .apply in dask.