I am working with a very large DataFrame (~3 million rows) and I need counts of the values from multiple columns, grouped by time-related data.
I have tried stacking the columns, but the resulting dataframe was very long and wouldn't fit in memory. Similarly, df.apply ran into memory issues.
For example, suppose my sample dataframe looks like this:
id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,
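For reproducibility, the sample above can be loaded like this (io.StringIO just keeps the snippet self-contained; in reality the data comes from a much larger file):

import io
import pandas as pd

csv_data = """id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,"""

# empty fields are read as NaN, which stack() drops later on
df = pd.read_csv(io.StringIO(csv_data))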
I have written the following script, which does what I need on a small sample but fails on the large dataframe:
df.set_index(["id", "date"], inplace=True)
# stack the three field columns into one long Series indexed by (id, date, field)
df = df.stack()
# count how often each value occurs per (id, date) group
df = df.groupby(level=[0, 1]).value_counts()
# pivot date and value out of the index into the columns
df = df.unstack(level=[1, 2])
I also have a solution via apply, but it has the same complications.
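For reference, that variant looks roughly like this (a sketch of the approach rather than my exact code; count_values is just an illustrative helper name, and it runs on the frame before the set_index call above):

def count_values(group):
    # flatten one (id, date) group's field columns and count each value;
    # stack() silently drops the NaN entries
    return group[["field1", "field2", "field3"]].stack().value_counts()

result = df.groupby(["id", "date"]).apply(count_values).unstack(level=[1, 2])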
The expected result is:
date 1/1/2014           1/4/2014  ... 1/5/2014 1/4/2014 1/7/2014
          abc  xyz  qwe      qwe  ...      xyz      xyz      abc  qwe
id                                ...
1         4.0  2.0  1.0      NaN  ...      NaN      NaN      NaN  NaN
2         2.0  NaN  NaN      4.0  ...      NaN      NaN      NaN  NaN
3         NaN  NaN  NaN      NaN  ...      2.0      NaN      NaN  NaN
4         NaN  NaN  NaN      NaN  ...      NaN      1.0      1.0  1.0
I am looking for a more optimized version of what I have written.
Thanks for the help!!