I am working with a very large dataframe (~3 million rows) and I need the count of values from multiple columns, grouped by time-related data.

I have tried stacking the columns, but the resulting dataframe was very long and wouldn't fit in memory. Similarly, df.apply ran into memory issues.

For example, if my sample dataframe is:

id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,

I have written the following script that does what is needed on a small sample but fails on the large dataframe.

df.set_index(["id", "date"], inplace=True)
df = df.stack(level=[0])
df = df.groupby(level=[0,1]).value_counts()
df = df.unstack(level=[1,2])

I also have a solution via apply, but it runs into the same memory issues.

The expected result is:

date 1/1/2014           1/4/2014 ...  1/5/2014 1/4/2014 1/7/2014     
          abc  xyz  qwe      qwe ...       xyz      xyz      abc  qwe
id                               ...                                 
1         4.0  2.0  1.0      NaN ...       NaN      NaN      NaN  NaN
2         2.0  NaN  NaN      4.0 ...       NaN      NaN      NaN  NaN
3         NaN  NaN  NaN      NaN ...       2.0      NaN      NaN  NaN
4         NaN  NaN  NaN      NaN ...       NaN      1.0      1.0  1.0

I am looking for a more optimized version of what I have written.

Thanks for the help!

1 Answer

andy

You don't want to use stack here. Another approach is to build a crosstab of id against each combination of date and field column, concatenate the results, then groupby() the index and sum. Use a list comprehension over df.columns[2:] to create each crosstab (note: I assume the first two columns are id and date, as in your sample):

pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum()

Out[497]:
   1/1/2014           1/2/2014 1/4/2014           1/5/2014      1/7/2014
        abc  qwe  xyz      abc      abc  qwe  xyz      abc  xyz      abc  qwe
id
1         4  1.0  2.0      0.0      0.0  0.0  0.0      0.0  0.0      0.0  0.0
2         2  0.0  0.0      0.0      1.0  4.0  0.0      2.0  0.0      0.0  0.0
3         0  0.0  0.0      2.0      0.0  0.0  0.0      0.0  2.0      0.0  0.0
4         0  0.0  0.0      0.0      1.0  0.0  1.0      0.0  0.0      1.0  1.0

I think showing 0 is better than NaN. However, if you want NaN instead of 0, just chain an additional replace (np here is numpy):

pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]]).groupby(level=0).sum().replace({0: np.nan})

Out[501]:
   1/1/2014           1/2/2014 1/4/2014           1/5/2014      1/7/2014
        abc  qwe  xyz      abc      abc  qwe  xyz      abc  xyz      abc  qwe
id
1       4.0  1.0  2.0      NaN      NaN  NaN  NaN      NaN  NaN      NaN  NaN
2       2.0  NaN  NaN      NaN      1.0  4.0  NaN      2.0  NaN      NaN  NaN
3       NaN  NaN  NaN      2.0      NaN  NaN  NaN      NaN  2.0      NaN  NaN
4       NaN  NaN  NaN      NaN      1.0  NaN  1.0      NaN  NaN      1.0  1.0
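
For completeness, here is a self-contained sketch of the same approach run against the sample data from the question. The imports and the `result` variable name are my own; the crosstab/concat/groupby/sum/replace chain is exactly the one shown above.

import io

import numpy as np
import pandas as pd

# Rebuild the sample data from the question
csv = """id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,"""
df = pd.read_csv(io.StringIO(csv))

# One crosstab per value column, concatenate the partial counts, sum per id
result = (
    pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]])
      .groupby(level=0)
      .sum()
      .replace({0: np.nan})  # optional: NaN instead of 0, as in the expected output
)
print(result)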