Groupby in pandas, filling in missing groups with []


Any help on a more precise title to this question is welcome..

I have a pandas dataframe of customer-level observations, each recording a date and the items consumed by the customer on that date. It looks like this.

df
store    day   items
 a        1     4
 a        1     3
 a        2     1
 a        3     5
 a        4     2 
 a        5     9
 b        1     1 
 b        2     3

Each observation in this data set pertains to a single store-day combination, BUT a store-day is only listed when a positive number of items was consumed, i.e. df['items'] > 0 for every store-day pair.

So I do not have, for example

b         3      0
b         4      0 
b         5      0

etc.

I need to group this dataframe by store and day, and then run some operations on all obs in each store-day group.

But I want these groups to exist, with length 0 (empty sets), and I am not sure of the best way to do this. This is a very simple toy dataset; the real one is very large.

I don't really want to add in the observations BEFORE using df.groupby(['store', 'day']), because I run OTHER calculations on each store-day group that use the length of each group as a measure of the number of customers recorded at a specific store on a specific day. So if I add in observations for b3 and b4, it looks like two customers visited store b on days 3 and 4, when in fact nobody bought anything at store b on those days.
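For reference, here is a minimal construction of the toy frame above, so the snippets in the answers can be run directly:

import pandas as pd

# Toy data from the question: one row per customer purchase, so
# (store, day) pairs with no customers never appear at all.
df = pd.DataFrame({
    'store': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'],
    'day':   [1, 1, 2, 3, 4, 5, 1, 2],
    'items': [4, 3, 1, 5, 2, 9, 1, 3],
})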


There are 3 answers

Answer from chrisb

The 'pandas' way of representing those would probably be to code them as missing data, like:

In [562]: df
Out[562]: 
  store  day  items
0     a    1      4
1     a    1      3
2     a    2      1
3     a    3      5
4     a    4      2
5     a    5      9
6     b    1      1
7     b    2      3
8     b    3    NaN
9     b    4    NaN

Then, in your aggregation to count customers, you could use count(), which excludes missing values, for example:

In [565]: df.groupby('store')['items'].count()
Out[565]: 
store
a        6
b        2
Name: items, dtype: int64
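As a side note, count() ignores NaN while size() counts every row, so with the NaN-coded frame you get both the customer-only view and the row-level view; a small sketch, assuming the frame shown above:

# count() skips the NaN 'items' rows, so empty store-days come out as 0;
# size() counts every row, including the placeholder NaN rows.
customers = df.groupby(['store', 'day'])['items'].count()
rows = df.groupby(['store', 'day']).size()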

EDIT:

In terms of adding missing values, here are a couple of thoughts. Say you have a DataFrame that contains just the missing pairs, like this:

In [571]: df_missing
Out[571]: 
  store  day
8     b    3
9     b    4

Then you could just append these to your existing DataFrame to fill in the missing rows, like this:

In [574]: pd.concat([df, df_missing], ignore_index=True)
Out[574]: 
   day  items store
0    1      4     a
1    1      3     a
2    2      1     a
3    3      5     a
4    4      2     a
5    5      9     a
6    1      1     b
7    2      3     b
8    3    NaN     b
9    4    NaN     b
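How df_missing gets built is up to you; one hypothetical sketch is to take the full set of (store, day) pairs you expect and drop the ones already present in the data (this assumes every store should cover days 1 through 5; adjust the expected index to your real calendar):

import pandas as pd

# Hypothetical construction of df_missing: all expected (store, day)
# pairs minus the pairs that already occur in df.
expected = pd.MultiIndex.from_product([['a', 'b'], range(1, 6)],
                                      names=['store', 'day'])
present = pd.MultiIndex.from_frame(df[['store', 'day']])
df_missing = expected.difference(present).to_frame(index=False)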

Alternatively, if you have a DataFrame with the pairs you 'should' have (a 1-5, b 1-4), you could merge it against the data to fill in the missing rows. For example:

In [577]: df_pairs
Out[577]: 
  store  day
0     a    1
1     a    2
2     a    3
3     a    4
4     a    5
5     b    1
6     b    2
7     b    3
8     b    4

In [578]: df_pairs.merge(df, how='left')
Out[578]: 
  store  day  items
0     a    1      4
1     a    1      3
2     a    2      1
3     a    3      5
4     a    4      2
5     a    5      9
6     b    1      1
7     b    2      3
8     b    3    NaN
9     b    4    NaN
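Building on that merged result, a minimal sketch of the per-store-day customer count the question is after; rows that exist only in df_pairs have NaN items and therefore contribute zero:

# Count customers per (store, day); the placeholder NaN rows count as 0.
merged = df_pairs.merge(df, how='left', on=['store', 'day'])
customers = merged.groupby(['store', 'day'])['items'].count()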
Answer from Balint

I do not know the best way to store the zero values, but you can create them when aggregating:

df.pivot_table('items', 'store', 'day', fill_value=0, aggfunc='count')

or

df.groupby(['store', 'day']).count().unstack().fillna(0)
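If you would rather end up with a long (store, day) series than a wide table, a small sketch along the same lines; note this only fills in days that appear somewhere in the data:

# size() counts rows per observed group; the unstack/stack round trip
# materialises the unobserved (store, day) pairs as explicit zeros.
counts = (df.groupby(['store', 'day'])
            .size()
            .unstack(fill_value=0)
            .stack())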
Answer from valearner

You may already have an answer to your question, but in case someone else, like me, is looking for one, try:

pd.crosstab(df.store, df.day, margins=False)

This will give you a DataFrame with store as the index and day as the columns. You can do something like:

df.reset_index(level=0, inplace=True) 

to convert the index to a column, and if you have MultiIndex columns, something like:

df.columns = [''.join(col).strip() for col in df.columns.values]

to get a "flat" df.

To cross-tabulate on multiple columns at once (which is where those MultiIndex columns come from), you can do:

pd.crosstab([df.store, df.day.....], [df.store, df.day.....], margins=False)