Any help on a more precise title for this question is welcome.
I have a pandas DataFrame with customer-level observations that record a date and the items consumed by the customer on that date. It looks like this:
df
store day items
a 1 4
a 1 3
a 2 1
a 3 5
a 4 2
a 5 9
b 1 1
b 2 3
Each observation in this data set pertains to a unique store-day combination, BUT each store-day observation is listed conditional on a positive number of items consumed, i.e. df['items'] > 0 for every store-day pair.
So I do not have, for example
b 3 0
b 4 0
b 5 0
etc.
I need to group this dataframe by store and day, and then run some operations on all observations in each store-day group.
But I want these store-day groups to exist with length 0 (empty sets), and I am not sure of the best way to do this. This is a very simple toy dataset; the real one is very large.
I don't really want to add in the observations BEFORE using df.groupby(['store', 'day']), because I run OTHER calculations on each store-day group that use the length of each group as a measure of the number of customers recorded in a specific store and day. Thus, if I add in those observations b 3 and b 4, then it looks like there were 2 customers who visited store b on days 3 and 4, when there were not (each bought nothing at store b on days 3 and 4).
The 'pandas' way of representing those would probably be to code it as missing data, like:
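A minimal sketch of that coding, assuming the placeholder rows are store b on days 3 and 4 from the question and that the column names match the toy data. It also contrasts count() with size() on the grouped items column, since only the former ignores the NaN placeholders.

```python
import numpy as np
import pandas as pd

# Toy data from the question, plus placeholder rows for store b on days 3 and 4.
# Coding items as NaN (rather than 0) marks them as "no customer observed".
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2, 3, 4],
    "items": [4, 3, 1, 5, 2, 9, 1, 3, np.nan, np.nan],
})

grouped = df.groupby(["store", "day"])["items"]

# count() skips NaN, so the placeholder groups report 0 customers.
customers = grouped.count()

# size() counts rows, so it reports 1 for those same placeholder groups.
rows = grouped.size()
```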
Then, in your aggregation to count customers, you could use count, which excludes missing values.

EDIT:
In terms of adding missing values, here are a couple of thoughts. Say you have a DataFrame that contains just the missing pairs, like this:
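Such a frame might look like the sketch below. The name missing and the choice of pairs (store b, days 3 and 4) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame holding only the store-day pairs absent from the data,
# with items coded as NaN so they are not mistaken for real customers.
missing = pd.DataFrame({
    "store": ["b", "b"],
    "day": [3, 4],
    "items": [np.nan, np.nan],
})
```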
Then you could just append these to your existing DataFrame to fill in the missing rows, like this:
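One way to do the appending, sketched under the same assumptions (the names missing and filled are illustrative). It uses pd.concat because DataFrame.append was removed in pandas 2.0; on older versions df.append(missing) would do the same job.

```python
import numpy as np
import pandas as pd

# Observed data from the question (only rows with items > 0).
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})

# Hypothetical frame of missing store-day pairs, coded as NaN.
missing = pd.DataFrame({
    "store": ["b", "b"],
    "day": [3, 4],
    "items": [np.nan, np.nan],
})

# Stack the two frames; ignore_index=True rebuilds a clean 0..n-1 index.
filled = pd.concat([df, missing], ignore_index=True)
```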
Alternatively, if you have a DataFrame with the pairs you 'should' have (a 1-5, b 1-4), you could merge that against the data to fill in the missing rows. For example:
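A minimal sketch of the merge approach, assuming the expected pairs are exactly those named above (a 1-5, b 1-4). The names expected and filled are illustrative. A left merge keeps every expected pair and leaves items as NaN where the data has no match.

```python
import pandas as pd

# Observed data from the question (only rows with items > 0).
df = pd.DataFrame({
    "store": ["a", "a", "a", "a", "a", "a", "b", "b"],
    "day":   [1, 1, 2, 3, 4, 5, 1, 2],
    "items": [4, 3, 1, 5, 2, 9, 1, 3],
})

# Pairs that "should" exist: store a on days 1-5, store b on days 1-4.
days_by_store = {"a": range(1, 6), "b": range(1, 5)}
expected = pd.DataFrame(
    [(store, day) for store, days in days_by_store.items() for day in days],
    columns=["store", "day"],
)

# Left merge: every expected pair survives; unmatched pairs get items = NaN.
# Pairs with several customers (a, day 1) still produce one row per customer.
filled = expected.merge(df, on=["store", "day"], how="left")

# Customers per store-day, with 0 for the pairs that had no observations.
customers = filled.groupby(["store", "day"])["items"].count()
```

Because count() ignores NaN, the group lengths used as customer counts stay correct for the observed pairs, and the filled-in pairs show up with 0.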