pandas group by year, id and do some statistics per id? with multiindex

652 views Asked by At

I have a dataframe that is the child of a merge operation on two dataframes. I end up with a multi-index that looks like (timestamp,id) and for the sake of argument, a single column X.

I would like to do several statistics on X by year, and by ID. Instead of posting all the crazy errors that I am getting trying to blindly solve this problem, I ask instead "how would you do this?"

There is one row of X per id, per period (daily). I want to aggregate to an annual period.

2

There are 2 answers

0
jezrael On

I think you can use groupby with resample and aggregate e.g. sum, but need pandas 0.18.1:

start = pd.to_datetime('2016-12-28')
rng = pd.date_range(start, periods=10)
df = pd.DataFrame({'timestamp': rng, 'X': range(10), 
                   'id': ['a'] * 3 + ['b'] * 3 + ['c'] * 4 })  
df = df.set_index(['timestamp','id'])
print (df)
               X
timestamp  id   
2016-12-28 a   0
2016-12-29 a   1
2016-12-30 a   2
2016-12-31 b   3
2017-01-01 b   4
2017-01-02 b   5
2017-01-03 c   6
2017-01-04 c   7
2017-01-05 c   8
2017-01-06 c   9

df = df.reset_index(level='id')
print (df.groupby('id').resample('A')['X'].sum())
id  timestamp 
a   2016-12-31     3
b   2016-12-31     3
    2017-12-31     9
c   2017-12-31    30
Name: X, dtype: int32

Another solution is use get_level_values with groupby:

print (df.X.groupby([df.index.get_level_values('timestamp').year,
                     df.index.get_level_values('id')])
           .sum())
      id
2016  a      3
      b      3
2017  b      9
      c     30
Name: X, dtype: int32
3
Ted Petrou On

If you want to ensure that the groups happen together then you must place all the groups in the groupby.

Assuming your timestamp is in the left outer group, the following should work.

df.groupby([pd.TimeGrouper('A', level=0), pd.Grouper(level='id')])['X'].sum()