Python Pandas DataFrame resample daily data to week by Mon-Sun weekly definition?

25k views Asked by At
import pandas as pd
import numpy as np

dates = pd.date_range('20141229',periods=14, name='Day')
df = pd.DataFrame({'Sum1': [1667, 1229, 1360, 9232, 8866, 4083, 3671, 10085, 10005, 8730, 10056, 10176, 3792, 3518],
                   'Sum2': [91, 75, 75, 254, 239, 108, 99, 259, 395, 355, 332, 386, 96, 111],
                   'Sum3': [365.95, 398.97, 285.12, 992.17, 1116.57, 512.11, 504.47, 1190.96, 1753.6, 1646.25, 1344.05, 1582.67, 560.95, 736.44],
                   'Sum4': [5, 5, 1, 5, 8, 8, 2, 10, 12, 16, 16, 6, 6, 3]},index=dates); print(df)

The df produced looks like this:

             Sum1  Sum2     Sum3  Sum4
Day                                   
2014-12-29   1667    91   365.95     5
2014-12-30   1229    75   398.97     5
2014-12-31   1360    75   285.12     1
2015-01-01   9232   254   992.17     5
2015-01-02   8866   239  1116.57     8
2015-01-03   4083   108   512.11     8
2015-01-04   3671    99   504.47     2
2015-01-05  10085   259  1190.96    10
2015-01-06  10005   395  1753.60    12
2015-01-07   8730   355  1646.25    16
2015-01-08  10056   332  1344.05    16
2015-01-09  10176   386  1582.67     6
2015-01-10   3792    96   560.95     6
2015-01-11   3518   111   736.44     3

Let's say I resample the Dataframe to try and sum the daily data into weekly rows:

df_resampled = df.resample('W', how='sum', label='left'); print(df_resampled)

This produces the following:

             Sum1  Sum2     Sum3  Sum4
Day                                   
2014-12-28  30108   941  4175.36    34
2015-01-04  56362  1934  8814.92    69

Question 1: my definition of a week is Mon - Sun. Since my data starts on 2014-12-29 (a Monday), I want my Day label to also start on that day. How would I make the Day index label be the date of every Monday instead of every Sunday?

Desired Output:

             Sum1  Sum2     Sum3  Sum4
Day                                   
2014-12-29  30108   941  4175.36    34
2015-01-05  56362  1934  8814.92    69

What have I tried regarding Question 1?

I changed 'W' to 'W-MON' but it produced 3 rows by counting 2014-12-29 in 2014-12-22 row which is not what I want:

             Sum1  Sum2     Sum3  Sum4
Day                                   
2014-12-22   1667    91   365.95     5
2014-12-29  38526  1109  5000.37    39
2015-01-05  46277  1675  7623.96    59

Question 2: how would I format the Day index label to look like a range? Ex:

                         Sum1  Sum2     Sum3  Sum4
Day                                   
2014-12-29 - 2015-01-04  30108   941  4175.36    34
2015-01-05 - 2015-01-11  56362  1934  8814.92    69
3

There are 3 answers

1
John Cummings On BEST ANSWER

In case anyone else was not aware, it turns out that the weekly Anchored Offsets are based on the end date. So, just resampling 'W' (which is the same as 'W-SUN') is by default a Monday to Sunday sample. The date listed is the end date. See this old bug report wherein neither the documentation nor the API got updated.

Given that you specified label='left' in the resample parameters, you must have realized that fact. It's also why using 'W-MON' does not have the desired effect. What is confusing is that the left bound is not actually in the interval.

So, to display the start date for the period instead of the end date, you may add a day to the index. That would mean you would do:

df_resampled.index = df_resampled.index + pd.DateOffset(days=1)

For completeness, here is your original data with another day (a Sunday) added on the beginning to show the grouping really is Monday to Sunday:

import pandas as pd
import numpy as np

dates = pd.date_range('20141228',periods=15, name='Day')
df = pd.DataFrame({'Sum1': [10000, 1667, 1229, 1360, 9232, 8866, 4083, 3671, 10085, 10005, 8730, 10056, 10176, 3792, 3518],
               'Sum2': [10000, 91, 75, 75, 254, 239, 108, 99, 259, 395, 355, 332, 386, 96, 111],
               'Sum3': [10000, 365.95, 398.97, 285.12, 992.17, 1116.57, 512.11, 504.47, 1190.96, 1753.6, 1646.25, 1344.05, 1582.67, 560.95, 736.44],
               'Sum4': [10000, 5, 5, 1, 5, 8, 8, 2, 10, 12, 16, 16, 6, 6, 3]},index=dates);
print(df)
df_resampled = df.resample('W', how='sum', label='left')
df_resampled.index = df_resampled.index - pd.DateOffset(days=1)
print(df_resampled)

This outputs:

             Sum1   Sum2      Sum3   Sum4
Day
2014-12-28  10000  10000  10000.00  10000
2014-12-29   1667     91    365.95      5
2014-12-30   1229     75    398.97      5
2014-12-31   1360     75    285.12      1
2015-01-01   9232    254    992.17      5
2015-01-02   8866    239   1116.57      8
2015-01-03   4083    108    512.11      8
2015-01-04   3671     99    504.47      2
2015-01-05  10085    259   1190.96     10
2015-01-06  10005    395   1753.60     12
2015-01-07   8730    355   1646.25     16
2015-01-08  10056    332   1344.05     16
2015-01-09  10176    386   1582.67      6
2015-01-10   3792     96    560.95      6
2015-01-11   3518    111    736.44      3

             Sum1   Sum2      Sum3   Sum4
Day                                      
2014-12-22  10000  10000  10000.00  10000
2014-12-29  30108    941   4175.36     34
2015-01-05  56362   1934   8814.92     69

I believe that is what you wanted for Question 1.

Update

There is now a loffset argument to resample() that allows you to shift the label offset. So, instead of modifying the index, you simple add the loffset argument like so:

df.resample('W', how='sum', label='left', loffset=pd.DateOffset(days=1))

Also of note how=sum is now deprecated in favor of using .sum() on the Resampler object that .resample() returns. So, the fully updated call would be:

df_resampled = df.resample('W', label='left', loffset=pd.DateOffset(days=1)).sum()

Update 1.1.0

The handy loffset argument is deprecated as of version 1.1.0. The documentation indicates the shifting should be done after the resample. In this particular case, I believe that means this is the correct code (untested):

from pandas.tseries.frequencies import to_offset
df_resampled = df.resample('W', label='left').sum()
df_resampled.index = df_resampled.index + to_offset(pd.DateOffset(days=1))
0
Jianxun Li On

This might help.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 1000, (100, 4)), columns='Sum1 Sum2 Sum3 Sum4'.split(), index=pd.date_range('2014-12-29', periods=100, freq='D'))

def func(group):
    return pd.Series({'Sum1': group.Sum1.sum(), 'Sum2': group.Sum2.sum(),
        'Sum3': group.Sum3.sum(), 'Sum4': group.Sum4.sum(), 'Day': group.index[1], 'Period': '{0} - {1}'.format(group.index[0].date(), group.index[-1].date())})

df.groupby(lambda idx: idx.week).apply(func)

Out[386]: 
          Day                   Period  Sum1  Sum2  Sum3  Sum4
1  2014-12-30  2014-12-29 - 2015-01-04  3559  3692  3648  4086
2  2015-01-06  2015-01-05 - 2015-01-11  2990  3658  3348  3304
3  2015-01-13  2015-01-12 - 2015-01-18  3168  3720  3518  3273
4  2015-01-20  2015-01-19 - 2015-01-25  2275  4968  4095  2366
5  2015-01-27  2015-01-26 - 2015-02-01  4146  2167  3888  4576
..        ...                      ...   ...   ...   ...   ...
11 2015-03-10  2015-03-09 - 2015-03-15  4035  3518  2588  2714
12 2015-03-17  2015-03-16 - 2015-03-22  3399  3901  3430  2143
13 2015-03-24  2015-03-23 - 2015-03-29  3227  3308  3185  3814
14 2015-03-31  2015-03-30 - 2015-04-05  4278  3369  3623  4167
15 2015-04-07  2015-04-06 - 2015-04-07  1466   632  1136  1392

[15 rows x 6 columns]
1
xiaotong xu On

Great question.

df_resampled = df.resample('W-MON', label='left', closed='left').sum()

The parameter closed could work for your question.