How to select data for especific time intervals after using Pandas’ resample function?

119 views Asked by At

I used Pandas’ resample function for calculating the sales of a list of proucts every 6 months. I used the resample function for ‘6M’ and using apply({“column-name”:”sum”}).

Now I’d like to create a table with the sum of the sales for the first six months.

How can I extract the sum of the first 6 months, given that all products have records for more than 3 years, and none of them have the same start date?

Thanks in advance for any suggestions.

Here is an example of the data:

Product     Date        sales
Product 1   6/30/2017   20
            12/31/2017  60
            6/30/2018   50
            12/31/2018  100
Product 2   1/31/2017   30
            7/31/2017   150
            1/31/2018   200
            7/31/2018   300
            1/31/2019   100
1

There are 1 answers

2
Joe Ferndz On BEST ANSWER

While waiting for your data, I worked on this. See if this is something that will be helpful for you.

import pandas as pd
df = pd.DataFrame({'Date':['2018-01-10','2018-02-15','2018-03-18',
                           '2018-07-10','2018-09-12','2018-10-14',
                           '2018-11-16','2018-12-20','2019-01-10',
                           '2019-04-15','2019-06-12','2019-10-18',
                           '2019-12-02','2020-01-05','2020-02-25',
                           '2020-03-15','2020-04-11','2020-07-22'],
                   'Sales':[200,300,100,250,150,350,150,200,250,
                            200,300,100,250,150,350,150,200,250]})

#first breakdown the data by Yearly Quarters
df['YQtr'] = pd.PeriodIndex(pd.to_datetime(df.Date), freq='Q')

#next create a column to identify Half Yearly - H1 for Jan-Jun & H2 for Jul-Dec
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q1','Q2']),'HYear'] = df['YQtr'].astype(str).str[:-2]+'H1' 
df.loc[df['YQtr'].astype(str).str[-2:].isin(['Q3','Q4']),'HYear'] = df['YQtr'].astype(str).str[:-2]+'H2' 

#Do a cummulative sum on Half Year to get sales by H1 & H2 for each year
df['HYear_cumsum'] = df.groupby('HYear')['Sales'].cumsum()

#Now filter out only the rows with the max value. That's the H1 & H2 sales figure
df1 = df[df.groupby('HYear')['HYear_cumsum'].transform('max')== df['HYear_cumsum']]

print (df)
print (df1)

The output of this will be:

Source Data + Half Year cumulative sum:

          Date  Sales    YQtr   HYear  HYear_cumsum
0   2018-01-10    200  2018Q1  2018H1           200
1   2018-02-15    300  2018Q1  2018H1           500
2   2018-03-18    100  2018Q1  2018H1           600
3   2018-07-10    250  2018Q3  2018H2           250
4   2018-09-12    150  2018Q3  2018H2           400
5   2018-10-14    350  2018Q4  2018H2           750
6   2018-11-16    150  2018Q4  2018H2           900
7   2018-12-20    200  2018Q4  2018H2          1100
8   2019-01-10    250  2019Q1  2019H1           250
9   2019-04-15    200  2019Q2  2019H1           450
10  2019-06-12    300  2019Q2  2019H1           750
11  2019-10-18    100  2019Q4  2019H2           100
12  2019-12-02    250  2019Q4  2019H2           350
13  2020-01-05    150  2020Q1  2020H1           150
14  2020-02-25    350  2020Q1  2020H1           500
15  2020-03-15    150  2020Q1  2020H1           650
16  2020-04-11    200  2020Q2  2020H1           850
17  2020-07-22    250  2020Q3  2020H2           250

The half year cumulative sum for each half year.

          Date  Sales    YQtr   HYear  HYear_cumsum
2   2018-03-18    100  2018Q1  2018H1           600
7   2018-12-20    200  2018Q4  2018H2          1100
10  2019-06-12    300  2019Q2  2019H1           750
12  2019-12-02    250  2019Q4  2019H2           350
16  2020-04-11    200  2020Q2  2020H1           850
17  2020-07-22    250  2020Q3  2020H2           250

I will look at your sample data and work on it later tonight.