Splitting one row with date field into multiple rows with specified quarter dates


I have a DataFrame with a start date and an end date per row. I want to split each row into multiple rows, one per predefined quarter that the date range covers. The predefined quarters (irrespective of year) are:

Q1: Apr-Jun
Q2: Jul-Sep
Q3: Oct-Dec
Q4: Jan-Mar

Each row has to be split between its start and end dates, with the breaks falling on the predefined quarter boundaries.

Input DataFrame:

Pol_num  start_date  end_date
p1       2019-05-12  2020-05-11
p2       2018-11-28  2019-07-29
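
For reproducibility, the input could be built roughly like this. This is only a sketch: it assumes the dates arrive as strings, which may not match the real data.

```python
import pandas as pd

# Sample input copied from the table above; dates are kept as strings here
# (an assumption) and would need pd.to_datetime before any date arithmetic.
df = pd.DataFrame({
    'Pol_num': ['p1', 'p2'],
    'start_date': ['2019-05-12', '2018-11-28'],
    'end_date': ['2020-05-11', '2019-07-29'],
})
```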

The output I want is below:

Pol_num  Quarter_start_date  Quarter_end_date  Quarter
p1       2019-05-12          2019-06-30        Q1
p1       2019-07-01          2019-09-30        Q2
p1       2019-10-01          2019-12-31        Q3
p1       2020-01-01          2020-03-31        Q4
p1       2020-04-01          2020-05-11        Q1
p2       2018-11-28          2018-12-31        Q3
p2       2019-01-01          2019-03-31        Q4
p2       2019-04-01          2019-06-30        Q1
p2       2019-07-01          2019-07-29        Q2

Can anyone help with this?

Answered by mozway

One option is to generate all the quarter-end dates with date_range, explode them into one row each, then post-process the output to compute Quarter_start_date and Quarter, and to replace the last Quarter_end_date of each group with the real end_date:

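import pandas as pd

# ensure datetime
df[['start_date', 'end_date']] = (df[['start_date', 'end_date']]
                                  .apply(pd.to_datetime)
                                  )

out = (
    # one list of quarter-end dates per row; QuarterEnd(0) rolls end_date forward
    # to its quarter end so the last, partial quarter is included
    # (freq='Q' is spelled 'QE' from pandas 2.2 on)
    df.assign(Quarter_end_date=[pd.date_range(start, end + pd.offsets.QuarterEnd(0),
                                              freq='Q')
                                for start, end in zip(df['start_date'],
                                                      df['end_date'])])
      # one row per quarter end; the original index value is repeated
      .explode('Quarter_end_date')
      .assign(
          # previous quarter end + 1 day, or start_date on the first row of a group
          Quarter_start_date=lambda d: d['Quarter_end_date']
          .groupby(level=0).shift()
          .add(pd.Timedelta('1d'))
          .fillna(d['start_date']),
          # keep the quarter end, except on the last row of a group: use end_date
          Quarter_end_date=lambda d: d['Quarter_end_date']
          .where(d.index.duplicated(keep='last'), d['end_date']),
          # calendar quarter number (Q1 = Jan-Mar)
          Quarter=lambda d: 'Q' + d['Quarter_end_date'].dt.quarter.astype(str),
      )
      [['Pol_num', 'Quarter_start_date', 'Quarter_end_date', 'Quarter']]
)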

Output:

  Pol_num Quarter_start_date Quarter_end_date Quarter
0      p1         2019-05-12       2019-06-30      Q2
0      p1         2019-07-01       2019-09-30      Q3
0      p1         2019-10-01       2019-12-31      Q4
0      p1         2020-01-01       2020-03-31      Q1
0      p1         2020-04-01       2020-05-11      Q2
1      p2         2018-11-28       2018-12-31      Q4
1      p2         2019-01-01       2019-03-31      Q1
1      p2         2019-04-01       2019-06-30      Q2
1      p2         2019-07-01       2019-07-29      Q3
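
Note that the Quarter labels in this output are calendar quarters (Q1 = Jan-Mar), whereas the question numbers them from April (Q1 = Apr-Jun). A minimal sketch of one way to remap them, assuming a column of quarter-end dates like the one above; the names `quarter_end`, `calendar_q`, `fiscal_q` and `labels` are illustrative only:

```python
import pandas as pd

# Illustrative quarter-end dates, one per calendar quarter.
quarter_end = pd.Series(pd.to_datetime(
    ['2019-06-30', '2019-09-30', '2019-12-31', '2020-03-31']))

# Calendar quarter: 1 = Jan-Mar, 2 = Apr-Jun, 3 = Jul-Sep, 4 = Oct-Dec.
calendar_q = quarter_end.dt.quarter

# Rotate so that Apr-Jun becomes 1 and Jan-Mar becomes 4.
fiscal_q = (calendar_q + 2) % 4 + 1

labels = 'Q' + fiscal_q.astype(str)
```

Applied to the pipeline above, the same expression could replace the `Quarter=` line, i.e. `'Q' + ((d['Quarter_end_date'].dt.quarter + 2) % 4 + 1).astype(str)`.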

NB. you could also start by repeating the rows with:

n = (df['end_date'].dt.to_period('Q')
     .sub(df['start_date'].dt.to_period('Q'))
     .apply(lambda x: x.n).add(1)
    )

out = df.loc[df.index.repeat(n)]

Then compute the start/end/quarter by shifting the dates with an increasing number of QuarterEnd offsets. However, since neither the addition of QuarterEnd nor the conversion of period differences to integers is vectorized, this probably won't give any performance benefit.
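
For completeness, here is a rough sketch of that repeat-based variant. It is not the code from the answer: instead of QuarterEnd offsets and periods it swaps in plain integer arithmetic on a consecutive quarter index, which keeps every step vectorized. The helpers `quarter_index` and `quarter_start` are hypothetical names, and the Quarter labels follow the question's numbering (Q1 = Apr-Jun).

```python
import pandas as pd

df = pd.DataFrame({
    'Pol_num': ['p1', 'p2'],
    'start_date': pd.to_datetime(['2019-05-12', '2018-11-28']),
    'end_date': pd.to_datetime(['2020-05-11', '2019-07-29']),
})

def quarter_index(s):
    """Number calendar quarters consecutively: year * 4 + (0..3)."""
    return s.dt.year * 4 + (s.dt.month - 1) // 3

def quarter_start(q):
    """First day of the calendar quarter with consecutive index q."""
    parts = pd.DataFrame({'year': q // 4, 'month': (q % 4) * 3 + 1, 'day': 1})
    return pd.to_datetime(parts)

# number of quarters touched by each row
n = quarter_index(df['end_date']) - quarter_index(df['start_date']) + 1

# repeat each row n times, then number the copies 0, 1, 2, ... per original row
out = df.loc[df.index.repeat(n)]
step = out.groupby(level=0).cumcount().to_numpy()
out = out.reset_index(drop=True)

# quarter covered by each repeated row
q = quarter_index(out['start_date']) + step
q_start = quarter_start(q)
q_end = quarter_start(q + 1) - pd.Timedelta(days=1)

out = out.assign(
    # clip the quarter boundaries to the policy's own start and end dates
    Quarter_start_date=q_start.where(q_start > out['start_date'], out['start_date']),
    Quarter_end_date=q_end.where(q_end < out['end_date'], out['end_date']),
    # question's numbering: Apr-Jun is Q1, Jan-Mar is Q4
    Quarter='Q' + ((q + 3) % 4 + 1).astype(str),
)[['Pol_num', 'Quarter_start_date', 'Quarter_end_date', 'Quarter']]
```

On the sample data this yields the nine rows requested in the question. Whether it is actually faster than the date_range/explode approach would need to be measured on real data.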