How to identify and aggregate sequence from start and end dates

1k views Asked by At

I'm trying to identify a consecutive sequence in dates, per person, as well as sum amount for that sequence. My records table looks like this:

person   start_date   end_date     amount
1        2015-09-10   2015-09-11   500
1        2015-09-11   2015-09-12   100
1        2015-09-13   2015-09-14   200
1        2015-10-05   2015-10-07   2000
2        2015-10-05   2015-10-05   300
2        2015-10-06   2015-10-06   1000
3        2015-04-23   2015-04-23   900

The resulting query should be this:

person   sequence_start_date   sequence_end_date     amount
1        2015-09-10            2015-09-14            800
1        2015-10-05            2015-10-07            2000
2        2015-10-05            2015-10-06            1400
3        2015-04-23            2015-04-23            900

Below, I can use LAG and LEAD to identify the sequence start_date and end_date, but I don't have a way to aggregate the amount. I'm assuming the answer will involve some sort of ROW_NUMBER() window function that will partition by sequence, I just can't figure out how to make the sequence identifiable to the function.

SELECT
 person
 ,COALESCE(sequence_start_date, LAG(sequence_start_date, 1) OVER (ORDER BY person, start_date)) AS "sequence_start_date"
 ,COALESCE(sequence_end_date, LEAD(sequence_end_date, 1) OVER (ORDER BY person, start_date)) AS "sequence_end_date"
FROM
(
 SELECT
  person
  ,start_date
  ,end_date
  ,CASE WHEN LAG(end_date, 1) OVER (PARTITION BY person ORDER BY start_date) + interval '1 day' = start_date
   THEN NULL
   ELSE start_date
  END AS "sequence_start_date"
  ,CASE WHEN LEAD(start_date, 1) OVER (PARTITION BY person ORDER BY start_date) - interval '1 day' = end_date
   THEN NULL
   ELSE end_date
  END AS "sequence_end_date"
  ,amount
 FROM records
) sq
2

There are 2 answers

0
John Bollinger On BEST ANSWER

Even your updated (sub)query still isn't quite right for the data you've presented, which is inconsistent about whether the start date of the second and subsequent rows in a sequence should be equal to their previous rows' end date or one day later. The query can be updated pretty easily to accommodate both, if that's needed.

In any case, you cannot use COALESCE as a window function. Aggregate functions may be used as window functions by providing an OVER clause, but not ordinary functions. There are nevertheless ways to apply window function to this task. Here's a way to identify the sequences in your data (as presented):

SELECT
  person
  ,MAX(sequence_start_date)
     OVER (
       PARTITION BY person
       ORDER BY start_date
       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
     AS "sequence_start_date"
  ,MIN(sequence_end_date)
     OVER (
       PARTITION BY person
       ORDER BY start_date
       ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
     AS "sequence_end_date"
  ,amount
FROM
(
 SELECT
  person
  ,start_date
  ,end_date
  ,CASE WHEN LAG(end_date, 1) OVER (PARTITION BY person ORDER BY start_date) + interval '1 day' >= start_date
   THEN date '0001-01-01'
   ELSE start_date
   END AS "sequence_start_date"
  ,CASE WHEN LEAD(start_date, 1) OVER (PARTITION BY person ORDER BY start_date) - interval '1 day' <= end_date
   THEN NULL
   ELSE end_date
   END AS "sequence_end_date"
  ,amount
 FROM records
 order by person, start_date
) sq_part
ORDER BY person, sequence_start_date

That relies on MAX() and MIN() instead of COALESCE(), and it applies window framing to get the appropriate scope for each of those within each partition. Results:

person  sequence_start_date         sequence_end_date           amount
1       September, 10 2015 00:00:00 September, 12 2015 00:00:00 500
1       September, 10 2015 00:00:00 September, 12 2015 00:00:00 100
1       October, 05 2015 00:00:00   October, 07 2015 00:00:00   2000
2       October, 05 2015 00:00:00   October, 06 2015 00:00:00   300
2       October, 05 2015 00:00:00   October, 06 2015 00:00:00   1000
3       April, 23 2015 00:00:00     April, 23 2015 00:00:00     900

Do note that that does not require an exact match of end date with subsequent start date; all rows for each person that abut or overlap will be assigned to the same sequence. If (person, start_date) cannot be relied upon to be unique, however, then you probably need to order the partitions by end date as well.

And now you have a way to identify the sequences: they are characterized by the triple person, sequence_start_date, sequence_end_date. (Or actually, you need only the person and one of those dates for identification purposes, but read on.) You can wrap the above query as an inline view of an outer aggregate query to produce your desired result:

SELECT
  person,
  sequence_start_date,
  sequence_end_date,
  SUM(amount) AS "amount"
FROM ( <above query> ) sq
GROUP BY person, sequence_start_date, sequence_end_date

Of course you need both dates as grouping columns if you're going to select them.

0
JohnHC On

Why not:

select a1.person, a1.sequence_start_date, a1.sequence_end_date, 
       sum(rx.amount) 
         as amount
from (EXISTING_QUERY) a1
left join records rx 
  on rx.person = a1.person 
  and rx.start_date >= a1.start_date
  and rx.end_date <= a1.end_date
group by a1.person, a1.sequence_start_date, a1.sequence_end_date