Finding max value prior to a change between n categories across a window for m>n changes between categories

63 views Asked by At

I've got a dataset similar to the test data below:

create table #colors (mon int, grp varchar(1), color varchar(5)) 
insert #colors values 
(201501,'A','Red'),
(201502,'A','Red'),
(201503,'A','Red'),
(201504,'A','Red'),
(201505,'A','Red'),
(201506,'A','Red'),
(201501,'B','Red'),
(201502,'B','Red'),
(201503,'B','Blue'),
(201504,'B','Blue'),
(201505,'B','Blue'),
(201506,'B','Blue'),
(201501,'C','Red'),
(201502,'C','Red'),
(201503,'C','Blue'),
(201504,'C','Green'),
(201505,'C','Green'),
(201506,'C','Green'),
(201501,'D','Red'),
(201502,'D','Red'),
(201503,'D','Blue'),
(201504,'D','Blue'),
(201505,'D','Red'),
(201506,'D','Red')

I'd like to know the path each group took in terms of color, as well the latest month a category was a particular color prior to a change in color. In this way, the month associated with a color serves as the upper temporal bound for the category-color combination.

I've tried to do this with a CTE and the row_number() function, as in the code below, but it doesn't quite work.

Here is the sample code:

; with colors (grp, color, mon, rn) as (
    select  grp
        ,   color
        ,   mon
        ,   row_number() over (partition by grp order by mon asc) rn
    from    (
        select  grp
            ,   color
            ,   max(mon) mon
        from    #colors
        group by grp, color
        ) as z
    )
    select  grp
        ,   firstColor
        ,   firstMonth
        ,   secondColor
        ,   secondMonth
        ,   thirdColor
        ,   thirdMonth
    from    (
        select  c1.grp
            ,   c1.color firstColor
            ,   c1.mon firstMonth
            ,   c2.color secondColor
            ,   c2.mon secondMonth
            ,   c3.color thirdColor
            ,   c3.mon thirdMonth
            ,   row_number() over (partition by c1.grp order by c1.mon asc) rn
        from    colors c1 left outer join colors c2 on (
                        c1.grp = c2.grp
                    and c1.color <> c2.color
                    and c1.rn = c2.rn - 1
                ) left outer join colors c3 on (
                        c1.grp = c3.grp
                    and c2.color <> c3.color
                    and c2.rn = c3.rn - 1
                )
        ) as d
    where   rn = 1
    order by grp

Which results in the following (incorrect) result set: result set

As you can see, there is no indication that group D's original color was red-- it should be Red (201502) --> Blue (201504) --> Red (201506). This is because of the use of the max() function, but removing it necessitates modifying the join logic in a way I haven't been able to deduce.

I've tried removing the max() function and changing the partition on the row_number() to include the color, but I think this reduces to the same sets logically.

How do I account for the scenario when there are fewer categories than changes between those categories?

3

There are 3 answers

2
Amit On BEST ANSWER

I'd take a different approach, and generally I'd avoid "predefining" the number of months in the columns (if possible). Here's a solution that can work for separating months into rows, but it actually then combines the result into the expected output format:

WITH nCTE (mon, grp, color, n) AS (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY mon) n
  FROM colors
), monthsCTE (mon, grp, color, n) AS (
  SELECT l.mon, l.grp, l.color, ROW_NUMBER() OVER(PARTITION BY l.grp ORDER BY l.mon) n
  FROM nCTE l LEFT JOIN nCTE r
    ON l.grp = r.grp AND l.n = r.n - 1
  WHERE l.color != r.color OR r.color IS NULL
)

SELECT m1.grp, m1.color, m1.mon, m2.color, m2.mon, m3.color, m3.mon
FROM monthsCTE m1 LEFT JOIN monthsCTE m2
  ON m1.grp = m2.grp AND m2.n = 2 LEFT JOIN monthsCTE m3
  ON m1.grp = m3.grp AND m3.n = 3
WHERE m1.n = 1
ORDER BY 1

And a fiddle

You can use the "inside" of the monthsCTE instead of the outer SELECT to get the result in separate rows (then you don't need the ROW_NUMBER... part), or leave it like this...

EDIT: It's actually easier to do what you REALLY wanted. Just remove the GROUP BY clause (and the interrupting MAX() functions).

EDIT2: As noted by Me.Name, old solution would fail over years. Corrected code fragment & fiddle.

0
Me.Name On

Using a slightly different approach, with first using the lead window function to determine if the color changed, and only then ranking the rows, based on where the color changed:

;with nextcols as
(
    select grp, color, mon, lead(color, 1, 'none') over (partition by grp order by mon  ) nextcol from #colors
)
, ranked as
(
    select *, ROW_NUMBER() over (partition by grp order by mon) MonthIndex from nextcols where color <> nextcol
) 
--perhaps you could go pivoting here, but joining on the monthindex works
select r1.grp, r1.color firstCol, r1.mon firstMon, r2.color secondCol, r2.mon secondMon, r3.color thirdCol, r3.mon thirdMon
from ranked r1
left join ranked r2 on r2.grp=r1.grp and r2.MonthIndex = 2
left join ranked r3 on r3.grp=r1.grp and r3.MonthIndex = 3
where r1.MonthIndex = 1

Fiddle

0
Philip Devine On

EDIT - It works now!

WITH colors AS(
    SELECT *
        , ROW_NUMBER() OVER (partition by grp ORDER BY mon desc) RowNumberOrder

    FROM colorss
)
select * from (
SELECT row_number() over (partition by c1.grp order by c1.rowNumberOrder asc) rn, c1.grp
            ,   c1.color firstColor
            ,   c1.mon firstMonth
            ,   c2.color secondColor
            ,   c2.mon secondMonth
            ,   c3.color thirdColor
            ,   c3.mon thirdMonth
FROM colors c1 
  left join colors c2 on c1.grp = c2.grp and c1.RowNumberOrder < c2.rowNumberOrder and c1.color <> c2.color
  left join colors c3 on c3.grp = c2.grp and c2.RowNumberOrder < c3.rowNumberOrder and c2.color <> c3.color
  ) a where rn = 1

SQL fiddle: http://sqlfiddle.com/#!3/e0d90/36