I've got a dataset similar to the test data below:
create table #colors (mon int, grp varchar(1), color varchar(5))
insert #colors values
(201501,'A','Red'),
(201502,'A','Red'),
(201503,'A','Red'),
(201504,'A','Red'),
(201505,'A','Red'),
(201506,'A','Red'),
(201501,'B','Red'),
(201502,'B','Red'),
(201503,'B','Blue'),
(201504,'B','Blue'),
(201505,'B','Blue'),
(201506,'B','Blue'),
(201501,'C','Red'),
(201502,'C','Red'),
(201503,'C','Blue'),
(201504,'C','Green'),
(201505,'C','Green'),
(201506,'C','Green'),
(201501,'D','Red'),
(201502,'D','Red'),
(201503,'D','Blue'),
(201504,'D','Blue'),
(201505,'D','Red'),
(201506,'D','Red')
I'd like to know the path each group took in terms of color, as well the latest month a category was a particular color prior to a change in color. In this way, the month associated with a color serves as the upper temporal bound for the category-color combination.
I've tried to do this with a CTE and the row_number()
function, as in the code below, but it doesn't quite work.
Here is the sample code:
; with colors (grp, color, mon, rn) as (
select grp
, color
, mon
, row_number() over (partition by grp order by mon asc) rn
from (
select grp
, color
, max(mon) mon
from #colors
group by grp, color
) as z
)
select grp
, firstColor
, firstMonth
, secondColor
, secondMonth
, thirdColor
, thirdMonth
from (
select c1.grp
, c1.color firstColor
, c1.mon firstMonth
, c2.color secondColor
, c2.mon secondMonth
, c3.color thirdColor
, c3.mon thirdMonth
, row_number() over (partition by c1.grp order by c1.mon asc) rn
from colors c1 left outer join colors c2 on (
c1.grp = c2.grp
and c1.color <> c2.color
and c1.rn = c2.rn - 1
) left outer join colors c3 on (
c1.grp = c3.grp
and c2.color <> c3.color
and c2.rn = c3.rn - 1
)
) as d
where rn = 1
order by grp
Which results in the following (incorrect) result set:
As you can see, there is no indication that group D's original color was red-- it should be Red (201502) --> Blue (201504) --> Red (201506). This is because of the use of the max()
function, but removing it necessitates modifying the join logic in a way I haven't been able to deduce.
I've tried removing the max()
function and changing the partition on the row_number()
to include the color, but I think this reduces to the same sets logically.
How do I account for the scenario when there are fewer categories than changes between those categories?
I'd take a different approach, and generally I'd avoid "predefining" the number of months in the columns (if possible). Here's a solution that can work for separating months into rows, but it actually then combines the result into the expected output format:
And a fiddle
You can use the "inside" of the monthsCTE instead of the outer
SELECT
to get the result in separate rows (then you don't need theROW_NUMBER...
part), or leave it like this...