Full Outer Self join with data for different date

114 views Asked by At

I want to compare values with values from following day (and also want to see which color is new or didn't appear).

I've did full outer self join and replaced null values in the section on the right of 'is_matched'. Is_matched is showing us if join worked or section on the right would be null without coalesce.

The only thing is very last column 'this_is_not_working'. It should have total_colours value for 'date_local2', not for 'date_local' and I couldn't figure out how to replace all nulls with values in 'this_is_not_working' column. I've tried with window functions and intervals but didn't really work.

I have used Postgres to create this db fiddle but I'm using Presto.

select * from colours
date_local | colour | amount
:--------- | :----- | -----:
2020-01-01 | white  |     10
2020-01-01 | white  |     10
2020-01-01 | green  |     20
2020-01-01 | white  |     10
2020-01-01 | red    |     25
2020-01-01 | white  |     10
2020-01-02 | pink   |     15
2020-01-02 | pink   |     15
2020-01-02 | pink   |     15
2020-01-02 | pink   |     15
2020-01-02 | white  |     10
2020-01-02 | white  |     10
2020-01-02 | white  |     10
2020-01-02 | white  |     10
2020-01-02 | white  |     10
2020-01-03 | pink   |     15
2020-01-03 | pink   |     15
2020-01-03 | pink   |     15
2020-01-03 | green  |     20
2020-01-03 | green  |     20
2020-01-03 | green  |     20
with a as( 
    select
        *
        ,sum(colours) over(partition by date_local) as total_colour
    from ( 
        select 
            date_local
            ,colour
            ,count(colour) as colours 
            ,sum(amount) as amount 
        from colours
        group by 1,2
    ) as fr_om 
) 

select
    a.*
    ,b.date_local as is_matched
    ,coalesce(b.date_local, a.date_local + interval '1' day) as date_local_2
    ,coalesce(b.colour, a.colour) as colour_2
    ,coalesce(b.colours, 0) as colour_2
    ,coalesce(b.amount, 0) as amount_2
    ,coalesce(b.colours - a.colours, a.colours) as colour_difference
    ,coalesce(b.amount - a.amount, a.amount) as amount_difference
    ,b.total_colour as this_is_not_working
from a 
full outer join a as b 
    on a.date_local = b.date_local - interval '1' day 
    and a.colour = b.colour
order by 1
date_local | colour | colours | amount | total_colour | is_matched | date_local_2        | colour_2 | colour_2 | amount_2 | colour_difference | amount_difference | this_is_not_working
:--------- | :----- | ------: | -----: | -----------: | :--------- | :------------------ | :------- | -------: | -------: | ----------------: | ----------------: | ------------------:
2020-01-01 | red    |       1 |     25 |            6 | null       | 2020-01-02 00:00:00 | red      |        0 |        0 |                 1 |                25 |                null
2020-01-01 | green  |       1 |     20 |            6 | null       | 2020-01-02 00:00:00 | green    |        0 |        0 |                 1 |                20 |                null
2020-01-01 | white  |       4 |     40 |            6 | 2020-01-02 | 2020-01-02 00:00:00 | white    |        5 |       50 |                 1 |                10 |                   9
2020-01-02 | pink   |       4 |     60 |            9 | 2020-01-03 | 2020-01-03 00:00:00 | pink     |        3 |       45 |                -1 |               -15 |                   6
2020-01-02 | white  |       5 |     50 |            9 | null       | 2020-01-03 00:00:00 | white    |        0 |        0 |                 5 |                50 |                null
2020-01-03 | pink   |       3 |     45 |            6 | null       | 2020-01-04 00:00:00 | pink     |        0 |        0 |                 3 |                45 |                null
2020-01-03 | green  |       3 |     60 |            6 | null       | 2020-01-04 00:00:00 | green    |        0 |        0 |                 3 |                60 |                null
null       | null   |    null |   null |         null | 2020-01-02 | 2020-01-02 00:00:00 | pink     |        4 |       60 |              null |              null |                   9
null       | null   |    null |   null |         null | 2020-01-01 | 2020-01-01 00:00:00 | white    |        4 |       40 |              null |              null |                   6
null       | null   |    null |   null |         null | 2020-01-03 | 2020-01-03 00:00:00 | green    |        3 |       60 |              null |              null |                   6
null       | null   |    null |   null |         null | 2020-01-01 | 2020-01-01 00:00:00 | green    |        1 |       20 |              null |              null |                   6
null       | null   |    null |   null |         null | 2020-01-01 | 2020-01-01 00:00:00 | red      |        1 |       25 |              null |              null |                   6
1

There are 1 answers

0
GMB On

I don't think that you need a full join for this. Window functions can get the job done:

select 
    date_local,
    colour,
    no_colour,
    sum_amount,
    total_colour,
    is_matched,
    case when is_matched = 1
        then lead(date_local) over(partition by colour order by date_local)
    end date_local_2,
    case when is_matched = 1
        then lead(colour) over(partition by colour order by date_local) 
    end colour_2,
    case when is_matched = 1 
        then lead(no_colour) over(partition by colour order by date_local) 
    end no_colour_2,
    case when is_matched = 1 
        then lead(sum_amount) over(partition by colour order by date_local) 
    end sum_amount_2,
    case when is_matched = 1 
        then lead(total_colour) over(partition by colour order by date_local) 
    end total_colour_2  
from (
    select
        date_local,
        colour,
        count(*) no_colour,
        sum(amount) sum_amount,
        case when lead(date_local) over(partition by colour order by date_local) 
            = date_local + interval '1' day
            then 1
        end is_matched,
        sum(count(*)) over(partition by date_local) total_colour
    from colours
    group by date_local, colour
) t
order by date_local, colour

The inner query aggregates by day and color, and computes group-level metrics as well as the overall count of records per day; it also sets a flag that indicates if an "adjacent" record exist on the next day for the same color.

Then, the outer query uses window function lead() to recover the values from the adjacent row.

In your db fiddle, this yields:

date_local | colour | no_colour | sum_amount | total_colour | is_matched | date_local_2 | colour_2 | no_colour_2 | sum_amount_2 | total_colour_2
:--------- | :----- | --------: | ---------: | -----------: | ---------: | :----------- | :------- | ----------: | -----------: | -------------:
2020-01-01 | green  |         1 |         20 |            6 |       null | null         | null     |        null |         null |           null
2020-01-01 | red    |         1 |         25 |            6 |       null | null         | null     |        null |         null |           null
2020-01-01 | white  |         4 |         40 |            6 |          1 | 2020-01-02   | white    |           5 |           50 |              9
2020-01-02 | pink   |         4 |         60 |            9 |          1 | 2020-01-03   | pink     |           3 |           45 |              6
2020-01-02 | white  |         5 |         50 |            9 |       null | null         | null     |        null |         null |           null
2020-01-03 | green  |         3 |         60 |            6 |       null | null         | null     |        null |         null |           null
2020-01-03 | pink   |         3 |         45 |            6 |       null | null         | null     |        null |         null |           null