Generate absent dates with the closest values

84 views Asked by At

There is a table (mileage_event) which shows mileage of all cars. Contains the following columns: created_at - date, car_id - car id, car_mileage - mileage of a car

It is worth considering that not all days have data. For example:

car_id created_at car_mileage
d016e61c5326-f246894a 2023-05-22 100000
d016e61c5326-f246894a 2023-05-26 105000
d016e61c5326-f246894a 2023-05-27 110000

As you can see, there are no rows for this car between 2023-05-22 and 2023-05-26 (i.e. 3 days are absent). It means that the car hadn't been on drive for these days.

What do I want: to generate the missing days and add the nearest mileage value to them.

(Note: it's Exasol database)
As a result:

car_id created_at car_mileage
d016e61c5326-f246894a 2023-05-22 100000
d016e61c5326-f246894a 2023-05-23 100000
d016e61c5326-f246894a 2023-05-24 100000
d016e61c5326-f246894a 2023-05-25 100000
d016e61c5326-f246894a 2023-05-26 105000
d016e61c5326-f246894a 2023-05-27 110000

Thanks in advance!

UPD: I could be able to come up with this but it seems I'm wrong somewhere and do not see an error:

WITH date_series AS (select add_days(current_date - interval '2' year, level - 1) as dates
                from dual
                connect by level <= days_between(current_date,current_date - interval '2' year) + 1
                order by local.dates) --to constantly get only last 2 years    
,
combined AS (
    SELECT
        m.car_id as car_id,
        d.dates,
        COALESCE(m.car_mileage,
                 LAG(m.car_mileage, 1) OVER (PARTITION BY m.car_id ORDER BY d.dates)) AS car_mileage
    FROM date_series d
    LEFT JOIN mileage_event m ON d.dates = m.created_at
)
SELECT * FROM combined ORDER BY car_id, dates

And still I get this: | car_id | created_at | car_mileage | | --- | --- | --- | | d016e61c5326-f246894a | 2023-05-22 | 100000 | | null | 2023-05-23 | null | | null | 2023-05-24 | null | | null | 2023-05-25 | null | | d016e61c5326-f246894a | 2023-05-26 | 105000 | | d016e61c5326-f246894a | 2023-05-27 | 110000 |

I get that it's left join so it obviously cannot find the right values (and shows null instead).

2

There are 2 answers

3
Mr.Jones On

Ah, always good to see someone asking a legitimate question and people commenting "You should google it".

You can achieve this by leveraging common table expressions (CTEs), window functions, and joining with a series of dates. The general strategy is to:

Generate a series of dates spaning the range you're interested in. For each car_id, use a LEFT JOIN to combine this date series with your existing mileage data. Employ window functions to fill in the missing mileage data. Here is how i would do it:

-- Create a CTE with a series of dates. Adjust the start and end dates as needed.
WITH date_series AS (
    SELECT CAST('2023-05-22' AS DATE) + ROW_NUMBER() OVER() - 1 AS created_at
    FROM (SELECT 1 FROM DUAL CONNECT BY LEVEL <= 30) -- Generating 30 days, adjust this number as needed
),

--  LEFT JOIN the date series with your mileage data and use LAG() to fill in gaps.
combined AS (
    SELECT 
        m.car_id,
        d.created_at,
        COALESCE(m.car_mileage, 
                 LAG(m.car_mileage, 1) OVER (PARTITION BY m.car_id ORDER BY d.created_at)) AS car_mileage
    FROM date_series d
    LEFT JOIN mileage_event m ON d.created_at = m.created_at
)

-- Get result
SELECT * FROM combined WHERE car_id IS NOT NULL ORDER BY car_id, created_at

I hope that answer your question and works. Let me know if you need more help.

UPDATE

The issue is probably the LAG function in the combined CTE. When you are using the LAG function within the LEFT JOIN context, it will not have a car_id value to partition by for the rows that have a null car_mileage. This is why the LAG function doesn't return a value for these rows.

WITH date_series
AS (
    SELECT add_days(CURRENT_DATE - interval '2' year, LEVEL - 1) AS dates
    FROM dual CONNECT BY LEVEL <= days_between(CURRENT_DATE, CURRENT_DATE - interval '2' year) + 1
    ORDER BY dates
    )
    ,joined
AS (
    SELECT m.car_id AS car_id
        ,d.dates
        ,m.car_mileage
                                          
                     
    FROM date_series d
    LEFT JOIN mileage_event m ON d.dates = m.created_at
    )
SELECT car_id
    ,dates AS created_at
    ,COALESCE(car_mileage, LAG(car_mileage, 1) OVER (
            PARTITION BY car_id ORDER BY dates
            )) AS car_mileage
FROM joined
ORDER BY car_id
    ,created_at;
1
Thorsten Kettner On

I don't know Exasol syntax. In SQL you'll usually fill gaps in a series with a recursive query. All you'll do is take each row, look at their following row, if there is any, so you know how many days to add. Then add as many rows as you need.

The following is Oracle syntax. In Exasol you may have to add the keyword RECURSIVE or change the interval to something like interval '1 days' or the like. You can probably find this in the docs.

with cte (car_id, created_at, car_mileage, next_created_at) as
(
  select
    car_id, created_at, car_mileage,
    lead(created_at) over (partition by car_id order by created_at) as next_created_at
  from mytable
  union all
  select car_id, created_at + interval '1' day, car_mileage, next_created_at
  from cte
  where created_at + interval '1' day < next_created_at
)
select car_id, created_at, car_mileage
from cte
order by car_id, created_at;

Demo: https://dbfiddle.uk/61_Ht136