Calculate weighted average for each day and id based on time intervals in PostgreSQL


I have a table in a PostgreSQL database that looks like this:

stid | e5   | e10  | diesel | date
-----+------+------+--------+------------------------
e850 | 1300 | 1400 | 1500   | 2016-05-02 05:30:01+02
e850 | 1400 | 1500 | 1700   | 2016-05-02 08:30:01+02
e850 | 1300 | 1400 | 1500   | 2016-05-02 21:00:01+02
e850 | 1200 | 1300 | 1350   | 2016-05-03 10:30:01+02
e850 | 1300 | 1400 | 1500   | 2016-05-03 21:00:01+02
954d | 1200 | 1100 | 1300   | 2016-05-02 03:30:01+02
954d | 1300 | 1100 | 1300   | 2016-05-02 15:00:01+02
954d | 1400 | 1800 | 1400   | 2016-05-02 22:30:01+02
954d | 1700 | 1900 | 1400   | 2016-05-03 09:30:01+02
954d | 1500 | 1900 | 1200   | 2016-05-03 23:30:01+02

So I have unique IDs (stid), prices (e5, e10, diesel) and a timestamp (date) that indicates when each price was introduced. I want to calculate the average price per day and stid, weighted by the duration for which each price was charged, and I only want to take the period between 8 am and 8 pm into account.
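For anyone who wants to reproduce this, the sample rows can be loaded with a script along the following lines. The table name mytable is the one used in the queries further down. The column types are an assumption, since the table definition is not shown, and the SET line is only there so that date_trunc('day', ...) cuts days at the +02 offset used in the data.

```sql
-- Assumption: the session time zone matches the +02 offset of the sample data.
SET timezone = 'Europe/Berlin';

-- Assumption: column types are guessed from the sample rows.
CREATE TABLE mytable (
    stid   text,
    e5     numeric,
    e10    numeric,
    diesel numeric,
    date   timestamptz
);

INSERT INTO mytable (stid, e5, e10, diesel, date) VALUES
    ('e850', 1300, 1400, 1500, '2016-05-02 05:30:01+02'),
    ('e850', 1400, 1500, 1700, '2016-05-02 08:30:01+02'),
    ('e850', 1300, 1400, 1500, '2016-05-02 21:00:01+02'),
    ('e850', 1200, 1300, 1350, '2016-05-03 10:30:01+02'),
    ('e850', 1300, 1400, 1500, '2016-05-03 21:00:01+02'),
    ('954d', 1200, 1100, 1300, '2016-05-02 03:30:01+02'),
    ('954d', 1300, 1100, 1300, '2016-05-02 15:00:01+02'),
    ('954d', 1400, 1800, 1400, '2016-05-02 22:30:01+02'),
    ('954d', 1700, 1900, 1400, '2016-05-03 09:30:01+02'),
    ('954d', 1500, 1900, 1200, '2016-05-03 23:30:01+02');
```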

To calculate the weighted average price of e5 for stid e850 and the date 2016-05-02 between 8 am and 8 pm I would do the following:

(1300 * 1801 + 1400 * 41399) / 43200 = 1395.83102

Here 1300 is the price that was set at 5:30:01 am, and 1801 is the duration in seconds between 8 am and 8:30:01 am.
1400 is the price that was set at 8:30:01 am, and 41399 is the duration in seconds between 8:30:01 am and 8 pm.
The divisor 43200 is the length of the 8 am to 8 pm window in seconds (12 * 3600).
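The same arithmetic can be checked directly in PostgreSQL. This is only a sanity check with the timestamps typed in by hand; the column aliases are made up for illustration.

```sql
SELECT
    -- seconds from 08:00:00 to 08:30:01, during which the 1300 price applied
    extract(epoch FROM timestamp '2016-05-02 08:30:01'
                     - timestamp '2016-05-02 08:00:00') AS secs_at_1300,  -- 1801
    -- seconds from 08:30:01 to 20:00:00, during which the 1400 price applied
    extract(epoch FROM timestamp '2016-05-02 20:00:00'
                     - timestamp '2016-05-02 08:30:01') AS secs_at_1400,  -- 41399
    -- duration-weighted average over the 43200-second window
    (1300 * 1801 + 1400 * 41399) / 43200.0 AS weighted_e5;               -- about 1395.83102
```

Dividing by 43200.0 rather than 43200 matters here, because with integer operands PostgreSQL would perform integer division and truncate the result.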

In the end I would like to have a table that looks like this:

stid | date       | average_e5 | average_e10 | average_diesel
-----+------------+------------+-------------+---------------
e850 | 2016-05-02 | 1395.83102 | 1495.83102  | 1691.66204
e850 | 2016-05-03 | 1220.83565 | 1320.83565  | 1381.25347
954d | 2016-05-02 | 1241.66435 | 1100        | 1300
954d | 2016-05-03 | 1662.49306 | 1887.49769  | 1400

Edit: Solution

The code below from Vao Tsun's answer does almost everything that I am looking for. However, when a day and id have no price change before 8 am, or none after 8 pm, part of the 8 am to 8 pm window is not counted and I don't get the weighted average that I am looking for. By creating dummy entries for those cases, I was able to solve this problem.

I used the following code to create a new table called mytable2, which includes the dummy entries.

DROP TABLE IF EXISTS mytable2;

CREATE TABLE mytable2 AS SELECT * FROM mytable;

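-- Dummy row at midnight: if the first price change of a day comes after 8 am,
-- carry the previous price (set on an earlier day) into that day.
WITH c AS (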
SELECT
    *,
    LAG(date) OVER(PARTITION BY stid ORDER BY date) AS lag_date,
    LAG(e5) OVER(PARTITION BY stid ORDER BY date) AS lag_e5,
    LAG(e10) OVER(PARTITION BY stid ORDER BY date) AS lag_e10,
    LAG(diesel) OVER(PARTITION BY stid ORDER BY date) AS lag_diesel
FROM mytable
)

INSERT INTO mytable2
SELECT
    stid,
    lag_e5 AS e5,
    lag_e10 AS e10,
    lag_diesel AS diesel,
    date_trunc('day', date) + '0 hours'::interval AS date
FROM c WHERE lag_date < date_trunc('day', date) + '0 hours'::interval
AND date > date_trunc('day', date) + '8 hours'::interval;

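-- Dummy row at 23:00: if the last price change of a day comes before 8 pm,
-- repeat that price late in the evening so its duration up to 8 pm is counted.
WITH d AS (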
SELECT
    *,
    LEAD(date) OVER(PARTITION BY stid ORDER BY date) AS lead_date
FROM mytable
)

INSERT INTO mytable2
SELECT
    stid,
    e5,
    e10,
    diesel,
    date_trunc('day', date) + '23 hours'::interval AS date
FROM d WHERE lead_date >= date_trunc('day', date) + '24 hours'::interval
AND date < date_trunc('day', date) + '20 hours'::interval;
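To see which dummy rows the two INSERT statements produced, a query like the following can be used. It is just a quick check and relies on mytable and mytable2 having the same columns, which holds here because mytable2 was created with SELECT *.

```sql
-- Rows that exist only in mytable2, i.e. the dummy entries added above.
SELECT * FROM mytable2
EXCEPT
SELECT * FROM mytable
ORDER BY stid, date;
```

On the sample data, and assuming the session time zone matches the +02 offset, this should return two rows dated 2016-05-03 00:00:00: one for 954d with prices 1400, 1800, 1400 and one for e850 with prices 1300, 1400, 1500. No 23:00 rows are needed for the sample, since every day already has a price change after 8 pm.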

Then I can just run the code from Vao Tsun's answer to get the desired weighted averages. I only changed mytable to mytable2, to use the table with the added dummy entries.

with a as (
select *
, case
  when date < date_trunc('day', date) + '8 hours'::interval then date_trunc('day', date) + '8 hours'::interval
  when date > date_trunc('day', date) + '20 hours'::interval then date_trunc('day', date) + '20 hours'::interval
  else date
end d
, date_trunc('day', date) dt
from mytable2
)
, b as (
select stid, e5, e10, diesel,date,d, dt
, extract(epoch from lead(d) over (partition by stid,dt order by stid,d) - d) diff
from a
)
select DISTINCT
 stid, dt,sum(e5*diff*1.0) over (partition by stid,dt)/sum(diff) over (partition by stid,dt) e5_weight_avg
from b
order by stid desc, dt;

stid |         dt          |  e5_weight_avg
-----+---------------------+-----------------
e850 | 2016-05-02 00:00:00 | 1395.83101851852
e850 | 2016-05-03 00:00:00 | 1220.83564814815
954d | 2016-05-02 00:00:00 | 1241.66435185185
954d | 2016-05-03 00:00:00 | 1662.49305555556

The code can also be found on rextester.
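The query above only returns e5. A possible way to get all three averages in one result is sketched below. It is a variation and not the code from the answer: it uses GROUP BY instead of DISTINCT with window sums, clamps with least/greatest instead of CASE, and orders by the original date so that two changes before 8 am on the same day cannot tie. The alias day and the nullif guard are my additions.

```sql
WITH a AS (
    SELECT *,
           -- clamp each timestamp into the 08:00-20:00 window of its own day
           least(greatest(date, date_trunc('day', date) + interval '8 hours'),
                 date_trunc('day', date) + interval '20 hours') AS d,
           date_trunc('day', date) AS dt
    FROM mytable2
), b AS (
    SELECT stid, dt, e5, e10, diesel,
           -- seconds until the next (clamped) change on the same day;
           -- NULL for the last row of a day, which sum() then ignores
           extract(epoch FROM lead(d) OVER (PARTITION BY stid, dt ORDER BY date) - d) AS diff
    FROM a
)
SELECT stid,
       dt::date AS day,
       -- nullif avoids a division by zero if no time inside the window is covered
       sum(e5 * diff)     / nullif(sum(diff), 0) AS average_e5,
       sum(e10 * diff)    / nullif(sum(diff), 0) AS average_e10,
       sum(diesel * diff) / nullif(sum(diff), 0) AS average_diesel
FROM b
GROUP BY stid, dt
ORDER BY stid DESC, dt;
```

Run against mytable2 with the sample data, this should give the values from the desired table in the question.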

Answer by Vao Tsun (best answer)

I used a few CTEs that are not strictly needed, to make it more readable:

t=# with a as (
select *
, case
  when date < date_trunc('day', date) + '8 hours'::interval then date_trunc('day', date) + '8 hours'::interval
  when date > date_trunc('day', date) + '20 hours'::interval then date_trunc('day', date) + '20 hours'::interval
  else date
end d
, date_trunc('day', date) dt
from mytable
)
, b as (
select stid, e5, e10, diesel,date,d, dt
, extract(epoch from lead(d) over (partition by stid,dt order by stid,d) - d) diff
from a
)
select
 stid, e5,date,d, diff,sum(e5*diff*1.0) over (partition by stid,dt)/sum(diff) over (partition by stid,dt) e5_weight_avg
from b
order by stid desc, date;
 stid |   e5    |        date         |          d          | diff  |  e5_weight_avg
------+---------+---------------------+---------------------+-------+------------------
 e850 | 1300.00 | 2016-05-02 05:30:01 | 2016-05-02 08:00:00 |  1801 | 1395.83101851852
 e850 | 1400.00 | 2016-05-02 08:30:01 | 2016-05-02 08:30:01 | 41399 | 1395.83101851852
 e850 | 1300.00 | 2016-05-02 21:00:01 | 2016-05-02 20:00:00 |       | 1395.83101851852
 e850 | 1200.00 | 2016-05-03 10:30:01 | 2016-05-03 10:30:01 | 34199 |             1200
 e850 | 1300.00 | 2016-05-03 21:00:01 | 2016-05-03 20:00:00 |       |             1200
 954d | 1200.00 | 2016-05-02 03:30:01 | 2016-05-02 08:00:00 | 25201 | 1241.66435185185
 954d | 1300.00 | 2016-05-02 15:00:01 | 2016-05-02 15:00:01 | 17999 | 1241.66435185185
 954d | 1400.00 | 2016-05-02 22:30:01 | 2016-05-02 20:00:00 |       | 1241.66435185185
 954d | 1700.00 | 2016-05-03 09:30:01 | 2016-05-03 09:30:01 | 37799 |             1700
 954d | 1500.00 | 2016-05-03 23:30:01 | 2016-05-03 20:00:00 |       |             1700
(10 rows)

and thus, skipping middle steps:

t=# with a as (
select *
, case
  when date < date_trunc('day', date) + '8 hours'::interval then date_trunc('day', date) + '8 hours'::interval
  when date > date_trunc('day', date) + '20 hours'::interval then date_trunc('day', date) + '20 hours'::interval
  else date
end d
, date_trunc('day', date) dt
from mytable
)
, b as (
select stid, e5, e10, diesel,date,d, dt
, extract(epoch from lead(d) over (partition by stid,dt order by stid,d) - d) diff
from a
)
select DISTINCT
 stid, dt,sum(e5*diff*1.0) over (partition by stid,dt)/sum(diff) over (partition by stid,dt) e5_weight_avg
from b
order by stid desc, dt;
 stid |         dt          |  e5_weight_avg
------+---------------------+------------------
 e850 | 2016-05-02 00:00:00 | 1395.83101851852
 e850 | 2016-05-03 00:00:00 |             1200
 954d | 2016-05-02 00:00:00 | 1241.66435185185
 954d | 2016-05-03 00:00:00 |             1700
(4 rows)
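In this result the values for 2016-05-03 are 1200 and 1700, because the price carried over from the previous evening is not counted between 8 am and the first change of the day. One possible way to handle that without extra rows is to treat every price as an interval from its own timestamp to the next change and intersect that interval with each day's window. The following is a sketch under that assumption, not a tested drop-in replacement; the CTE names p, days and o and the column names valid_from, valid_to, win_start, win_end and secs are made up for illustration.

```sql
WITH p AS (
    -- each price is valid from its timestamp until the next change for the same stid;
    -- the latest price has no successor, so treat it as open-ended
    SELECT stid, e5, e10, diesel,
           date AS valid_from,
           coalesce(lead(date) OVER (PARTITION BY stid ORDER BY date), 'infinity') AS valid_to
    FROM mytable
), days AS (
    -- one 08:00-20:00 window per stid and calendar day that has at least one row
    SELECT DISTINCT stid,
           date_trunc('day', date) + interval '8 hours'  AS win_start,
           date_trunc('day', date) + interval '20 hours' AS win_end
    FROM mytable
), o AS (
    -- seconds for which each price overlaps each window
    SELECT d.stid, d.win_start, p.e5, p.e10, p.diesel,
           extract(epoch FROM least(p.valid_to, d.win_end)
                            - greatest(p.valid_from, d.win_start)) AS secs
    FROM days d
    JOIN p ON p.stid = d.stid
          AND p.valid_from < d.win_end    -- price starts before the window ends
          AND p.valid_to   > d.win_start  -- and ends after the window starts
)
SELECT stid,
       win_start::date AS day,
       sum(e5 * secs)     / sum(secs) AS average_e5,
       sum(e10 * secs)    / sum(secs) AS average_e10,
       sum(diesel * secs) / sum(secs) AS average_diesel
FROM o
GROUP BY stid, win_start
ORDER BY stid DESC, win_start;
```

On the sample data the overlaps come out as 1801 and 41399 seconds for e850 on 2016-05-02 and 9001 and 34199 seconds on 2016-05-03, so the result should match the desired table in the question. Two caveats: a day on which the very first price of a stid starts after 8 am is averaged only over the covered part of the window, and days without any row in mytable do not appear at all.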