Timescaledb Hierarchical continuous aggregates generate duplicated time bucket

189 views Asked by At

I'm new in Timescaledb and I'm exploring the continuous aggregates. I use the docker image timescale/timescaledb-ha:pg14 (PostgreSQL 14 with Timescaledb v2.11.1)

I have historical data which are half-hourly values for several devices and I want to aggregate these values in hourly values and then in yearly values. I've created a hypertable to store half-hourly data. Then, I created a first continuous aggregate for the hourly values and on top a second continuous aggregate for the yearly values. Everything seems to be going well but at the end for the yearly aggregates I have a duplicate time bucket for each device for the last year.

Create hypertable

CREATE TABLE observation_hah_timescaledb (
    device_id integer NOT NULL,
    observation_timestamp TIMESTAMPTZ NOT NULL,
    value double precision
);

SELECT create_hypertable('observation_hah_timescaledb','observation_timestamp');

Insert data

I insert a data set from 2018 to 2022 in the hypertable. For a historical reason, the timestamp of these data points to the end of the time period. So I have to shift the data by half an hour. I replace the empty values (-9999) with NULL values.

INSERT INTO observation_hah_timescaledb (
                                                observation_timestamp,
                                                device_id,
                                                value
)
SELECT
    date_time - INTERVAL '30 min',
    device_id,
    NULLIF(value, -9999)
FROM observation_hah
WHERE
    date_time >= '2018-01-01 00:00:00';

Continuous aggregate 1 (hourly)

With the half-hourly values, I aggregate data by hour.

CREATE MATERIALIZED VIEW observation_hourly_cont_agg
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1hour', observation_timestamp) AS time_bucket,
    device_id,
    AVG(value) AS hourly_value,
    count(value)/2 AS data_acq
FROM
    observation_hah_timescaledb
GROUP BY
    time_bucket('1hour', observation_timestamp), device_id;

Select data

I select the data for one device and for the last year and I get 8760 rows (24*365). The aggregation values are correct.

SELECT
    *
FROM observation_hourly_cont_agg
WHERE
    device_id=6627
AND
    time_bucket>='2022-01-01';

Continuous aggregate 2 (yearly)

I create an continuous aggregate for the yearly values on top of the first continuous aggregate (hourly values).

CREATE MATERIALIZED VIEW observation_yearly_cont_agg
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1year', observation_hourly_cont_agg.time_bucket) AS time_bucket,
    device_id,
    AVG(hourly_value) AS avg_year_value,
    COUNT(observation_hourly_cont_agg.time_bucket),
    MIN(observation_hourly_cont_agg.time_bucket),
    MAX(observation_hourly_cont_agg.time_bucket)

FROM
    observation_hourly_cont_agg
WHERE
    data_acq = 1
GROUP BY
    time_bucket('1year', observation_hourly_cont_agg.time_bucket), device_id
;

Select data

I select the data for the device. These values are correct, but I get an extra row for the last year for each device. This last row includes only a part of the data from september to december.

SELECT
    *
FROM observation_yearly_cont_agg
WHERE
    device_id=6627;
time_bucket device_id avg_year_value count min max
2018-01-01 00:00:00.000000 +00:00 6627 11.25464132507152 8392 2018-01-01 03:00:00.000000 +00:00 2018-12-31 23:00:00.000000 +00:00
2019-01-01 00:00:00.000000 +00:00 6627 9.57086852388762 7933 2019-01-01 00:00:00.000000 +00:00 2019-12-31 23:00:00.000000 +00:00
2020-01-01 00:00:00.000000 +00:00 6627 8.461547559760978 8032 2020-01-01 01:00:00.000000 +00:00 2020-12-31 23:00:00.000000 +00:00
2021-01-01 00:00:00.000000 +00:00 6627 9.730555202642625 7871 2021-01-01 00:00:00.000000 +00:00 2021-12-31 23:00:00.000000 +00:00
2022-01-01 00:00:00.000000 +00:00 6627 9.444980719794348 7780 2022-01-01 00:00:00.000000 +00:00 2022-12-31 23:00:00.000000 +00:00
2022-01-01 00:00:00.000000 +00:00 6627 9.19524472741742 2513 2022-09-04 00:00:00.000000 +00:00 2022-12-31 23:00:00.000000 +00:00

Am I doing something wrong?

I found a post on the timescaledb forum that describes similar behavior: Continuous Aggregate gets wrong data and duplicated time bucket after refreshing

When I refresh the continuous aggregates nothing happens.

CALL refresh_continuous_aggregate('observation_hourly_cont_agg', '2021-01-01', '2023-01-01');
CALL refresh_continuous_aggregate('observation_yearly_cont_agg', '2021-01-01', '2023-01-01');

Can anyone help me?

1

There are 1 answers

1
chdegrave On

I finally found a solution.

First, I tried to create continuous aggregates with no data but the views were populated with data.

CREATE MATERIALIZED VIEW observation_hourly_cont_agg
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1hour', observation_timestamp) AS time_bucket,
    device_id,
    AVG(value) AS hourly_value,
    count(value)/2 AS data_acq
FROM
    observation_hah_timescaledb
GROUP BY
    time_bucket('1hour', observation_timestamp), device_id
WITH NO DATA;

So, I have looked at this question: TimescaleDB Continuous Aggregates WITH NO DATA not working

It is suggested to disable the real time aggregation.

I re-created the continuous aggregates:

CREATE MATERIALIZED VIEW observation_hourly_cont_agg
WITH (timescaledb.continuous, timescaledb.materialized_only=True) AS
SELECT
    time_bucket('1hour', observation_timestamp) AS time_bucket,
    device_id,
    AVG(value) AS hourly_value,
    count(value)/2 AS data_acq
FROM
    observation_hah_timescaledb
GROUP BY
    time_bucket('1hour', observation_timestamp), device_id
WITH NO DATA;
CREATE MATERIALIZED VIEW observation_yearly_cont_agg
WITH (timescaledb.continuous, timescaledb.materialized_only=True) AS
SELECT
    time_bucket('1year', observation_hourly_cont_agg.time_bucket) AS time_bucket,
    device_id,
    AVG(hourly_value) AS avg_year_value,
    COUNT(observation_hourly_cont_agg.time_bucket),
    MIN(observation_hourly_cont_agg.time_bucket),
    MAX(observation_hourly_cont_agg.time_bucket)

FROM
    observation_hourly_cont_agg
WHERE
    data_acq = 1
GROUP BY
    time_bucket('1year', observation_hourly_cont_agg.time_bucket), device_id
WITH NO DATA;

I populated them:

CALL refresh_continuous_aggregate('observation_hourly_cont_agg', '2017-01-01', '2023-01-01');
CALL refresh_continuous_aggregate('observation_yearly_cont_agg', '2017-01-01', '2023-01-01');

Now, I'm getting what I expected:

SELECT
    *
FROM observation_yearly_cont_agg
WHERE
    device_id=6627
ORDER BY
    time_bucket;
time_bucket device_id avg_year_value count min max
2018-01-01 00:00:00.000000 +00:00 6627 11.254641325071463 8392 2018-01-01 03:00:00.000000 +00:00 2018-12-31 23:00:00.000000 +00:00
2019-01-01 00:00:00.000000 +00:00 6627 9.570868523887526 7933 2019-01-01 00:00:00.000000 +00:00 2019-12-31 23:00:00.000000 +00:00
2020-01-01 00:00:00.000000 +00:00 6627 8.461547559761014 8032 2020-01-01 01:00:00.000000 +00:00 2020-12-31 23:00:00.000000 +00:00
2021-01-01 00:00:00.000000 +00:00 6627 9.730555202642627 7871 2021-01-01 00:00:00.000000 +00:00 2021-12-31 23:00:00.000000 +00:00
2022-01-01 00:00:00.000000 +00:00 6627 9.444980719794358 7780 2022-01-01 00:00:00.000000 +00:00 2022-12-31 23:00:00.000000 +00:00

Pay attention to your version of the timescale (Timescale doc - Real time aggregates): "In Timescale versions 1.7 to 2.12, real time aggregates are enabled by default; when you create a continuous aggregate view, queries to that view include the most recent data, even if it has not yet been aggregated. In Timescale 2.13 and later real time aggregates are DISABLED by default."