ClickHouse Materialized View consuming a lot of Memory and CPU

37 views Asked by At

TMP table definition (where we inserting all the data)

CREATE TABLE guardicore.mv_test_data_tmp
(
    `some_id` UUID,
    `count` Nullable(UInt32),
    `aggr_id` FixedString(64),
    `id` UUID,
    `start_time` DateTime64(6),
)
ENGINE = MergeTree
PARTITION BY toDate(start_time)
ORDER BY start_time
TTL toDateTime(start_time) + toIntervalHour(1)
SETTINGS allow_nullable_key = 1, index_granularity = 8192

Final table definition (for aggregated data)

CREATE TABLE guardicore.mv_test_data
(
    `some_id` UUID,
    `count` Nullable(UInt32),
    `aggr_id` FixedString(64),
    `id` UUID,
    `start_time` DateTime64(6),
)
ENGINE = SummingMergeTree
PARTITION BY toDate(start_time)
ORDER BY (start_time, aggr_id)
SETTINGS allow_nullable_key = 1, index_granularity = 8192

Materialized view for final table:

CREATE MATERIALIZED VIEW guardicore.mv_test_data_mv TO guardicore.mv_test_data
(
    `aggr_id` FixedString(64),
    `start_time` DateTime,
    `count` Nullable(UInt64),
    `some_id` UUID,
    `id` UUID,
) AS
SELECT
    aggr_id,
    toStartOfFifteenMinutes(start_time) AS start_time,
    sum(count) AS count,
    anyLast(some_id) AS some_id,
FROM guardicore.mv_test_data_tmp
GROUP BY
    aggr_id,
    start_time
ORDER BY
    aggr_id ASC,
    start_time ASC

Issue is that when MV is on and i am doing aggregation. ClickHouse consumes around 2-3 times more Memory and CPU. Am i missed something in definitions or should check some configurations? Currently all configurations a default CH configurations. Test was with 300,000,000 inserts to TMP table during 4 hours

1

There are 1 answers

1
Slach On

Nullable keys have no sense for count field just insert 0 value Nullable fields have bad performance

better use ORDER BY (aggr_id, start_time) for guardicore.mv_test_data for future selects

memory and CPU usage depends on cardinality for combinations values for aggr_id, start_time in each INSERT statement, more combinations more memory and CPU for GROUP BY and ORDER BY for processing MATERIALIZED VIEW trigger

could you share

SELECT 
 count(), 
 uniq(tuple( aggr_id, toStartOfFifteenMinutes(start_time))) AS uniq_combinations 
FROM guardicore.mv_test_data_tmp

to try to understand efficiency of your aggregation?

moreover, I would like propose use different field name instead of start_time in MV and destination table

let's try something like that

CREATE TABLE guardicore.mv_test_data_tmp
(
    `id` UUID,
    `some_id` UUID,
    `count` UInt32,
    `aggr_id` FixedString(64),
    `start_time` DateTime64(6)
)
ENGINE = MergeTree
PARTITION BY toDate(start_time)
ORDER BY start_time
TTL toDateTime(start_time) + toIntervalHour(1);

CREATE TABLE guardicore.mv_test_data
(
    `id` UUID,
    `some_id` UUID,
    `count` Nullable(UInt32),
    `aggr_id` FixedString(64),
    `start_15m` DateTime
)
ENGINE = SummingMergeTree
PARTITION BY toDate(start_15m)
ORDER BY (aggr_id, start_15m);


CREATE MATERIALIZED VIEW guardicore.mv_test_data_mv TO guardicore.mv_test_data
(
    `aggr_id` FixedString(64),
    `start_time` DateTime,
    `count` Nullable(UInt64),
    `some_id` UUID,
    `id` UUID,
) AS
SELECT
    aggr_id,
    anyLast(id) AS id,
    anyLast(some_id) AS some_id,
    toStartOfFifteenMinutes(start_time) AS start_15m,
    sum(count) AS count,
FROM guardicore.mv_test_data_tmp
GROUP BY
    aggr_id,
    start_15m
ORDER BY
    aggr_id,
    start_15m