I have an application based on Flink SQL (version 1.17.1) that reads data from Kafka topics.
CREATE TEMPORARY VIEW distinct_table1 AS
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_date DESC) AS rownum
      FROM table1)
WHERE rownum = 1;
CREATE TEMPORARY VIEW distinct_table2 AS
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_date DESC) AS rownum
      FROM table2)
WHERE rownum = 1;
..
..
..
CREATE TEMPORARY VIEW distinct_table12 AS
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_date DESC) AS rownum
      FROM table12)
WHERE rownum = 1;
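For reference, each view above follows Flink's standard deduplication pattern (keep the latest row per key). A compact sketch of the same pattern that projects only the columns needed downstream, which keeps the per-key state smaller (the column name `payload_col` is hypothetical):

```sql
-- Sketch only: deduplicate table1 while carrying just the join key and
-- the columns actually consumed downstream, instead of SELECT *.
CREATE TEMPORARY VIEW distinct_table1 AS
SELECT id, payload_col, change_date
FROM (SELECT id, payload_col, change_date,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_date DESC) AS rownum
      FROM table1)
WHERE rownum = 1;
```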
INSERT INTO enriched_tables
SELECT t1.col1, t1.col2, .., t1.coln, t2.col1, t2.col2, .., t2.coln, ... t12.col1, .., t12.coln
FROM distinct_table1 t1
INNER JOIN distinct_table2 t2 ON t1.id = t2.t1_id
INNER JOIN distinct_table3 t3 ON t2.id = t3.t2_id
..
..
INNER JOIN distinct_table12 t12 ON t11.id = t12.t11_id
Note:
- Because of a business logic constraint, this can't be changed to a temporal join; as a result we can't define a state TTL, so the job carries large, effectively unbounded state.
Question#1: Assuming I have all the data in the Kafka topics, can I start this application in batch mode first to establish the initial state, and then restart it from a savepoint in streaming mode? Does batch mode store a checkpoint/savepoint at the bounded offsets?
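For context, the mode switch I have in mind would look roughly like this in the Flink SQL Client (a sketch; the savepoint path is a placeholder):

```sql
-- Bootstrap run over the bounded Kafka data:
SET 'execution.runtime-mode' = 'batch';
-- ... run the views and the INSERT INTO job ...

-- Later, in a new session, resume in streaming mode:
SET 'execution.runtime-mode' = 'streaming';
SET 'execution.savepoint.path' = 'hdfs:///savepoints/savepoint-xxxx';
```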
Question#2: About 3,000+ clients will stream CDC + an initial snapshot into the 12 CDC topics above, and they will be onboarded gradually. For large clients (about 700 million records in the largest table), it takes about 6 hours to produce the final enriched topic/table. Should we expect improved performance with this approach?
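As background for the performance question: for dedup/join-heavy pipelines like this one, Flink's mini-batch options can reduce the volume of intermediate retraction records flowing between the operators. A hedged sketch (values are illustrative, not tuned):

```sql
-- Sketch: buffer input records briefly and compact intermediate changes,
-- trading a little latency for less state churn downstream.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '5 s';
SET 'table.exec.mini-batch.size' = '5000';
```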