Flink SQL initial load in batch mode and then switch to streaming mode


I have an application based on Flink SQL 1.17.1 that reads data from Kafka topics.

CREATE TEMPORARY VIEW distinct_table1 AS
SELECT *
FROM (SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_date desc) AS rownum
FROM table1)
WHERE rownum = 1;


CREATE TEMPORARY VIEW distinct_table2 AS
SELECT *
FROM (SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_date desc) AS rownum
FROM table2)
WHERE rownum = 1;
..
..
..
CREATE TEMPORARY VIEW distinct_table12 AS
SELECT *
FROM (SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_date desc) AS rownum
FROM table12)
WHERE rownum = 1;



INSERT INTO enriched_tables
SELECT t1.col1, t1.col2, .., t1.coln, t2.col1, t2.col2, .., t2.coln, ... t12.col1, .., t12.coln
FROM distinct_table1 t1
 INNER JOIN distinct_table2 t2 ON t1.id = t2.t1_id
 INNER JOIN distinct_table3 t3 ON t2.id = t3.t2_id
 ..
 ..
 INNER JOIN distinct_table12 t12 ON t11.id = t12.t11_id

Note:

  • Because of business logic constraints, these joins cannot be changed to temporal joins, so we cannot define a state TTL; as a result the job keeps a large, effectively unbounded state (a reference sketch of how TTL would normally be set follows below).
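For reference, this is roughly how state TTL would normally be bounded in Flink SQL; it is not applicable here because rows expired from state could still be needed by later join input, and the 12 h value is illustrative only:

-- For reference only: bounding join state with a TTL (not usable in this pipeline).
-- Rows dropped from state after the TTL could still be matched by future records.
SET 'table.exec.state.ttl' = '12 h';  -- illustrative value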

Question #1: Assuming all the data is already in the Kafka topics, can I start this application in batch mode first to establish the initial state, and then restart it from a savepoint in streaming mode (roughly as sketched below)? Does batch mode actually store a checkpoint/savepoint of the bounded offsets?
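To make the intended handoff concrete, here is a minimal sketch assuming the Flink SQL Client, with execution.runtime-mode and execution.savepoint.path set there; the savepoint path is a placeholder, and whether a batch run actually leaves behind such a savepoint/checkpoint to restore from is exactly what I am asking:

-- Phase 1 (sketch): run the pipeline as a bounded job over the data already in the topics
SET 'execution.runtime-mode' = 'batch';
-- ... execute the CREATE TEMPORARY VIEW and INSERT INTO statements above ...

-- Phase 2 (sketch): restart the same pipeline in streaming mode,
-- resuming from a savepoint/checkpoint of phase 1, if batch mode produces one
SET 'execution.runtime-mode' = 'streaming';
SET 'execution.savepoint.path' = '/path/to/savepoint';  -- hypothetical path
-- ... execute the same INSERT INTO statement again ...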

Question #2: About 3,000+ clients will stream CDC changes plus an initial snapshot into the 12 CDC topics above, and they will be onboarded gradually. For large clients (about 700 million records in the largest table), it currently takes about 6 hours to produce the final enriched topic/table. Should we expect improved performance with this approach?
