We are doing a POC with ksqlDB and have a few doubts.
I have a Kafka topic named USERPROFILE which has around 100 million unique records and a 10-day retention policy. This topic continuously receives INSERT/UPDATE-type events in real time from its underlying RDBMS table.
The following is the structure of a record received on this Kafka topic:
{"userid":1001,"firstname":"Hemant","lastname":"Garg","countrycode":"IND","rating":3.7}
1.) We have created a stream over this topic:
CREATE STREAM userprofile_stream (userid INT, firstname VARCHAR, lastname VARCHAR, countrycode VARCHAR, rating DOUBLE) WITH (KAFKA_TOPIC = 'USERPROFILE', VALUE_FORMAT = 'JSON');
2.) Because there can be updates for a given userid and we want only the latest record for each userid, we have also created a table over the same topic:
CREATE TABLE userprofile_table (userid VARCHAR PRIMARY KEY, firstname VARCHAR, lastname VARCHAR, countrycode VARCHAR, rating DOUBLE) WITH (KAFKA_TOPIC = 'USERPROFILE', VALUE_FORMAT = 'JSON');
Our questions are:
Does it take extra space on disk to create the KTable? For example, if the Kafka topic has 100 million records, would the same records also be present in the KTable, or is it just a virtual view over the underlying Kafka topic?
The same question for the stream we have created: does it take extra space on disk (on the broker servers) to create the KStream, or is it just a virtual view over the underlying Kafka topic?
Say we received a record with userid 1001 on 1st May; on 11th May that record would no longer be available on the Kafka topic, but would it still be present in the KStream / KTable? Is there a retention policy for the KStream / KTable as well, like there is for the topic itself?
Answers would be highly appreciated.
-- Best aditya
The ksqlDB server is powered by Kafka Streams. As a result, when you create a stream or a table, the server creates a KStream or a KTable, respectively.
On top of that, KStreams and KTables are backed by topics in Kafka. As a result, creating streams and tables on a ksqlDB server will create actual topics on your Kafka cluster.
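You can see this from the ksqlDB CLI. A minimal sketch (the exact output depends on your setup):

SHOW STREAMS;  -- streams you have declared
SHOW TABLES;   -- tables you have declared
SHOW QUERIES;  -- persistent queries backing derived streams and tables
SHOW TOPICS;   -- Kafka topics visible to ksqlDB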
Having said that, the streams and tables in ksqlDB are materialized on demand and fairly well optimized. These two articles from Confluent give more insight into the internal behaviour, with good visuals.
You can even take a look at the created data yourself. For the sake of example, I created:
- MESSAGES_STREAM, a stream from the original topic
- MATERIALIZED_MESSAGES_STREAM, a stream created from the stream above
- MESSAGES, a table created from the first stream
Here are the creation commands for reference:
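The exact commands are not reproduced here; a minimal sketch of what they could look like, assuming a source topic named MESSAGES with JSON values and hypothetical id and msg fields:

-- 1) Stream declared over the existing source topic (no new topic is created for this one)
CREATE STREAM MESSAGES_STREAM (id INT, msg VARCHAR)
  WITH (KAFKA_TOPIC = 'MESSAGES', VALUE_FORMAT = 'JSON');

-- 2) Derived stream: this persistent query writes its results to a new backing topic
CREATE STREAM MATERIALIZED_MESSAGES_STREAM AS
  SELECT * FROM MESSAGES_STREAM;

-- 3) Table aggregated from the first stream: its state is maintained in a compacted changelog topic
CREATE TABLE MESSAGES AS
  SELECT id, LATEST_BY_OFFSET(msg) AS msg
  FROM MESSAGES_STREAM
  GROUP BY id;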
By looking at the stream's details in ksqlDB, we can see that the first stream uses the original topic as its source.
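The original output is not shown here, but you can check this yourself from the CLI (syntax varies slightly across ksqlDB versions):

DESCRIBE EXTENDED MESSAGES_STREAM;  -- reports the underlying Kafka topic, the key/value formats, and the queries reading from or writing to the stream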
And looking at the topics declared on our cluster, we can see the second stream, the table, and its changelog topic created under the hood. You can also see that the retention policies differ between a stream and a table: the former deletes old records, while the latter compacts the data.
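As a sketch, you can confirm the cleanup policies from the Kafka side with the standard Kafka CLI tools (the broker address below is an assumption, and ksqlDB's internal topics are usually prefixed with _confluent-ksql-<service id>):

kafka-topics --bootstrap-server localhost:9092 --list
kafka-topics --bootstrap-server localhost:9092 --describe --topic <changelog-topic-name>
# A table's changelog topic shows cleanup.policy=compact in its configs, while a stream-backing topic keeps the default delete policy.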
TL;DR, to come back to your questions: