Why use Kafka to store CDC data instead of consuming it directly with Spark?


I want to consume CDC data from multiple data sources, for example Cassandra, MySQL, Oracle, etc. I have gone through some documentation on streaming CDC data to Kafka and storing it in topics. I was wondering whether I could write Spark programs that consume the data directly from the source, instead of first pushing the data into Kafka topics and then having a Spark program connect to those topics to consume the messages. Here are the questions I am trying to answer:

  1. What is the benefit of putting Kafka in between, rather than consuming the changed records directly from Spark?
  2. Won't having Kafka in the middle add some latency to the system?

1 answer

Erick Ramirez

You certainly can write your own Spark apps that consume the data directly from each source, but doing so feels like reinventing the wheel. Kafka already solves this problem, so you don't have to.

In addition, Kafka supports taking input from various sources as well as publishing the data to multiple subscribers, including Spark apps.
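To make the "many sources in, many subscribers out" point concrete, here is a minimal toy model in plain Python. It is not Kafka's API: `ChangeLog`, `publish`, `poll`, and the consumer names are all hypothetical, and the whole thing is just an in-memory list standing in for one topic. It only illustrates the idea that each subscriber tracks its own read position in a shared, append-only log.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Toy stand-in for a single topic: an append-only list of change events."""
    events: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next index to read

    def publish(self, event: dict) -> None:
        # Producers (one per source database) only ever append.
        self.events.append(event)

    def poll(self, consumer: str, max_events: int = 100) -> list:
        # Each consumer resumes from its own offset, so a slow or restarted
        # consumer does not affect what the other consumers see.
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:start + max_events]
        self.offsets[consumer] = start + len(batch)
        return batch


log = ChangeLog()
log.publish({"source": "mysql", "op": "insert", "key": 1})
log.publish({"source": "cassandra", "op": "update", "key": 7})

# Two independent subscribers read the same events at their own pace.
spark_batch = log.poll("spark-job")                   # reads both events
audit_batch = log.poll("audit-service", max_events=1)  # reads only the first

log.publish({"source": "oracle", "op": "delete", "key": 3})
spark_next = log.poll("spark-job")      # only the new Oracle event
audit_next = log.poll("audit-service")  # catches up on the remaining two
```

If Spark read from each database directly, every additional subscriber would need its own connection and its own change-tracking logic per source; with a shared log in the middle, each source is captured once and every subscriber just keeps an offset.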

With Kafka, it's easier to build apps since there are connectors available for most technologies. Cheers!