I am new to data engineering but an experienced developer. I am currently scoping this use case:
- Having "almost" real-time replication between Postgres (v15) and Delta.io tables (https://docs.delta.io/latest/index.html)
We are looking for an open-source/free solution for the first POC.
What we have drawn up so far is:
PostgreSQL -> writes WAL log -> read by the Debezium PostgreSQL connector -> sent to Kafka -> consumed by Spark Structured Streaming -> populates Delta.io tables
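For concreteness, here is a minimal sketch of the Spark side of that pipeline as we imagine it. It assumes Debezium's JSON converter with `schemas.enable=false` (so each Kafka value is a plain `{before, after, op, ts_ms, ...}` envelope); the broker address, topic name, paths, and the example `id`/`name` columns are placeholders, not our real setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (
    SparkSession.builder
    .appName("cdc-to-delta-poc")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Debezium envelope fields we care about; the 'after' payload must mirror
# the source table (two example columns shown here as placeholders).
payload_schema = StructType([
    StructField("op", StringType()),    # c = create, u = update, d = delete
    StructField("ts_ms", LongType()),
    StructField("after", StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ])),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "dbserver1.public.customers")    # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Parse the JSON envelope and flatten the 'after' image into columns.
events = (
    raw.select(from_json(col("value").cast("string"), payload_schema).alias("e"))
    .select("e.op", "e.ts_ms", "e.after.*")
)

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/customers")  # placeholder
    .outputMode("append")
    .start("/tmp/delta/customers")                               # placeholder
)
query.awaitTermination()
```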
The issue with that architecture is that Delta.io expects strict schema validation: https://docs.delta.io/latest/delta-batch.html#schema-validation
We expect schema changes in the Postgres sources, such as column renames and new columns.
How can we make these schema changes automatic? Is there a tool in Spark or Apache Airflow that can apply the schema change automatically via DML, or a good way to write the Spark code that issues the DDL? (https://docs.delta.io/latest/delta-batch.html#update-table-schema) A sketch of what we are considering is below.
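For new columns specifically, we are considering Delta's `mergeSchema` option (from the update-table-schema doc linked above). A minimal sketch, reusing the `events` stream from the earlier snippet; two caveats we are aware of: this only helps if the incoming DataFrame actually carries the new column (e.g. Avro with a schema registry rather than our hard-coded JSON schema), and a column rename in Postgres will surface in Delta as a newly added column while the old one fills with nulls, not as an in-place rename:

```python
# Let Delta add new columns to the table schema on write instead of
# failing schema validation.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/customers")  # placeholder
    .option("mergeSchema", "true")  # auto-add new columns to the target table
    .outputMode("append")
    .start("/tmp/delta/customers")                               # placeholder
)

# For MERGE INTO-based upserts there is also a session-level switch:
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```

Is this the right direction, or is there a more standard tool for propagating the DDL?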
I have read the docs and am looking for advice.