Reading Data Incrementally from S3 in Delta Format


I am working on a project where data is stored on Amazon S3 in Delta format, and I need to read it incrementally. I am running into difficulties implementing this and would appreciate guidance or insights from the community. My current approach is to parse the transaction-log JSON metadata to find out which data at the location has been modified.
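For context on that approach: each commit to a Delta table is recorded as a line-delimited JSON file under `_delta_log/` in the table directory, named by the zero-padded version number (`00000000000000000000.json`, `00000000000000000001.json`, ...), and each line is an action such as `add` or `remove` describing a data file. A minimal Java sketch of scanning one commit file's contents for added files follows; the class and method names are my own, and the string scan is only illustrative (real code should use a proper JSON parser such as Jackson):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: inspect one Delta transaction-log commit (_delta_log/<version>.json)
// to find the data files it added. Names are illustrative, not a real API.
public class DeltaLogSketch {

    // Commit file names are the table version zero-padded to 20 digits.
    public static String commitFileName(long version) {
        return String.format("%020d.json", version);
    }

    // Extract the "path" value from every line that is an "add" action.
    public static List<String> addedPaths(String commitFileContents) {
        List<String> paths = new ArrayList<>();
        for (String line : commitFileContents.split("\n")) {
            if (!line.contains("\"add\"")) continue;   // skip remove/metadata actions
            int key = line.indexOf("\"path\":\"");
            if (key < 0) continue;
            int start = key + "\"path\":\"".length();
            int end = line.indexOf('"', start);
            if (end > start) paths.add(line.substring(start, end));
        }
        return paths;
    }
}
```

With this layout, "incremental" reading amounts to remembering the last version you processed and checking S3 for `_delta_log/` + `commitFileName(lastVersion + 1)` on each poll.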

What I've Tried:

Delta Lake Documentation: I have read the Delta Lake documentation to understand best practices for reading data incrementally. However, there is no concrete information about reading Delta-format data stored on S3 (or other file sources), although there is a lot about Delta SQL.

I expect to retrieve data incrementally from the S3 location in Delta format. Ideally, I would like suggestions on implementing this scenario.

Environment Details:

Delta Lake Version: 1.0.0
AWS SDK/Library Version: 1.11.375
Programming Language: Java
Spark Version: 3.1.2


1 Answer

Answered by Kashyap

You can either use Change Data Feed (CDF), enabled on the table with:

CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES (delta.enableChangeDataFeed = true)

and then read the changes with:

val df = spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 0)
  .table("student")
or use a streaming read of the table, which tracks its own progress between runs:

val df = spark.readStream.format("delta")
  .load("/tmp/delta/events")

Equivalently, using the io.delta.implicits helper:

import io.delta.implicits._
val df = spark.readStream.delta("/tmp/delta/events")
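Since the asker is on Java, here is roughly what the streaming read above looks like with Spark's Java API. This is a sketch, not a tested setup for the asker's versions: the S3 path, bucket, and checkpoint location are placeholders, and the checkpoint is what makes the read incremental across restarts (each micro-batch picks up only files added since the last committed version).

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalDeltaRead {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("incremental-delta-read")
            .getOrCreate();

        // Streaming read of the Delta table; path is a placeholder.
        Dataset<Row> events = spark.readStream()
            .format("delta")
            .load("s3a://my-bucket/delta/events");

        // Progress (last processed table version) is persisted in the
        // checkpoint, so restarts resume where the previous run stopped.
        events.writeStream()
            .format("console")
            .option("checkpointLocation", "s3a://my-bucket/checkpoints/events")
            .start()
            .awaitTermination();
    }
}
```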