Delta Lake: How to time travel infinitely across dataset versions?


The Use Case:

  • Store versions of Large Datasets (CSV/Snowflake Tables) and query across versions

The Delta Lake documentation says that unless we run the vacuum command, we retain historical information in a Delta table, and that log files are deleted every 30 days (here).

Additional documentation states that we need both the log files and the data files to time travel (here).

Does this imply that we can only time travel back 30 days?

But isn't Delta a file format? How would it automatically delete its logs?

If yes, what other open source options can solve querying across dataset versions?


There are 2 answers

Siddharth Singh (Best Answer)

Just set the data and log retention settings to a very long period.

alter table delta.`/path/to/table` set TBLPROPERTIES (
  'delta.logRetentionDuration'='interval 36500000 days',
  'delta.deletedFileRetentionDuration'='interval 36500000 days'
)

The same statement issued from PySpark:

spark.sql("alter table delta.`{table_path}` set TBLPROPERTIES ("
          "'delta.logRetentionDuration'='interval {log_retention_days} days', "
          "'delta.deletedFileRetentionDuration'='interval {data_retention_days} days')".format(
              table_path="/path/to/table",
              log_retention_days=36500000,
              data_retention_days=36500000))
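
If you want to confirm the new settings took effect, you can inspect the table's metadata. A minimal PySpark sketch, assuming the Delta Lake SQL extensions are configured for the session; the table path is a placeholder:

# Sketch: check that the retention properties were applied.
# DESCRIBE DETAIL returns one row whose 'properties' column holds the table properties map.
detail = spark.sql("DESCRIBE DETAIL delta.`/path/to/table`")
detail.select("properties").show(truncate=False)
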
Swapnil Chougule

Databricks open sourced the Delta Lake project in April 2019 (the open source Delta Lake project has yet to get some functionality, such as data skipping). Details: Delta Lake, Docs, GitHub Repo

Delta is not a file format - it is a storage layer on top of Parquet data files and JSON metadata (transaction log) files.
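
To make that concrete, a Delta table on disk is a directory of Parquet data files plus a _delta_log/ folder of JSON commit files. A small sketch that lists them; the path is a placeholder:

import os

# Sketch: walk a Delta table directory (placeholder path) and print its files.
# Expect Parquet data files at the top level and JSON commits under _delta_log/.
for root, _, files in os.walk("/path/to/table"):
    for name in files:
        print(os.path.join(root, name))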

Delta doesn't delete data files automatically. The vacuum operation must be run to delete older, unreferenced (no longer active) files. So without running vacuum you can time travel infinitely, since all the data remains available. On the other hand, if you run vacuum with a 30-day retention, you can only access the last 30 days of data.
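
For reference, an explicit vacuum with a chosen retention window could look like this in PySpark (a sketch; the table path and the 720-hour window, i.e. 30 days, are placeholders):

from delta.tables import DeltaTable

# Sketch: delete data files no longer referenced by table versions within
# the last 720 hours (30 days). Time travel beyond that window is then lost.
delta_table = DeltaTable.forPath(spark, "/path/to/table")
delta_table.vacuum(720)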

Yes, it solves querying across dataset versions. Each version can be identified by a timestamp (or a version number). Sample queries to access the data of a specific version:

Scala:

val df = spark.read
  .format("delta")
  .option("timestampAsOf", "2020-10-01")
  .load("/path/to/my/table")

Python:

df = spark.read \
  .format("delta") \
  .option("timestampAsOf", "2020-10-01") \
  .load("/path/to/my/table")

SQL:

SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-10-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-10-01 01:30:00.000"
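
A version can also be addressed by its number rather than a timestamp. A PySpark sketch (the table path and version 0 are placeholders), first listing the available versions and then reading one:

# Sketch: list the table's versions, then load one by version number.
spark.sql("DESCRIBE HISTORY delta.`/path/to/my/table`") \
     .select("version", "timestamp", "operation") \
     .show(truncate=False)

df_v0 = spark.read \
    .format("delta") \
    .option("versionAsOf", 0) \
    .load("/path/to/my/table")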

(Note: I am using the open-sourced Delta Lake in production for multiple use cases.)