Delta Lake: How to time travel infinitely across dataset versions?


The Use Case:

  • Store versions of Large Datasets (CSV/Snowflake Tables) and query across versions

The Delta Lake documentation says that unless we run the vacuum command, we retain historical information in a Delta table, and that log files are deleted every 30 days (here).

Additional documentation states that we need both the log files and the data files to time travel (here).

Does this imply that we can only time travel back 30 days?

But isn't Delta a file format? How would it automatically delete its logs?

If yes, what other open source options can solve querying across dataset versions?


There are 2 answers

Siddharth Singh (Best Answer)

Just set the data and log retention settings to a very long period.

alter table delta.`/path/to/table` set TBLPROPERTIES (
  'delta.logRetentionDuration'='interval 36500000 days',
  'delta.deletedFileRetentionDuration'='interval 36500000 days'
)

The same statement issued from PySpark:

spark.sql("alter table delta.`{table_path}` set TBLPROPERTIES ("
          "'delta.logRetentionDuration'='interval {log_retention_days} days', "
          "'delta.deletedFileRetentionDuration'='interval {data_retention_days} days')".format(
              table_path="/path/to/table",
              log_retention_days=36500000,
              data_retention_days=36500000))
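
If you want to confirm the new settings took effect, you can inspect the table's metadata. A minimal PySpark sketch, assuming the Delta Lake SQL extensions are configured for the session; the table path is a placeholder:

# Sketch: check that the retention properties were applied.
# DESCRIBE DETAIL returns one row whose 'properties' column holds the table properties map.
detail = spark.sql("DESCRIBE DETAIL delta.`/path/to/table`")
detail.select("properties").show(truncate=False)
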
Swapnil Chougule

Databricks open sourced the Delta Lake project in April 2019 (the open source Delta Lake project has yet to get some functionality, such as data skipping). Details: Delta Lake, Docs, GitHub Repo

Delta is not a file format - it is a storage layer on top of Parquet data files and JSON metadata (transaction log) files.
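
To make that concrete, a Delta table on disk is a directory of Parquet data files plus a _delta_log/ folder of JSON commit files. A small sketch that lists them; the path is a placeholder:

import os

# Sketch: walk a Delta table directory (placeholder path) and print its files.
# Expect Parquet data files at the top level and JSON commits under _delta_log/.
for root, _, files in os.walk("/path/to/table"):
    for name in files:
        print(os.path.join(root, name))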

Delta doesn't delete data files automatically. The vacuum operation must be run to delete older, unreferenced (no longer active) files. So without running vacuum you can time travel infinitely, since all the data remains available. On the other hand, if you run vacuum with a 30-day retention, you can only access the last 30 days of data.
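
For reference, an explicit vacuum with a chosen retention window could look like this in PySpark (a sketch; the table path and the 720-hour window, i.e. 30 days, are placeholders):

from delta.tables import DeltaTable

# Sketch: delete data files no longer referenced by table versions within
# the last 720 hours (30 days). Time travel beyond that window is then lost.
delta_table = DeltaTable.forPath(spark, "/path/to/table")
delta_table.vacuum(720)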

Yes, it solves querying across dataset versions. Each version can be identified by a timestamp (or a version number). Sample queries to access the data of a specific version:

Scala:

val df = spark.read
  .format("delta")
  .option("timestampAsOf", "2020-10-01")
  .load("/path/to/my/table")

Python:

df = spark.read \
  .format("delta") \
  .option("timestampAsOf", "2020-10-01") \
  .load("/path/to/my/table")

SQL:

SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-10-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-10-01 01:30:00.000"
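
A version can also be addressed by its number rather than a timestamp. A PySpark sketch (the table path and version 0 are placeholders), first listing the available versions and then reading one:

# Sketch: list the table's versions, then load one by version number.
spark.sql("DESCRIBE HISTORY delta.`/path/to/my/table`") \
     .select("version", "timestamp", "operation") \
     .show(truncate=False)

df_v0 = spark.read \
    .format("delta") \
    .option("versionAsOf", 0) \
    .load("/path/to/my/table")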

(Note: I am using the open-sourced Delta Lake in production for multiple use cases.)