Delta Lake Size Requirements


Does Delta Lake store changes within a file, or only whole files? Does it simply record the deletion of the previous table and the addition of a new, modified one, or does it record inserts, updates, and deletes at the row level?

Suppose you have a 100MiB table, you change a single row representing 1KiB of data, and you make 100 such changes. Will that take up approximately 100 * 1KiB of space, or 100 * 100MiB?

This may depend on the engine, so an answer that differs by engine is acceptable.
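To make the two bounds in the question concrete, here is a quick back-of-the-envelope calculation in plain Python (the sizes and change count are taken from the question itself):

```python
KIB = 1024
MIB = 1024 * KIB

table_size = 100 * MIB   # whole table
row_size = 1 * KIB       # size of one changed row
n_changes = 100

# Best case: only the changed rows are stored per change.
row_level_bytes = n_changes * row_size        # 100 KiB

# Worst case: the whole table is rewritten per change.
table_level_bytes = n_changes * table_size    # ~9.77 GiB

print(row_level_bytes, table_level_bytes)
```

The gap between the two bounds is a factor of 100MiB / 1KiB = 102,400, which is why the answer matters in practice.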


There are 2 answers

Test On

Delta Lake stores the history efficiently as changes to individual rows, not a copy of the table for every change. So it will take O(100 * 1KiB) of space.
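One way to check what Delta actually records per change is to inspect the transaction log directly: each commit in `_delta_log/` is a newline-delimited JSON file of actions, including file-level `add` and `remove` entries. A minimal sketch (the table path and helper name are hypothetical; only the log format is assumed):

```python
import glob
import json
import os

def summarize_commits(table_path):
    """Return (commit file name, add count, remove count) per commit
    by scanning the newline-delimited JSON files in _delta_log/."""
    summary = []
    pattern = os.path.join(table_path, "_delta_log", "*.json")
    for commit in sorted(glob.glob(pattern)):
        adds = removes = 0
        with open(commit) as f:
            for line in f:
                action = json.loads(line)
                adds += "add" in action
                removes += "remove" in action
        summary.append((os.path.basename(commit), adds, removes))
    return summary
```

If a 1KiB change produces an `add` for a file much larger than 1KiB (and a matching `remove`), the table is doing file-level copy-on-write rather than row-level storage.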

Christopher Grant On

With Deletion Vectors enabled (version 2.3.0+ of the Delta Spark connector), only row-level deltas are written out, so mutating operations like DELETE and UPDATE do not always rewrite entire files.
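Deletion vectors are opt-in per table, via the `delta.enableDeletionVectors` table property. A sketch of enabling them from PySpark (assumes an existing `SparkSession` named `spark` with Delta configured, and a hypothetical table name `my_table`):

```python
# Config fragment, not a runnable script: requires a live SparkSession
# with delta-spark >= 2.3.0 on the classpath.
spark.sql("""
    ALTER TABLE my_table
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
```

After this, eligible DELETE/UPDATE operations can mark rows as removed in a small sidecar vector file instead of rewriting the Parquet files that contain them.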

Without this feature enabled, you'd rewrite the entire file containing the changed row. Taking your example, and assuming the changed row happens to live in a 1KiB file, you'd rewrite that file 100 times, resulting in 100KiB of re-writes.

With deletion vectors, you'd write a fraction of this. I found a single-value vector file to be ~50 bytes, about 1/20 of the 1KiB in your scenario, resulting in roughly 5KiB of writes total. That ratio is already large, but it grows with real deployments, where files are in the MiB/GiB range, making the difference in data written 5-6 orders of magnitude.
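The arithmetic above, spelled out (the ~50-byte vector size is the answer's own measurement, not a spec guarantee):

```python
KIB = 1024
MIB = 1024 * KIB

n_changes = 100
dv_bytes = 50                          # observed single-value deletion vector

small_file_rewrites = n_changes * 1 * KIB    # 100 KiB without DVs
dv_writes = n_changes * dv_bytes             # 5,000 B, ~5 KiB with DVs
small_ratio = (1 * KIB) // dv_bytes          # ~20x for a 1KiB file

# With a realistically sized 1GiB file, the per-change saving explodes:
big_ratio = (1024 * MIB) // dv_bytes

print(small_file_rewrites, dv_writes, small_ratio, big_ratio)
```

For the 1GiB file, the ratio is about 2 * 10^7, i.e. 7 orders of magnitude, which is in the same ballpark as the 5-6 orders of magnitude cited above.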