How does YugabyteDB guarantee snapshot consistency during garbage collection?


For example:

  1. There are two items, k1 and k2, both written at time t1.
  2. Then, a read transaction (A) gets a snapshot whose time is t1, and transaction (A) successfully reads k1 at t1.
  3. At the same time, another transaction (B) writes k2 at time t2 (t2 > t1).
  4. YugabyteDB does garbage collection in some way, so the version of k2 at t1 gets deleted.
  5. If transaction (A) now reads k2 at time t1, it can't find any version of k2 with a timestamp less than or equal to t1 (sketched in code below).
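To make the failure mode concrete, here is a minimal sketch in plain Python (a hypothetical `NaiveMVCCStore`, not YugabyteDB code) of a multi-version store whose garbage collection keeps only the newest version of each key, reproducing steps 1-5:

```python
class NaiveMVCCStore:
    """Toy MVCC store: each key maps to a sorted list of (timestamp, value)."""

    def __init__(self):
        self.versions = {}

    def write(self, key, ts, value):
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort()

    def read(self, key, snapshot_ts):
        """Return the newest value with timestamp <= snapshot_ts, or None."""
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1][1] if visible else None

    def gc_naive(self):
        """Purge everything except the newest version of each key (no retention)."""
        for key in self.versions:
            self.versions[key] = self.versions[key][-1:]


store = NaiveMVCCStore()
store.write("k1", 1, "v1")            # step 1: k1 @ t1
store.write("k2", 1, "v1")            # step 1: k2 @ t1
snapshot_ts = 1                       # step 2: transaction A snapshots at t1
print(store.read("k1", snapshot_ts))  # "v1" -- A reads k1 successfully
store.write("k2", 2, "v2")            # step 3: transaction B writes k2 @ t2
store.gc_naive()                      # step 4: k2 @ t1 is purged
print(store.read("k2", snapshot_ts))  # None -- step 5: A's snapshot is broken
```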

I am confused about how YugabyteDB maintains a consistent snapshot.

I have searched through nearly all of YugabyteDB's transaction documentation, but I didn't find anything related to garbage collection.

I have seen some descriptions of Google Spanner's garbage collection, which keeps old versions for one hour. But YugabyteDB uses HLC instead of TrueTime.

Can anyone explain YugabyteDB's garbage collection mechanism? Is it the same as Spanner's?

1 Answer

Answered by Frits Hoogland:

The general mechanism that YugabyteDB uses for consistency is Hybrid Logical Clocks (HLC). See this presentation: HLC. Using HLC, a transaction can pick the version of a row consistent with its transaction time. I believe this is already known to you.
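For readers unfamiliar with HLC, here is a minimal sketch of the standard update rules from the Kulkarni et al. paper (an illustration only, not YugabyteDB's actual implementation). Each timestamp pairs a physical component `l` with a logical counter `c` and compares lexicographically, so timestamps never go backwards even when physical clocks drift:

```python
import time

class HLC:
    def __init__(self):
        self.l = 0  # highest physical time observed so far (microseconds)
        self.c = 0  # logical counter ordering events that share the same l

    def _physical_now(self):
        return int(time.time() * 1_000_000)

    def now(self):
        """Timestamp a local or send event."""
        pt = self._physical_now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1          # physical clock hasn't advanced; bump counter
        return (self.l, self.c)

    def update(self, l_msg, c_msg):
        """Advance the clock past a timestamp received from another node."""
        pt = self._physical_now()
        l_new = max(self.l, l_msg, pt)
        if l_new == self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0           # physical time won; counter resets
        self.l = l_new
        return (self.l, self.c)
```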

Because we use LSM-tree storage, no data is overwritten. An updated row means another entry for the same row, with a different HLC timestamp. This way, when a row is requested, a transaction can pick the row version consistent with its HLC.

Garbage collection, i.e. the purging of old, non-current versions of a row, happens during major compaction. Major compaction is the process of merging SST files. If lots of data is changed, major compactions could happen after a short amount of time, so we implemented a parameter/gflag, --timestamp_history_retention_interval_sec, to guarantee a minimum amount of time that a change remains available (see the sketch after this paragraph).
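Here is a sketch of what that retention guarantee could look like at compaction time (assumed semantics inferred from the flag's description, not YugabyteDB source): a superseded version of a key is dropped only once the version that replaced it is older than the retention window, so any snapshot taken inside the window can still find it.

```python
def compact(versions, now, retention_interval):
    """versions: sorted list of (timestamp, value) for one key.
    Keep the current (newest) version unconditionally; keep an older
    version unless it was superseded before the history cutoff."""
    history_cutoff = now - retention_interval
    kept = []
    for i, (ts, value) in enumerate(versions):
        is_current = (i == len(versions) - 1)
        # An old version is only needed by snapshots taken before it was
        # superseded; purge it once all such snapshots fall outside the
        # retention window.
        if is_current or versions[i + 1][0] > history_cutoff:
            kept.append((ts, value))
    return kept


versions = [(1, "v1"), (2, "v2")]                         # k2 @ t1, then k2 @ t2
print(compact(versions, now=3, retention_interval=5))     # both kept: a snapshot
                                                          # at t1 can still read k2 @ t1
print(compact(versions, now=100, retention_interval=5))   # [(2, "v2")]: t1 version purged
```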

Obviously, if the amount of changes is low, a non-current version could remain available for a long time.

Compaction happens per RocksDB database, which means per tablet of a table or secondary index.