How do range queries work with Sorted String Tables?


I'm a bit confused. I cannot find any information about how to execute a range query against a Sorted String Table.

LevelDB and RocksDB support a range iterator that lets you scan over a range of keys, which is perfect for NoSQL. What I don't understand is how this is implemented efficiently.

The tables are sorted in memory (and on disk), so what algorithm or data structure allows a Sorted String Table to be queried efficiently over a range of keys? Do you just loop through the entries and rely on the cache lines being full of data?

Usually I would put a prefix tree in front, which gives me an index over the keys. But I am guessing Sorted String Tables do something different and take advantage of the sorting in some way.


1 Answer

Answered by midor

Each layer of the LSM (except for the first one) is internally sorted by key, so you can just keep an iterator into each layer and always advance the one pointing to the lexicographically smallest element (a toy sketch of this layout follows the diagrams below). The files of a layer look something like this on disk:

Layer N
---------------------------------------
File1    | File2 | File3 | ... | FileN     <- filename
n:File2  |n:File3|n:File4| ... |           <- next file
a-af     | af-b  | b-f   | ... | w-z       <- key range
---------------------------------------
aaron    | alex  | brian | ... | walter    <- value omitted for brevity, but these are key:value records
abe      | amy   | emily | ... | xena
...      | ...   | ...   | ... | ...
aezz     | azir  | erza  | ... | zoidberg
---------------------------------------

First Layer (either 0 or 1)
---------------------------------------
File1    | File2 |     ...     | FileK
alex     | amy   |     ...     | andy
ben      | chad  |     ...     | dick
...      | ...   |     ...     | ...
xena     | yen   |     ...     | zane
---------------------------------------
...
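To make the layout above concrete, here is a minimal Python sketch of a layer and its files. The names SSTableFile and Layer (and their fields) are made up for illustration; they are not LevelDB or RocksDB types.

import bisect
from dataclasses import dataclass, field

@dataclass
class SSTableFile:
    min_key: str                                   # key range of this file,
    max_key: str                                   # e.g. 'af' - 'b'
    entries: list = field(default_factory=list)    # sorted (key, value) pairs

@dataclass
class Layer:
    # Files of a layer below the first one have non-overlapping key ranges
    # and are ordered by those ranges.
    files: list = field(default_factory=list)

    def first_file_index_for(self, key):
        # The first file whose max_key is >= key is the first file that can
        # contain `key`, so a binary search over the max keys finds it.
        max_keys = [f.max_key for f in self.files]
        return bisect.bisect_left(max_keys, key)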

Assume that you are looking for everything in the range ag-d (exclusive). A "range scan" just finds the first matching element and then iterates through the files of the layer. So you find that File2 is the first file that can contain matching elements, and within it you skip ahead to the first entry that is >= 'ag'. You iterate over the rest of File2, then follow its next-file pointer (n:File3). You check the key range it contains, find that it holds more elements from the range you are interested in, and iterate it until you hit the first entry that is >= 'd'.

You do the same thing in every layer except the first. The files of the first layer are not sorted relative to each other, but each file is internally sorted, so you just keep an iterator per file. You also keep one more iterator for the current memtables (in-memory data, only persisted in a log).
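As a rough sketch of that walk, building on the toy SSTableFile/Layer classes above (and not on the real LevelDB/RocksDB code), a range scan over one sorted layer could look like this:

import bisect

def layer_range_scan(layer, start, end):
    # Start at the first file whose key range can contain `start`.
    idx = layer.first_file_index_for(start)
    while idx < len(layer.files):
        f = layer.files[idx]
        # Seek inside the file: a 1-tuple compares as smaller than any
        # (key, value) pair with the same key, so this lands on the first
        # entry whose key is >= start.
        pos = bisect.bisect_left(f.entries, (start,))
        for key, value in f.entries[pos:]:
            if key >= end:        # exclusive upper bound, e.g. 'd'
                return            # the layer is sorted, so we are done
            yield key, value
        idx += 1                  # continue with the next file of the layer

# Hypothetical data mirroring the Layer N diagram:
layer = Layer(files=[
    SSTableFile("a", "af", [("aaron", "v"), ("abe", "v"), ("aezz", "v")]),
    SSTableFile("af", "b", [("alex", "v"), ("amy", "v"), ("azir", "v")]),
    SSTableFile("b", "f", [("brian", "v"), ("emily", "v"), ("erza", "v")]),
])
print([k for k, _ in layer_range_scan(layer, "ag", "d")])
# -> ['alex', 'amy', 'azir', 'brian'] (stops at 'emily', which is >= 'd')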

This never becomes too expensive, because the first layer is typically compacted once it exceeds a small constant threshold. Since the files of every layer are sorted relative to each other and each file is internally sorted by key, you can just advance the smallest iterator until all iterators are exhausted. Apart from the initial seek, every step only has to look at a fixed number of iterators (assuming a naive approach) and is thus O(1). Most LSMs also employ a block cache, so the sequential reads typically hit the cache most of the time.
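As a rough illustration of that merge step (not the real merging machinery inside LevelDB/RocksDB), combining one already-bounded iterator per layer, plus one over the memtable, could look like this:

import heapq

def merged_range_scan(range_iterators):
    # Each input yields (key, value) in key order and is already restricted
    # to the requested range, e.g. one layer_range_scan per layer. heapq.merge
    # always emits the smallest current key next, which is exactly the
    # "advance the smallest iterator" step described above.
    last_key = None
    for key, value in heapq.merge(*range_iterators, key=lambda kv: kv[0]):
        # Simplification: keep only the first occurrence of a key. A real LSM
        # additionally has to make sure the surviving occurrence comes from
        # the newest layer and has to honor deletion markers.
        if key != last_key:
            yield key, value
            last_key = key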

Last but not least, be aware that this is mostly a conceptual explanation, because most implementations have a few extra tricks up their sleeves that make these things more efficient. You have to know which data is contained in which file ranges anyway when you do a major compaction, i.e., merge layer N into layer N + 1. Even the file-level operation may look quite different: RocksDB, for example, maintains a coarse index with the key offsets at the beginning of each file to avoid scanning over the often much larger key/value portion of the file.
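To illustrate that last idea, here is a hedged sketch of such a coarse index: a small sorted list of (first key of block, file offset) pairs, so a seek only needs a binary search over the index instead of a scan over key/value records. This shows the idea only, not RocksDB's actual index block format.

import bisect

def block_offset_for(sparse_index, start_key):
    # sparse_index: sorted list of (first_key_of_block, file_offset) pairs,
    # one entry per data block rather than one per key/value record.
    first_keys = [k for k, _ in sparse_index]
    # Last block whose first key is <= start_key; the first entry that is
    # >= start_key can only live in that block or a later one.
    idx = max(bisect.bisect_right(first_keys, start_key) - 1, 0)
    return sparse_index[idx][1]   # read the file from this offset, then scan

# Hypothetical index for a file with three data blocks:
index = [("a", 0), ("b", 4096), ("w", 8192)]
print(block_offset_for(index, "ag"))   # -> 0, 'ag' falls into the first block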