How does HBase performs a lookup and retrieves a record ? e.g., what is the equivalent in HBase for the RDBMS's B-Trees?
[EDIT]
I understand how HBase resolves the -ROOT-, and .META. tables to find out which region holds the data. But how is the local lookup performed?
To better illustrated, here's an example:
- I am starting a search (get or scan) for record with key 77.
- HBase client figures that the key is contains in the 50-100 region, held by RegionServer X.
- HBase client contacts RegionServer X to get the data.
How does RegionServer X finds out the location of record 77 ?
Does the RegionServer uses some kind of lookup table (e.g. like the RDBMS's B-Trees ?) for the keys of a region ? Or does it need to read all contents of the StoreFiles, for records from 50 to 77 ?
TL;DR: it looks like HBase (like BigTable), uses as a structure similar to a B+ tree to do the lookup. So the row-key is the primary index (and the only index of any sort in HBase by default.)
Long answer: From this Cloudera blog post about HBase write path, it looks like HBase operates the following way:
There's some more detail in the next paragraph:
From another Cloudera blog post, it looks like the exact format used to save the HBase on the file system keeps changing, but the above mechanism for row key lookups should be more or less consistent.
This mechanism is very, very similar to Google BigTable's lookup (you will find the details in Section 5.1 starting at the end of page 4 on the PDF), which uses a three-level heirarchy to query the row-key location: Chubby -> Root tablet -> METADATA tablets -> actual tablet
UPDATE: to answer the question about lookups inside a Region Server itself: I don't know for sure, but since the row keys are sorted, and HBase knows the start and end keys, I suspect it uses a binary search or interpolation search, both of which are really fast - log(n) and log(log(n)) respectively. I don't think HBase would ever need to scan rows from the start row key to the one that it needs to find, since searches on sorted keys is well-known problem that has several efficient solutions.