Sequence order in the column-oriented formats chapter of the book Hadoop: The Definitive Guide?

On page 137 of Hadoop: The Definitive Guide, 4th edition, the book discusses column-oriented file formats and illustrates them with a figure.

In the RCFile layout, why is the sequence order of the numbers 1,4,2,5,3,6,7,10,8,11,9,12 rather than 1,4,7,10,2,5,8,11,3,6,9,12?

Answer by leftjoin (best answer)

First of all, RCFile is not a purely columnar file format; it is a Record Columnar File. RCFile and ORC files are both splittable. This means you do not have to read the whole file to get only a few rows, and the file can be read in parallel by many containers. This is why splits are needed.

A split contains rows that are grouped together, and each split can be read independently of the others. At the same time, the columns are grouped together inside each split. Similar data compresses better, so grouping the values of a column together improves compression. In your example one split contains only two rows, so the first split is stored column by column as 1,4,2,5,3,6 and the second as 7,10,8,11,9,12. In practice a split can contain 10000 or more rows.
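To make the ordering concrete, here is a minimal sketch in Python. It is only a model of the layout, not RCFile's actual writer. It assumes the table in the figure has four rows of three columns, (1,2,3), (4,5,6), (7,8,9), (10,11,12), which is inferred from the two sequences in the question; the function name `rcfile_order` is made up for illustration.

```python
def rcfile_order(rows, rows_per_group):
    """Return the values in the order a 'split rows, then store columns' layout writes them."""
    out = []
    # First cut the table horizontally into groups of rows (the splits).
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        # Inside one group, write the values column by column.
        for column in zip(*group):
            out.extend(column)
    return out


# Assumed logical table behind the figure: 4 rows, 3 columns.
table = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]

rcfile_like = rcfile_order(table, rows_per_group=2)             # two rows per split
pure_columnar = rcfile_order(table, rows_per_group=len(table))  # one split holding all rows
row_oriented = rcfile_order(table, rows_per_group=1)            # one row per split

print(rcfile_like)
print(pure_columnar)
print(row_oriented)
```

With two rows per split the sketch produces the order from the book, and with a single split covering the whole table it produces the order the question expected. The only difference between the two is the size of the split.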

Here is what the official documentation says about RCFile:

  • As row-store, RCFile guarantees that data in the same row are located in the same node.

  • As column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.

Also read about ORC. Using the indexes stored in ORC files, stripes can be filtered out at the lowest level, without reading their rows. This feature is called predicate pushdown.
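The idea behind stripe filtering can be sketched as follows. This is a simplified model, not the ORC reader API: the `Stripe` class, `build_stripes`, and `scan_greater_than` are hypothetical names, and the sketch assumes each stripe keeps only the minimum and maximum value of a single integer column.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    # Hypothetical stand-in for a stripe: min/max statistics plus the column values.
    min_value: int
    max_value: int
    values: list


def build_stripes(values, rows_per_stripe):
    """Cut a column into stripes and record min/max statistics for each one."""
    stripes = []
    for start in range(0, len(values), rows_per_stripe):
        chunk = values[start:start + rows_per_stripe]
        stripes.append(Stripe(min(chunk), max(chunk), chunk))
    return stripes


def scan_greater_than(stripes, threshold):
    """Evaluate 'value > threshold', skipping stripes whose statistics rule them out."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # no value in this stripe can match, so it is never read
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = build_stripes([1, 4, 7, 10, 13, 16], rows_per_stripe=2)
matches, stripes_read = scan_greater_than(stripes, threshold=9)
print(matches, stripes_read)
```

In this toy run the first stripe holds only 1 and 4, so the filter `value > 9` skips it based on the statistics alone and reads just the remaining two stripes.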