Sequence order in the column-oriented formats chapter of the book Hadoop: The Definitive Guide?

Question

163 views Asked by callofdutyops At 05 October 2018 at 17:17

On page 137 of Hadoop: The Definitive Guide, 4th edition, the book discusses column-oriented file formats and illustrates them with a figure.

In the RCFile layout, why is the sequence order of the numbers 1,4,2,5,3,6,7,10,8,11,9,12 rather than 1,4,7,10,2,5,8,11,3,6,9,12?

1 answer

**leftjoin** · Accepted Answer · 2018-10-05T19:38:29+00:00

First of all, RC is not a purely columnar file format; it is a Record Columnar file format. RC files, like ORC files, are splittable. This means you do not have to read the whole file to get only a few rows, and the file can be read in parallel by many containers. This is why we need splits.

Each split contains a group of rows, and splits can be read independently of each other. Inside each split, the values are grouped by column. Similar data compresses better, so grouping a column's values together improves compression. In your example one split contains only two rows, which is why the order switches back to the first column after every two rows, but in practice a split can contain 10000 or more rows.
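To make the ordering concrete, here is a minimal sketch in Python. It assumes the table in the book's figure has four rows and three columns holding the numbers 1 to 12 (inferred from the two sequences in the question), and `rcfile_order` is a hypothetical helper written for this illustration, not part of any Hadoop API.

```python
def rcfile_order(rows, rows_per_group):
    """Flatten a table record-columnar style: row groups first,
    then column by column inside each group (illustrative only)."""
    out = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        # zip(*group) transposes the group, yielding one tuple per column
        for column in zip(*group):
            out.extend(column)
    return out


# Assumed contents of the figure: 4 rows x 3 columns
table = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]

# Two rows per split, as in the book's picture
print(rcfile_order(table, 2))           # [1, 4, 2, 5, 3, 6, 7, 10, 8, 11, 9, 12]

# One split covering the whole table gives the purely columnar order
print(rcfile_order(table, len(table)))  # [1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12]
```

With two rows per split you get the order shown in the book; the order you expected only appears when a single split covers all rows.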

What the official documentation says about RC file:

As row-store, RCFile guarantees that data in the same row are located in the same node.
As column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.

Also read about ORC. Using the indexes stored in an ORC file, stripes that cannot match a filter can be skipped at the lowest level, before their rows are read. This feature is called predicate pushdown.
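The idea behind stripe filtering can be sketched in a few lines of Python. This is a simplified illustration under the assumption that each stripe keeps the minimum and maximum value of a column; `Stripe`, `make_stripe` and `scan_greater_than` are hypothetical names for this example and do not correspond to the real ORC reader API.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    min_value: int   # smallest value of the column in this stripe
    max_value: int   # largest value of the column in this stripe
    rows: list       # the column values themselves


def make_stripe(values):
    """Build a stripe and record its min/max statistics (hypothetical helper)."""
    return Stripe(min(values), max(values), list(values))


def scan_greater_than(stripes, threshold):
    """Return values > threshold, skipping stripes whose statistics rule them out."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # no value in this stripe can match, so skip it entirely
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


stripes = [make_stripe([1, 4, 2]), make_stripe([7, 10, 8]), make_stripe([11, 9, 12])]
matches, stripes_read = scan_greater_than(stripes, 8)
print(matches, stripes_read)  # [10, 11, 9, 12] 2
```

Here the first stripe is never read, because its maximum (4) already shows that no row can satisfy the filter.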
