Given a compressed file written on the Hadoop platform in one of the following formats:
- Avro
- Parquet
- SequenceFile
How can I find the compression codec used? Assume that one of the following codecs was used (and that the file name has no extension):
- Snappy
- Gzip (not supported on Avro)
- Deflate (not supported on Parquet)
The Java implementation of Parquet includes the `parquet-tools` utility, which provides several commands. See its documentation page for building and getting started; more detailed descriptions of the individual commands are printed by `parquet-tools` itself. The command you are looking for is `meta`. It shows all kinds of metadata, including the compression. You can find an example of its output here, showing SNAPPY compression.

Please note that the compression algorithm does not have to be the same across the whole file. Different column chunks can use different compressions, so there is no single field for the compression codec; instead, there is one per column chunk. (A column chunk is the part of a column that belongs to one row group.) In practice, however, you will probably find the same compression codec being used for all column chunks.
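If you prefer to inspect the footer programmatically instead of using `parquet-tools`, a minimal sketch along these lines should work with parquet-mr (the class name and the way you pass the file path are illustrative; the exact API can vary slightly between versions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetCodecInspector {
    public static void main(String[] args) throws Exception {
        // Path to the Parquet file to inspect, passed as the first argument.
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            ParquetMetadata footer = reader.getFooter();

            // Each row group ("block") contains one column chunk per column,
            // and each column chunk records its own compression codec.
            for (BlockMetaData block : footer.getBlocks()) {
                for (ColumnChunkMetaData chunk : block.getColumns()) {
                    System.out.println(chunk.getPath() + " -> " + chunk.getCodec());
                }
            }
        }
    }
}
```

Each printed line pairs a column path with its codec (for example SNAPPY or GZIP), which also makes it easy to see whether all column chunks use the same compression.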
A similar utility exists for Avro, called `avro-tools`. I'm not that familiar with it, but it has a `getmeta` command which should show you the compression codec used.
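For Avro, the codec can also be read directly from the container file's metadata, which is stored under the `avro.codec` key. A minimal sketch (the class name is illustrative; if the key is absent, the file was written without an explicit codec, i.e. uncompressed):

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroCodecInspector {
    public static void main(String[] args) throws Exception {
        // Avro container file to inspect, passed as the first argument.
        File file = new File(args[0]);

        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            // The codec name (e.g. "snappy" or "deflate") lives in the
            // file-level metadata under "avro.codec".
            System.out.println(reader.getMetaString("avro.codec"));
        }
    }
}
```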