Given a compressed file written on the Hadoop platform in one of the following formats:
- Avro
- Parquet
- SequenceFile
How can I find the compression codec used? Assume that one of the following codecs was used (and that the file name has no extension):
- Snappy
- Gzip (not supported on Avro)
- Deflate (not supported on Parquet)
The Java implementation of Parquet includes the `parquet-tools` utility, which provides several commands. See its documentation page for building and getting started; more detailed descriptions of the individual commands are printed by `parquet-tools` itself. The command you are looking for is `meta`. It shows all kinds of metadata, including the compression. You can find an example of its output here, showing SNAPPY compression.

Please note that the compression algorithm does not have to be the same across the whole file. Different column chunks can use different compressions, so there is no single field for the compression codec; instead, there is one per column chunk. (A column chunk is the part of a column that belongs to one row group.) In practice, however, you will probably find the same compression codec being used for all column chunks.
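If you prefer to inspect the footer programmatically instead of using `parquet-tools`, a minimal sketch along these lines should work with parquet-mr (the class name and the way you pass the file path are illustrative; the exact API can vary slightly between versions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetCodecInspector {
    public static void main(String[] args) throws Exception {
        // Path to the Parquet file to inspect, passed as the first argument.
        Path path = new Path(args[0]);
        Configuration conf = new Configuration();

        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            ParquetMetadata footer = reader.getFooter();

            // Each row group ("block") contains one column chunk per column,
            // and each column chunk records its own compression codec.
            for (BlockMetaData block : footer.getBlocks()) {
                for (ColumnChunkMetaData chunk : block.getColumns()) {
                    System.out.println(chunk.getPath() + " -> " + chunk.getCodec());
                }
            }
        }
    }
}
```

Each printed line pairs a column path with its codec (for example SNAPPY or GZIP), which also makes it easy to see whether all column chunks use the same compression.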
A similar utility exists for Avro, called `avro-tools`. I'm not that familiar with it, but it has a `getmeta` command which should show you the compression codec used.
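For Avro, the codec can also be read directly from the container file's metadata, which is stored under the `avro.codec` key. A minimal sketch (the class name is illustrative; if the key is absent, the file was written without an explicit codec, i.e. uncompressed):

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroCodecInspector {
    public static void main(String[] args) throws Exception {
        // Avro container file to inspect, passed as the first argument.
        File file = new File(args[0]);

        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            // The codec name (e.g. "snappy" or "deflate") lives in the
            // file-level metadata under "avro.codec".
            System.out.println(reader.getMetaString("avro.codec"));
        }
    }
}
```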