I have a folder in my HDFS system that contains text files compressed with the Snappy codec.
Normally, when reading GZIP-compressed files in a Hadoop Streaming job, decompression happens automatically. With Snappy-compressed data, however, it does not, and I cannot process the files.
How can I read these files and process them in Hadoop Streaming?
Many thanks in advance.
UPDATE:
If I use the command hadoop fs -text file, it works. The problem occurs only with Hadoop Streaming: the data is not decompressed before it is passed to my Python script.
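To show what I tried: the following checks whether the Snappy codec is registered and whether a minimal streaming job decompresses the input before the mapper sees it. The io.compression.codecs property, hdfs getconf, and the streaming options are standard Hadoop; the jar location and HDFS paths are assumptions for my setup.

```shell
# 1) Confirm the Snappy codec is registered (property name is a standard
#    Hadoop setting; the expected class is org.apache.hadoop.io.compress.SnappyCodec):
hdfs getconf -confKey io.compression.codecs

# 2) Run a minimal streaming job with an identity mapper to see whether
#    the framework decompresses the input before handing it to the mapper.
#    The jar path and the HDFS paths below are assumptions.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /data/input-snappy \
  -output /tmp/snappy-test-out \
  -mapper /bin/cat \
  -reducer NONE
# If the job output contains raw binary instead of text,
# the input was not decompressed for the mapper.
```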
I think I have an answer to the problem. It would be great if someone can confirm this.
While browsing the Cloudera blog, I found an article explaining the Snappy codec. As it states:
Therefore, a file compressed in HDFS with the Snappy codec can be read with
hadoop fs -text
but not in a Hadoop Streaming (MapReduce) job.
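If that is confirmed, a workaround I am considering is to decompress the files outside the job with hadoop fs -text (which applies the registered codec) and re-upload them as plain text, so the streaming job can read them directly. The paths and filenames below are illustrative.

```shell
# Decompress one Snappy file to a local plain-text copy;
# -text applies the codec matching the .snappy extension.
hadoop fs -text /data/input-snappy/part-00000.snappy > part-00000.txt

# Upload the decompressed copy to a directory the streaming job will read.
hadoop fs -mkdir -p /data/input-plain
hadoop fs -put part-00000.txt /data/input-plain/
```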