How to write to Hadoop HDFS using LZO compression without MapReduce


I am writing to Hadoop HDFS. The file has to be compressed using LZO, and it will also be appended to in real time.

The source is a gzip file that is not present in Hadoop. A batch process reads this gzip file, compresses the data with LZO, and appends it to the file in Hadoop. Does this rule out the possibility of using MapReduce?

How can we achieve this?

Thanks in advance for the help.

1 Answer

Answer from Chris White:

You can write directly to HDFS from custom Java code:

import java.io.OutputStream;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HdfsWrite extends Configured implements Tool {
    public int run(String[] args) throws Exception {

        // create an HDFS file system handle from the configuration
        FileSystem fs = FileSystem.get(getConf());

        // create an output stream to write to a new file in hdfs;
        // the extension drives the codec lookup below (".deflate" maps to the zlib DefaultCodec)
        Path outputPath = new Path("/path/to/file/in/hdfs.deflate");
        OutputStream outputStream = fs.create(outputPath);

        // now wrap the output stream with a zlib compression codec
        CompressionCodecFactory codecFactory = new CompressionCodecFactory(getConf());
        CompressionCodec codec = codecFactory.getCodec(outputPath);
        CompressionOutputStream compressedOutput = codec.createOutputStream(outputStream);

        // send content to file via compressed output stream using .write methods
        // ..

        // close out stream
        compressedOutput.close();

        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new HdfsWrite(), args));
    }
}
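
To fill in the elided write step for this particular use case, you can decompress the local gzip source and copy its bytes straight into the compressed output stream. A minimal sketch, assuming a hypothetical local path /tmp/source.gz and using java.util.zip.GZIPInputStream, java.io.FileInputStream and org.apache.hadoop.io.IOUtils (all of which would need to be imported):

// sketch only: "/tmp/source.gz" is a placeholder for the local gzip source file
InputStream gzipIn = new GZIPInputStream(new FileInputStream("/tmp/source.gz"));
try {
    // stream the decompressed bytes into the compressed HDFS stream;
    // 4096 is the copy buffer size, false leaves both streams open
    // so compressedOutput can be closed as in the code above
    IOUtils.copyBytes(gzipIn, compressedOutput, 4096, false);
} finally {
    gzipIn.close();
}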

The code above works for zlib compression. For LZO compression, do you already have a Java library that can perform the compression for you (such as the hadoop-gpl-compression library)? If you install that library as its documentation describes, then all you need to do is change the output path extension to ".lzo_deflate" and everything should just work. If you want to use another compression library, you can skip the CompressionCodecFactory block of code and wrap outputStream directly.
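
For the "wrap outputStream directly" route with LZO, one option is to instantiate the codec explicitly instead of relying on the extension lookup. A sketch, assuming the hadoop-gpl-compression / hadoop-lzo jar and its native libraries are installed and that the codec class is com.hadoop.compression.lzo.LzopCodec (the class name may differ in your distribution):

// sketch: the codec class name is an assumption based on common
// hadoop-lzo packaging; adjust it to whatever your library provides
CompressionCodec lzoCodec = (CompressionCodec) ReflectionUtils.newInstance(
        Class.forName("com.hadoop.compression.lzo.LzopCodec"), getConf());

// wrap the raw HDFS output stream directly, bypassing the factory lookup
CompressionOutputStream lzoOutput = lzoCodec.createOutputStream(outputStream);

(ReflectionUtils here is org.apache.hadoop.util.ReflectionUtils, which instantiates the class and hands it the configuration.)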

As for appending to the files: depending on your version of Hadoop this may or may not be supported. You also need to consider whether your compression library supports concatenated files (GZip, for example, does, but there are some problems with earlier versions of Java / Hadoop in dealing with these types). If you do have a version of Hadoop that supports appending, and your compression library supports it, then amend the fs.create(outputPath) call to fs.append(outputPath), as sketched below.
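
A sketch of the append variant: only the way the raw stream is opened changes, and on some older releases a property such as dfs.support.append must be enabled on the cluster.

// sketch: requires a Hadoop version / configuration that permits appends
OutputStream outputStream = fs.append(outputPath);
CompressionOutputStream compressedOutput = codec.createOutputStream(outputStream);
// ... write the new data, then close as before
compressedOutput.close();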