I am writing to Hadoop HDFS. The file has to be compressed using LZO, and it will also be appended to in real time.
The source file is a gzip file that is not stored in Hadoop. A batch process reads this gzip file, compresses it with LZO, and appends the result to HDFS. Does this rule out the possibility of using MapReduce?
How can we achieve this?
Thanks in advance for the help.
You can write directly to HDFS from custom Java code:
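Here is a minimal sketch of that approach (the local input path and the HDFS output path are placeholders; the ".deflate" extension makes CompressionCodecFactory pick the zlib-based DefaultCodec):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class HdfsCompressedWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The extension drives codec selection: ".deflate" for zlib,
            // ".lzo_deflate" once an LZO codec library is installed and registered.
            Path outputPath = new Path("/user/example/output.deflate");

            // Resolve the compression codec from the output file's extension.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(outputPath);

            // Create the file in HDFS and wrap the stream with the codec.
            OutputStream out = codec.createOutputStream(fs.create(outputPath));

            // Copy a local source file into the compressed HDFS stream,
            // closing both streams when done.
            InputStream in = new FileInputStream("/tmp/source.txt");
            IOUtils.copyBytes(in, out, 4096, true);
        }
    }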
This code works for zlib compression. For LZO compression, do you already have a Java library that can perform the compression for you (such as the hadoop-gpl-compression library)? If you install that library as described in its documentation, then all you need to do is amend the output path extension to ".lzo_deflate" and everything should just work. If you want to use another compression library, you can skip the CompressionCodecFactory block of code and wrap the output stream directly.
As for appending to the files: depending on your version of Hadoop, this may or may not be supported. You also need to consider whether your compression library supports concatenated files (GZip, for example, does, but there are some problems with earlier versions of Java / Hadoop in dealing with these types). If you do have a version of Hadoop that supports appending, and your compression library supports it, then amend the
fs.create(outputPath)
call to
fs.append(outputPath)
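As a rough sketch of that change, reusing fs, codec and outputPath from the example above, and assuming your Hadoop version has append enabled:

    // Re-open the existing HDFS file for appending instead of creating it,
    // then wrap the appended stream with the same codec. Each append writes
    // a new compressed stream after the existing data, so the file becomes a
    // concatenation of compressed streams - the codec must be able to read that.
    OutputStream out = codec.createOutputStream(fs.append(outputPath));
    // write the newly arrived data to 'out', then close it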