I have an LZO-compressed file, /data/mydata.lzo, that I want to run through some MapReduce code I have. I first create an index file using the hadoop-lzo package with the following command:
hadoop jar hadoop-lzo-0.4.21.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/data/mydata.lzo
This runs successfully:
17/01/04 11:06:31 INFO mapreduce.Job: Running job: job_1472572940387_17794
17/01/04 11:06:41 INFO mapreduce.Job: Job job_1472572940387_17794 running in uber mode : false
17/01/04 11:06:41 INFO mapreduce.Job: map 0% reduce 0%
17/01/04 11:06:52 INFO mapreduce.Job: map 86% reduce 0%
17/01/04 11:06:54 INFO mapreduce.Job: map 100% reduce 0%
17/01/04 11:06:54 INFO mapreduce.Job: Job job_1472572940387_17794 completed successfully
and creates the file /data/mydata.lzo.index.
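A quick way to confirm that the index landed next to the original file in HDFS, using a plain hadoop fs listing:

hadoop fs -ls /data/mydata.lzo*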
I now want to run this file through some other Hadoop Java code:
hadoop jar myjar.jar -input /data/mydata.lzo
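For context, the driver inside myjar.jar is a standard new-API job setup, roughly like the sketch below (simplified; the class name is a placeholder and the mapper/reducer wiring is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {  // placeholder name
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "my job");
    job.setJarByClass(MyJobDriver.class);

    // Nothing LZO-specific here: no input format is set explicitly,
    // so the job falls back to the default input format.
    // job.setMapperClass(...) / job.setReducerClass(...) as usual.

    FileInputFormat.addInputPath(job, new Path(args[0]));  // /data/mydata.lzo
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}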
It executes correctly but takes FOREVER. I noticed it only creates a single split for the file (when I run the same job over comparable non-LZO files it creates about 25 splits):
mapreduce.JobSubmitter: number of splits:1
What am I doing wrong?
The hadoop-lzo documentation is a little lacking. It says "Now run any job, say wordcount, over the new file." I first thought I should use the /data/mydata.lzo.index file as my input, but that gives me empty output. The documentation also says "Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient." So for whatever reason the job is not seeing the index file.
What is the proper way to pass the index file?
EDIT: According to this issue on GitHub, the index file is automatically inferred and the file will be split according to its size. I am still not sure why I am getting a single split.
Try this: set the job's input format to hadoop-lzo's LzoTextInputFormat. The default TextInputFormat maps the .lzo extension to LzopCodec and, since that codec is not splittable as far as TextInputFormat is concerned, hands the entire file to a single mapper and never looks at the .index file. LzoTextInputFormat is the class that actually reads /data/mydata.lzo.index and builds multiple splits from it.
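Relative to a driver like the sketch in the question, the change is two lines; a minimal sketch, assuming the com.hadoop.mapreduce.LzoTextInputFormat class from the hadoop-lzo jar (for code on the old org.apache.hadoop.mapred API the equivalent class is com.hadoop.mapred.DeprecatedLzoTextInputFormat):

import com.hadoop.mapreduce.LzoTextInputFormat;  // ships in the hadoop-lzo jar

// In the driver, before submitting the job: this replaces the default
// TextInputFormat, and its split calculation consults the sibling
// /data/mydata.lzo.index file to split the file along indexed block boundaries.
job.setInputFormatClass(LzoTextInputFormat.class);

With that in place, mapreduce.JobSubmitter should report a split count in the same ballpark as the non-LZO runs instead of number of splits:1.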