Hadoop LZO single split after index


I have an LZO-compressed file /data/mydata.lzo and want to run it through some MapReduce code I have. I first create an index file using the hadoop-lzo package with the following command:

hadoop jar hadoop-lzo-0.4.21.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer \
    /data/mydata.lzo

This runs successfully

17/01/04 11:06:31 INFO mapreduce.Job: Running job: job_1472572940387_17794
17/01/04 11:06:41 INFO mapreduce.Job: Job job_1472572940387_17794 running in uber mode : false
17/01/04 11:06:41 INFO mapreduce.Job:  map 0% reduce 0%
17/01/04 11:06:52 INFO mapreduce.Job:  map 86% reduce 0%
17/01/04 11:06:54 INFO mapreduce.Job:  map 100% reduce 0%
17/01/04 11:06:54 INFO mapreduce.Job: Job job_1472572940387_17794 completed successfully

and creates the file /data/mydata.lzo.index. I now want to run the LZO file through some other Hadoop Java code:

hadoop jar myjar.jar -input /data/mydata.lzo

It executes correctly but takes forever. I noticed it only creates a single split for the file (when I run the same job over non-LZO files it creates about 25 splits):

mapreduce.JobSubmitter: number of splits:1

What am I doing wrong?

The hadoop-lzo documentation is a little lacking. It says "Now run any job, say wordcount, over the new file". I first thought I should use the /data/mydata.lzo.index file as my input, but that produces empty output. The documentation also says "Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient." So for whatever reason the job is not seeing the index file.
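
One way to check whether the index file is even readable is hadoop-lzo's LzoIndex helper. A small sketch (the readIndex/isEmpty/getNumberOfBlocks calls are from hadoop-lzo 0.4.x and worth verifying against the version in use):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.hadoop.compression.lzo.LzoIndex;

    public class IndexCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path lzoFile = new Path("/data/mydata.lzo");
            FileSystem fs = lzoFile.getFileSystem(conf);

            // readIndex returns an empty LzoIndex (rather than throwing)
            // when no .lzo.index file is found next to the .lzo file.
            LzoIndex index = LzoIndex.readIndex(fs, lzoFile);
            System.out.println("index found: " + !index.isEmpty());
            System.out.println("compressed blocks: " + index.getNumberOfBlocks());
        }
    }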

What is the proper way to pass the index file?

EDIT: According to this issue on GitHub, the index file is automatically inferred and the input will be split according to the file size. Still not sure why I am getting a single split.
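
As far as I can tell, the index is only consulted by hadoop-lzo's own input formats (com.hadoop.mapreduce.LzoTextInputFormat for the new MapReduce API), not by the stock TextInputFormat, which treats a .lzo file as an ordinary unsplittable compressed file. A minimal map-only driver for testing the split count would look roughly like this (class name and output path are placeholders, not the actual myjar.jar code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import com.hadoop.mapreduce.LzoTextInputFormat;

    public class LzoSplitCheck {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "lzo split check");
            job.setJarByClass(LzoSplitCheck.class);

            // The line that matters: with the default TextInputFormat the
            // .lzo.index file is never consulted and the whole .lzo file
            // becomes one split.
            job.setInputFormatClass(LzoTextInputFormat.class);

            // Identity map-only job: the "number of splits:N" line in the
            // JobSubmitter output shows whether the index was honoured.
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path("/data/mydata.lzo"));
            FileOutputFormat.setOutputPath(job, new Path("/data/splitcheck-out"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }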

1 Answer

Codefor:

Try passing both the .lzo file and its index file as inputs:

hadoop jar myjar.jar -input /data/mydata.lzo -input /data/mydata.lzo.index
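
If the job still lands on a single split after this, it is worth double-checking that myjar.jar sets LzoTextInputFormat as the job's input format (see the sketch in the question's edit); with the default TextInputFormat the .lzo.index file is never consulted and the whole file stays in one split.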