How to do a secondary sort on filenames with numbers in Hadoop Streaming?

Asked by user110977 on 15 November 2015 at 17:30 (400 views)

I'm trying to sort file names such as

    cat1.pdf, cat2.pdf, ... cat10.pdf ...

I'm currently sorting with the following parameters:

    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator 
    -D stream.num.map.output.key.fields=2 
    -D mapreduce.partition.keypartitioner.options="-k1,1" 
    -D mapreduce.partition.keycomparator.options="-k1,1 -k2,2 -V" 
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Each record is a tab-separated key/value pair, with a string as the key and the file name as the value. The problem is that the secondary sort currently orders the file names lexicographically, so I get

    cat1.pdf, cat10.pdf, cat2.pdf, cat3.pdf, cat30.pdf ...

How can I get the files sorted in numeric order instead, like this:

    cat1.pdf, cat2.pdf, cat3.pdf, ... cat10.pdf, cat11.pdf, ...

I'm using Hadoop Streaming 2.7.1.
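One possible workaround, offered as a sketch rather than a confirmed fix: as far as I know, `KeyFieldBasedComparator` documents only the `-n` and `-r` modifiers, so a version-style `-V` sort may not be available. Instead, the mapper could emit a zero-padded copy of the file name as the second key field, so that the plain lexicographic comparison on `-k2,2` already yields numeric order. The function names `natural_key`, `map_line`, and `main`, and the padding width of 10, are hypothetical choices for illustration. The sketch assumes the input lines look like `key<TAB>filename`.

```python
import re


def natural_key(filename, width=10):
    """Zero-pad every run of digits so that byte-wise ordering
    matches numeric ordering (assumes numbers have at most `width` digits)."""
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), filename)


def map_line(line):
    """Turn 'key<TAB>filename' into 'key<TAB>padded_name<TAB>filename'.

    The padded name becomes the second key field used for sorting;
    the original file name is kept as the value.
    """
    key, filename = line.rstrip("\n").split("\t", 1)
    return "\t".join((key, natural_key(filename), filename))


def main(stdin, stdout):
    """Streaming mapper loop: read records from stdin, write re-keyed records."""
    for line in stdin:
        if line.strip():
            stdout.write(map_line(line) + "\n")
```

To use this as a streaming mapper, the script would end with `import sys; main(sys.stdin, sys.stdout)`. With `stream.num.map.output.key.fields=2` unchanged, the first two fields form the key, partitioning stays on `-k1,1`, and the comparator options could be reduced to `-k1,1 -k2,2`. The reducer then reads the original file name from the third field.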
