I'm trying to sort file names in natural (numeric) order, for example:
cat1.pdf, cat2.pdf, ... cat10.pdf ...
I'm currently sorting with the following Hadoop Streaming parameters:
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator
-D stream.num.map.output.key.fields=2
-D mapreduce.partition.keypartitioner.options="-k1,1"
-D mapreduce.partition.keycomparator.options="-k1,1 -k2,2 -V"
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Each record is a tab-separated key/value pair, with a string as the key and the file name as the value. The problem is that the secondary sort currently orders the file names lexicographically, so I get:
cat1.pdf, cat10.pdf, cat2.pdf, cat3.pdf, cat30.pdf ...
How can I get the file names sorted in natural order instead, like this:
cat1.pdf, cat2.pdf, cat3.pdf ... cat10.pdf, cat11.pdf ...
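To make the target ordering concrete, here is a minimal Python sketch of the comparison I'm after. It runs outside Hadoop, the helper name `natural_key` is my own invention for illustration, and I'm not assuming KeyFieldBasedComparator can do anything like this on its own:

```python
import re

def natural_key(name):
    """Split a name into text and digit chunks so digit runs compare as integers.

    Hypothetical helper for illustration only; not part of Hadoop.
    """
    # re.split with a capturing group keeps the digit runs in the result,
    # e.g. "cat10.pdf" -> ["cat", "10", ".pdf"] -> ["cat", 10, ".pdf"]
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

# The order I currently get from the job (plus cat11.pdf for illustration)
names = ["cat1.pdf", "cat10.pdf", "cat2.pdf", "cat3.pdf", "cat30.pdf", "cat11.pdf"]

natural = sorted(names, key=natural_key)
print(natural)
# ['cat1.pdf', 'cat2.pdf', 'cat3.pdf', 'cat10.pdf', 'cat11.pdf', 'cat30.pdf']
```

In other words, I want the digits inside each file name compared as numbers rather than character by character.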
I'm using Hadoop Streaming 2.7.1.