Why does 'sort' seem to sort a field incorrectly based on the presence or absence of a different field?


I found some data which seems to behave strangely in 'sort'. When doing a numerical sort on the first field of a csv file, the presence or absence of the 4th column causes the 7th line to be sorted incorrectly.

I'm using GNU sort 8.21 on Slackware64-current.

Data: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3

This works:

cut -d , -f 1-3 < weird.csv | sort -t , -k 1n

This does not work:

cat weird.csv | sort -t , -k 1n

The 7th line seems to be sorted incorrectly.

I can't seem to find any obvious explanation for this behavior. Using 'g' instead of 'n' has the behavior I would expect, but I'm not clear on what the difference is between 'g' and 'n'.

Answer by Sitwon (best answer):

I found out what I was doing wrong. Detailed explanation provided here: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19021

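In short, I should have used '-k 1,1n' so that the sort key starts and ends at field 1. Because I didn't specify an ending field, the key ran from field 1 to the end of the line. My locale also treats the comma as a thousands separator, so 'sort -n' silently skipped the commas and read the digits of the following fields as part of the same number. It wasn't comparing the numbers I thought it was comparing. This also explains why 'g' behaved as expected: general-numeric sorting does not accept thousands separators, so it stops reading at the first comma.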
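
A small sketch of the difference, using made-up sample rows rather than the data from the gist. The `sample` helper is hypothetical and exists only for this illustration. The output of the open-ended key depends on the active locale, so no result is claimed for that command.

```shell
# Hypothetical sample rows (not taken from the gist).
sample() {
    printf '%s\n' '10,5,a' '9,700,b' '100,1,c'
}

# Key restricted to field 1: rows come out as 9, 10, 100 in any locale.
sample | sort -t , -k 1,1n

# Open-ended key (field 1 to end of line). In a locale whose thousands
# separator is ',', -n may read "9,700" as 9700 and put that row last.
sample | sort -t , -k 1n

# Same open-ended key in the C locale, which has no thousands separator,
# so -n stops at the first comma and the order is 9, 10, 100 again.
sample | LC_ALL=C sort -t , -k 1n
```

With GNU sort, adding '--debug' to any of these commands underlines the part of each line that is actually used as the key, which makes an over-long key easy to spot.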