Print lines that have no duplicates in a file and preserve input order (Linux)


I have the following file:

2
1
4
3
2
1

I want the output like this (only the lines that have no duplicates, preserving their original order):

4
3

I tried sort file.txt | uniq -u. It works, but the output is sorted:

3
4

I tried awk '!x[$0]++' file.txt. It keeps the order, but it prints every value once instead of dropping the values that have duplicates:

2
1
4
3

There are 5 answers

markp-fuso (2 votes, accepted answer)

A couple ideas to choose from:

a) read the input file twice (only the counts are held in memory; FNR==NR is true only while the first copy of the file is being read, because FNR resets for each input file while NR keeps counting):

awk '
FNR==NR         { counts[$0]++; next }  # 1st pass: keep count
counts[$0] == 1                         # 2nd pass: print rows with count == 1
' file.txt file.txt

b) read the input file once (every line is held in memory until the END block):

awk '
    { lines[NR] = $0                    # maintain ordering of rows
      counts[$0]++
    }
END { for ( i=1;i<=NR;i++ )             # run thru the indices of the lines[] array and ...
          if ( counts[lines[i]] == 1 )  # if the associated count == 1 then ...
             print lines[i]             # print the array entry to stdout
    }
' file.txt

Both of these generate:

4
3
pmf (6 votes)

Here's an approach using only awk that reads the input just once, yet doesn't need to store the entire file in memory (only one entry per distinct line):

  • fo records each line's first occurrence: if the line isn't registered yet (!fo[$0]), its line number is saved (fo[$0]=NR).
  • fq counts each line's frequency and is incremented for every line read (fq[$0]++).
  • The not-yet-incremented value of fq[$0] doubles as a condition: it is 0 (false) only on a line's first occurrence, and truthy on every repetition, in which case the line's first-occurrence record is discarded (delete fo[$0]).
  • At the end, fo contains only the lines of relevance (those occurring exactly once), with the lines' contents as indices and the line numbers of their first occurrences as values. So, to finish, only the array's indices need to be printed in ascending order of their numeric values. One way to achieve this is asorti (available in GNU Awk 4+) with the predefined sort ordering "@val_num_asc", which sorts numerically by value in ascending order.
awk '
  !fo[$0]  { fo[$0] = NR }       # register the line's first occurrence
  fq[$0]++ { delete fo[$0] }     # line repeated: discard its record
  END      { n = asorti(fo, fo, "@val_num_asc")   # order by first-occurrence line number
             for (i = 1; i <= n; i++) print fo[i] }
' file.txt
4
3
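
For reference, a minimal standalone sketch (not part of the answer; the array contents are invented for the demo) of what asorti with "@val_num_asc" does: the destination array receives the source's indices as values, ordered by the source's numeric element values.

gawk 'BEGIN {
  a["x"] = 30; a["y"] = 10; a["z"] = 20   # index -> numeric value
  n = asorti(a, b, "@val_num_asc")        # b[1..n] = indices of a, ordered by value
  for (i = 1; i <= n; i++) print b[i]     # prints: y, z, x
}'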
amphetamachine (2 votes)

Using only Bash built-ins (Bash 4+, for the associative array), you can do this in just a few lines:

declare -A SEEN=()                          # associative array: line -> count
while IFS= read -r LINE; do
    (( ++SEEN[_$LINE] ))                    # first pass: count every line
done < file.txt
while IFS= read -r LINE; do
    if [[ ${SEEN[_$LINE]} -eq 1 ]]; then    # second pass: print lines seen exactly once
        printf -- '%s\n' "$LINE"
    fi
done < file.txt

Note: The _$LINE as the subscript is to handle empty lines correctly.
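
For illustration, a minimal sketch (hypothetical, not from the answer) of the failure the prefix avoids: inside (( )), $LINE is expanded before the arithmetic is parsed, so an empty line would leave an empty subscript behind.

declare -A SEEN=()
LINE=''
# (( ++SEEN[$LINE] ))    # expands to (( ++SEEN[] )): bad array subscript
(( ++SEEN[_$LINE] ))     # expands to (( ++SEEN[_] )): a valid key
echo "${SEEN[_]}"        # prints 1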

dawg (0 votes)

Here is a Ruby solution:

ruby -lne 'BEGIN{cnt=Hash.new {|h,k| h[k] = 0} } 
cnt[$_]+=1
END{puts cnt.select{|k,v| v==1}.keys.join("\n") }
' file 

Prints:

4
3

Or, in one read of the file:

ruby -e 'puts $<.read.split(/\R+/).
            group_by{|x| x}.select{|k,v| v.length==1}.keys.join("\n")
' file 
# same output

Unlike awk arrays, Ruby hashes maintain insertion order.
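
A minimal one-pass sketch relying on that insertion order (my variant, assuming the same input file, not code from the answer):

ruby -lne 'BEGIN { cnt = Hash.new(0) }   # default count of 0
cnt[$_] += 1                             # count each line
END { cnt.each { |line, n| puts line if n == 1 } }
' file
# same output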

If you want a one pass awk you could do:

awk 'BEGIN{OFS="\t"}
{ if (seen[$0]++) delete order[$0]; else order[$0]=FNR } 
END { for ( e in order ) print order[e], e } ' file | sort -nk 1,1 | cut -f2-
# same output

(Thanks Ed Morton for a better awk!)

pmf (0 votes)

I tried sort file.txt | uniq -u. It works, but the output is sorted

You could take that output and use it as a list of newline-delimited patterns for grep -f on the original file. Use -Fx to match the patterns as whole-line fixed strings (not regular expressions). Note that this reads file.txt twice: once through sort and once through grep.

sort file.txt | uniq -u | grep -Fxf- file.txt
4
3