Sorting one file based on the order of another file in awk

105 views Asked by At

I have the following files: f1.txt:

1 1  
1 2   
3 3   
3 4   
2 5  
2 6  

f2.txt:

1  
2   
3  

I want to sort f1.txt according to f2.txt first column match, expected output is below

1 1  
1 2   
2 5   
2 6  
3 3  
3 4  

I tried something like

awk '{if(NR==FNR) {a[$1]=$0} else {b[$1]=$0} for(i in b) print a[i]}' f2.txt f1.txt 

But the results I'm getting are not what I expected.

Appreciate if someone can help

3

There are 3 answers

0
Ed Morton On

Using any awk:

$ awk -v ORS= '
    NR==FNR { a[$1] = a[$1] $0 RS; next }
    { print a[$1] }
' f1.txt f2.txt
1 1
1 2
2 5
2 6
3 3
3 4
0
ufopilot On
awk '
    FNR==NR { a[NR]=$1; next }
    { b[$1] = (b[$1] ? b[$1]"\n" $0 : $0) }
    END{ for (i=1; i<=length(a); i++) print b[a[i]] }
' f2.txt f1.txt 

1 1  
1 2
2 5
2 6
3 3
3 4
5
jhnc On

Generalising @ufopilot's answer to handle non-numbers:

awk '
    FNR==NR { a[++i]=$1; next }
    { b[$1] = b[$1] $0 RS }
    END { ORS=""; for (j=1; j<=i; j++) print b[a[j]] }
' f2.txt f1.txt

Note that if f2.txt has duplicated entries, corresponding lines will also be printed multiple times. If that is not desired, clear b[a[j]] after it is first printed.


If f1.txt is too big to fit in memory but f2.txt will fit comfortably, then you can decorate the lines using awk, then use sort to reorder, then cut off the extra column:

awk '
    FNR==NR { o[$1]=++i; next }
    { print o[$1] "\t" $0 }
' f2.txt f1.txt |
sort -s -n -k1,1 |
cut -f 2-

In this version, if f2.txt contains duplicated entries, only the final duplicate is used. So

1
3
2
3

will be treated as:

1
2
3

Reformating your code:

awk '
    {
        if (NR==FNR) {
            a[$1]=$0
        } else {
            b[$1]=$0
        }
        for (i in b) print a[i]
    }
' f2.txt f1.txt 

There are several issues.

  • for (i in b) can return elements of b in arbitrary order (gawk should return a repeatable ordering, but POSIX-standard awk may not)
  • you should be doing for (i in a) anyway (otherwise you are ignoring the f2.txt ordering entirely)
  • print a[i] tries to print lines from f2.txt but you want lines from f1.txt
  • you overwrite b[i] when you should append to it
  • the for loop runs for every line but should only run at the end

Applying fixes for these issues might give:

gawk '
    {
        if (NR==FNR) {
            a[$1]
        } else {
            b[$1] = b[$1] $0 RS
        }
    }
    END {
        for (i in a) print b[i]
    }
' f2.txt f1.txt

gives:

1 1
1 2

2 5
2 6

3 3
3 4

but would give completely different order with some other awk implementations.

My code addresses these remaining issues by:

  • altering the assignment to a so that elements can be ordered (a[++i]=$1)
  • replacing for (i in a) by a guaranteed ordering
  • redefining ORS so the blank lines don't appear

To be safer, my code should check if b[a[j]] exists before printing it:

if (a[j] in b) print b[a[j]]

If b[a[j]] does not exist (ie. f2.txt contains a value not in f1.txt), my script creates an empty element in b. Although this prints as nothing and so is not a problem here, in other scripts adding an extra element like this can introduce subtle bugs.