Sorting one file based on the order of another file in awk

Question

Sorting one file based on the order of another file in awk

105 views Asked by Abdel Ra'ouf Badawi At 18 August 2023 at 23:54

I have the following files: f1.txt:

f2.txt:

1  
2   
3

I want to sort f1.txt according to f2.txt first column match, expected output is below

I tried something like

awk '{if(NR==FNR) {a[$1]=$0} else {b[$1]=$0} for(i in b) print a[i]}' f2.txt f1.txt

But the results I'm getting are not what I expected.

Appreciate if someone can help

Original Q&A

There are 3 answers

**Ed Morton** · Answer 1 · 2023-08-19 15:13:46

Using any awk:

$ awk -v ORS= '
    NR==FNR { a[$1] = a[$1] $0 RS; next }
    { print a[$1] }
' f1.txt f2.txt
1 1
1 2
2 5
2 6
3 3
3 4

**ufopilot** · Answer 2 · 2023-08-19 09:21:27

awk '
    FNR==NR { a[NR]=$1; next }
    { b[$1] = (b[$1] ? b[$1]"\n" $0 : $0) }
    END{ for (i=1; i<=length(a); i++) print b[a[i]] }
' f2.txt f1.txt 

1 1  
1 2
2 5
2 6
3 3
3 4

**jhnc** · Answer 3 · 2023-08-19 09:35:58

Generalising @ufopilot's answer to handle non-numbers:

awk '
    FNR==NR { a[++i]=$1; next }
    { b[$1] = b[$1] $0 RS }
    END { ORS=""; for (j=1; j<=i; j++) print b[a[j]] }
' f2.txt f1.txt

Note that if f2.txt has duplicated entries, corresponding lines will also be printed multiple times. If that is not desired, clear b[a[j]] after it is first printed.

If f1.txt is too big to fit in memory but f2.txt will fit comfortably, then you can decorate the lines using awk, then use sort to reorder, then cut off the extra column:

awk '
    FNR==NR { o[$1]=++i; next }
    { print o[$1] "\t" $0 }
' f2.txt f1.txt |
sort -s -n -k1,1 |
cut -f 2-

In this version, if f2.txt contains duplicated entries, only the final duplicate is used. So

will be treated as:

1
2
3

Reformating your code:

awk '
    {
        if (NR==FNR) {
            a[$1]=$0
        } else {
            b[$1]=$0
        }
        for (i in b) print a[i]
    }
' f2.txt f1.txt

There are several issues.

for (i in b) can return elements of b in arbitrary order (gawk should return a repeatable ordering, but POSIX-standard awk may not)
you should be doing for (i in a) anyway (otherwise you are ignoring the f2.txt ordering entirely)
print a[i] tries to print lines from f2.txt but you want lines from f1.txt
you overwrite b[i] when you should append to it
the for loop runs for every line but should only run at the end

Applying fixes for these issues might give:

gawk '
    {
        if (NR==FNR) {
            a[$1]
        } else {
            b[$1] = b[$1] $0 RS
        }
    }
    END {
        for (i in a) print b[i]
    }
' f2.txt f1.txt

gives:

but would give completely different order with some other awk implementations.

My code addresses these remaining issues by:

altering the assignment to a so that elements can be ordered (a[++i]=$1)
replacing for (i in a) by a guaranteed ordering
redefining ORS so the blank lines don't appear

To be safer, my code should check if b[a[j]] exists before printing it:

if (a[j] in b) print b[a[j]]

If b[a[j]] does not exist (ie. f2.txt contains a value not in f1.txt), my script creates an empty element in b. Although this prints as nothing and so is not a problem here, in other scripts adding an extra element like this can introduce subtle bugs.

TechQA.

Sorting one file based on the order of another file in awk

There are 3 answers

Related Questions in FILE

Related Questions in SORTING

Related Questions in AWK

Popular Questions

Popular Tags

Trending Questions