How to find the X highest values in a list depending on the values of another list with bash/awk?

136 views Asked by At

I found a bit of answers with this topic on this forum : How to find the X lowest values in a list with bash/awk?

But my problem is a little bit different.

This is an exemple of inpout.txt file (a "," delimited text file which can be open as csv file):

ID, Code, Value, Store
SP|01, AABBCDE, 15, 3 
SP|01, AABBCDE, 14, 2 
SP|01, AABBCDF, 13, 2
SP|01, AABBCDE, 16, 3 
SP|02, AABBCED, 15, 2 
SP|01, AABBCDF, 12, 3
SP|01, AABBCDD, 13, 2 
SP|02, AABBCDF, 9, 2
SP|01, AABBCDF, 8, 3

The idea is to print rows with the 2 highest values in "Value" column for identical values in "Code" (And also keep the headers)

Exemple of output files:

ID Code Value Store
SP|01, AABBCDE, 16, 3
SP|01, AABBCDE, 15, 3
SP|02, AABBCED, 15, 2
SP|01, AABBCDD, 13, 2
SP|01, AABBCDF, 13, 2
SP|01, AABBCDF, 12, 3

I'm new to Linux and have a bit of knowledge in very basic use of AWK, sed, grep ect but I'm unsure how to manipulate the file to get the output as stated above.

Any help would be very much appreciated!

1

There are 1 answers

6
Zsolt Botykai On
awk -F "," '
# it's the header
NR == 1 { header=$0 ; OFS=FS ; next }
# check if we stored the code already
! ( $2 in codes ) { codes[$2]=1 }
# store the max and line shift the previous max, skip to next line
$3 >= $2[1] { 
    $2[2] =  $2[1] ; 
    $2[4] = $2[3] ; 
    $2[3] = $0 ; 
    $2[1] = $3 ; 
    next 
}
# store the second greatest value
$3 > $2[2] { 
    $2[2] =  $3 ; 
    $2[4] = $0 
}
# finally print the maxes
END {
print header
for (c in codes) {
    print c[3]
    print c[4]
    }
}' INPUTFILE

Note, this prints the codes in random order, but the value lines (in the codes array) will be sorted (desc.).

UPDATE: the problem with the above code is that GAwk does not support dynamic array names. It can be solved with multi dimensional arrays which GAwk supports in a strange manner, see here: http://objectmix.com/awk/27035-dynamic-array-names.html and Awk array iteration for multi-dimensional arrays and the official docs: http://www.gnu.org/software/gawk/manual/html_node/Multi_002ddimensional.html . I'd recommend going that way.