Bash: get value of one column based on value of another column

1.6k views Asked by At

I have a space-separated file that contains:

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv
5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.22e-01 7.23e-01 ./dir2/dir2/file2.csv
5.75e-01 7.00e-1 5.02e-01 7.93e-01 ./dir3/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

I would like to extract the value of column 5 that corresponds to the maximum of column 3 for each value of dir#. For example, let's say I'm talking about dir1. That corresponds to these rows:

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv

and I can find these using:

max_val_acc_=$(awk '$5 ~ /dir1/ { print }' filename.txt)
echo $max_val_acc

Now I think I need to pipe this result through a sort and take the head, but I can't get it working. The result I'm looking for (for dir1) is:

./dir1/dir2/file1.csv

and the complete result for all dir#:

./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
./dir3/dir2/file2.csv
5

There are 5 answers

1
James Brown On

I'm not sure I understood you right but this is in awk how I understood you:

awk -v s="dir1" '         # search parameter in your s
index($5,"./" s "/") {    # if your s is found in $5
    if(max==""||$3>max){  # we initialize $3 or compare to previous max
        max=$3;           # store new max
        maxv=$5           # and new mac value 
    }
}
END{ print maxv }         # print the stored max value
' file                    # oh just the file
./dir1/dir2/file1.csv
0
Ed Morton On

Is this what you're looking for?

$ cat tst.awk
{
    split($5,path,"/")
    dir = path[2]
    if ( !(dir in max) || ($3 > max[dir]) ) {
        max[dir] = $3
        val[dir] = $5
    }
}
END {
    for (dir in val) {
        print val[dir]
    }
}

$ awk -f tst.awk file
./dir3/dir2/file2.csv
./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
1
karakfa On

another alternative with sort and awk

$ sort -k5 -k3,3r file | awk -F/ '!a[$NF]++'

5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.42e-01 7.34e-01 ./dir1/dir2/file2.csv

above is for filenames, if based on first dir names

$ sort -k3,3r file | awk '{split($NF,d,"/")} !a[d[2]]++'

5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

and if you want to print the dirs only

$ sort -k3,3r file | awk '{split($NF,d,"/")} !a[d[2]]++{print $NF}'

./dir2/dir2/file1.csv
./dir1/dir2/file1.csv
./dir3/dir2/file2.csv
0
clt60 On

Just for fun - without any (awk, perl like) programming language

file="./data.txt"
paste -d ' ' "$file" <(cut -d/ -f2 "$file") |\
    LC_ALL=C sort -k6 -k3gr | uniq -f5 | cut -d' ' -f5

output

./dir1/dir2/file1.csv
./dir2/dir2/file1.csv
./dir3/dir2/file2.csv

Use some awk solution. As i said, this is only for showing another way.

0
gregory On

While not matching your requested output and similar to the other answers already posted, I find this awk command more memorable:

< file | sort -k3,3r | awk -F "/" '!seen[$2]++'

output:

5.75e-01 7.00e-1 5.72e-01 7.43e-01 ./dir2/dir2/file1.csv
5.75e-01 7.00e-1 5.52e-01 7.33e-01 ./dir1/dir2/file1.csv
5.75e-01 7.00e-1 5.12e-01 7.63e-01 ./dir3/dir2/file2.csv

The more general point being: sort the entire list by size (column 3) and don't worry about sorting it by directory name (part of column 5), then extract the first of each directory name (i.e. awk prints only newly seen key name).

If you really want the output to just the directories names and sorted, then add the following to the pipe chain:

| cut -d ' ' -f5- | sort