bash: read each line from a file and use as a regular expression to match and print column awk


I would like to use each line of a file, samples.txt, as a regular expression and print the entire column that matches this from input.txt.

samples.txt

aa
bb
cc

input.txt

s   aa    v    dd    jj    bb    ww    cc
1   1     1    1     2     3     3     8
3   5     4    5     2     7     5     8  

output.txt

aa    bb    cc
1     3     8
5     7     8

I can do these operations separately (reading each line in bash to use as a regular expression, and using a regular expression to print the matching column), but I cannot put them together. Any suggestions?

To print each matching column I can use:

awk 'NR==1 {for(i=1;i<=NF;i++) if ($i~/$line/) f=i;next} {print $f}' input.txt

And to iterate through the file for each line to use as a regular expression as above:

while read line; do echo $line; done < samples.txt

However, I can't put these two together:

while read line; do
    awk 'NR==1 {for(i=1;i<=NF;i++) if ($i~/$line/) f=i;next} {print $f}' input.txt >> output.txt; done < samples.txt

There are 3 answers

8
123 On BEST ANSWER

In awk

awk 'NR==FNR{a[$1]++;next}FNR==1{for(i=1;i<=NF;i++)b[i]=a[$i]}
            {for(i=1;i<=NF;i++)if(b[i])printf "%s\t",$i;print x}' {samples,input}.txt

aa      bb      cc
1       3       8
5       7       8

This collects the samples in an array while reading the first file. Then, on the first line of the second file, it compares each field to the samples and flags the position of every field that matches one.

It then loops over each line, printing only the fields whose positions are flagged in the array.

To remove the trailing tab, following (Kent|Fedorqui|Ed Morton)'s advice (the fields are then separated by OFS, a single space by default):

awk 'NR==FNR{a[$1]++;next}FNR==1{for(i=1;i<=NF;i++)b[i]=a[$i]==1&&last=i}
     {for(i=1;i<=NF;i++)if(b[i])printf "%s",$i (i==last?ORS:OFS)}' {samples,input}.txt
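
For readers who find the one-liners dense, here is a spelled-out sketch of the same idea. The function name select_columns and the temporary files are made up for illustration, and it uses a plain string lookup (in) rather than the counter from the answer, so treat it as one possible reading rather than the answer's exact code.

```shell
# select_columns is a hypothetical helper name, not part of the original answer.
# $1 = file with one column name per line, $2 = whitespace-separated table.
select_columns() {
  awk '
    NR == FNR { wanted[$1]; next }       # first file: remember each sample name
    FNR == 1 {                           # header row of the table
      for (i = 1; i <= NF; i++)
        keep[i] = ($i in wanted)         # 1 if column i was requested, else 0
    }
    {
      sep = ""
      for (i = 1; i <= NF; i++)
        if (keep[i]) { printf "%s%s", sep, $i; sep = "\t" }
      print ""                           # end the output row
    }
  ' "$1" "$2"
}

# Recreate the question's data in a scratch directory (assumed layout).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' aa bb cc > "$workdir/samples.txt"
printf '%s\n' 's aa v dd jj bb ww cc' '1 1 1 1 2 3 3 8' '3 5 4 5 2 7 5 8' > "$workdir/input.txt"

select_columns "$workdir/samples.txt" "$workdir/input.txt"
```

Tracking the separator in sep avoids the trailing tab without needing to know which selected column is last.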
0
fedorqui On

I think it is easier to transpose the input.txt file, print those lines starting with the given words and then transpose back:

$ awk 'FNR==NR {a[$1]; next} $1 in a' samples.txt <(transpose < input.txt) | transpose
aa bb cc
1 3 8
5 7 8

This uses the awk 'FNR==NR {do_things; next} other_things' file1 file2 idiom to perform do_things when reading file1 and other_things when reading file2.

In this case, we load all the names from samples into an array a[]. Then, we go through the transposed input data and check whether each line's first field is in the array. If so, the condition evaluates to true and the line is printed.
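
A minimal sketch of that two-file idiom on its own, using made-up file names (keys.txt, rows.txt) and a hypothetical wrapper filter_rows. The rows are written already transposed, one column of the original table per line:

```shell
# filter_rows is a hypothetical wrapper: $1 = keys file, $2 = data file.
filter_rows() {
  awk '
    FNR == NR { seen[$1]; next }   # first file: store each key, print nothing
    $1 in seen                     # second file: print rows whose first field is a stored key
  ' "$1" "$2"
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' aa cc > "$workdir/keys.txt"
printf '%s\n' 'aa 1 5' 'v 1 4' 'cc 8 8' > "$workdir/rows.txt"

filter_rows "$workdir/keys.txt" "$workdir/rows.txt"
```

FNR restarts at 1 for every file while NR keeps counting, so FNR==NR is only true while the first file is being read (assuming that file is not empty).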

transpose is a function I used in another answer of mine:

transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++) 
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)
              }
        }'
}
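
To see what the intermediate data looks like, the sketch below runs transpose on the question's table and then the whole pipeline. It is a variant under assumptions: it reads the transposed data from standard input via the - operand instead of process substitution, so it should also work in a plain POSIX sh, and the scratch file names are only for illustration.

```shell
# Same transpose function as above, repeated so the example is self-contained.
transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++)
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)
              }
        }'
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' aa bb cc > "$workdir/samples.txt"
printf '%s\n' 's aa v dd jj bb ww cc' '1 1 1 1 2 3 3 8' '3 5 4 5 2 7 5 8' > "$workdir/input.txt"

# Step 1: each column of input.txt becomes one line (s 1 3, aa 1 5, ...).
transpose < "$workdir/input.txt"

# Step 2: keep the lines whose first field is a sample name, then transpose back.
transpose < "$workdir/input.txt" |
  awk 'FNR==NR {a[$1]; next} $1 in a' "$workdir/samples.txt" - |
  transpose
```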
0
Ed Morton On

If you do want a regexp comparison then it's:

$ cat tst.awk
NR==FNR { colNames=(NR>1 ? colNames "|" : "") $0; next }
FNR==1 {
    numCols = 0
    for (i=1; i<=NF; i++) {
        if ( $i ~ "("colNames")" ) {
            colNrs[++numCols] = i
        }
    }
}
{
    for (i=1; i<=numCols; i++) {
        printf "%s%s", $(colNrs[i]), (i<numCols?OFS:ORS)
    }
}

$ awk -f tst.awk samples.txt input.txt
aa bb cc
1 3 8
5 7 8

If instead you actually want a string comparison then:

$ cat tst2.awk
NR==FNR { colNames[$0]; next }
FNR==1 {
    numCols = 0
    for (i=1; i<=NF; i++) {
        if ( $i in colNames ) {
            colNrs[++numCols] = i
        }
    }
}
{
    for (i=1; i<=numCols; i++) {
        printf "%s%s", $(colNrs[i]), (i<numCols?OFS:ORS)
    }
}

$ awk -f tst2.awk samples.txt input.txt
aa bb cc
1 3 8
5 7 8
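
The two scripts give the same output on the question's data, but they can differ when a header merely contains a sample name. The sketch below condenses both scripts into two hypothetical shell functions (by_regexp, by_string) and runs them on a made-up table with an extra column aaa:

```shell
# Regexp comparison: a header is selected if the alternation matches anywhere in it.
by_regexp() {
  awk '
    NR == FNR { re = (NR > 1 ? re "|" : "") $0; next }
    FNR == 1  { n = 0; for (i = 1; i <= NF; i++) if ($i ~ ("(" re ")")) cols[++n] = i }
    { for (i = 1; i <= n; i++) printf "%s%s", $(cols[i]), (i < n ? OFS : ORS) }
  ' "$1" "$2"
}

# String comparison: a header is selected only if it equals a sample name exactly.
by_string() {
  awk '
    NR == FNR { names[$0]; next }
    FNR == 1  { n = 0; for (i = 1; i <= NF; i++) if ($i in names) cols[++n] = i }
    { for (i = 1; i <= n; i++) printf "%s%s", $(cols[i]), (i < n ? OFS : ORS) }
  ' "$1" "$2"
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' aa > "$workdir/samples.txt"
printf '%s\n' 'aa aaa b' '1 2 3' > "$workdir/input.txt"

echo "regexp:"; by_regexp "$workdir/samples.txt" "$workdir/input.txt"   # selects aa and aaa
echo "string:"; by_string "$workdir/samples.txt" "$workdir/input.txt"   # selects aa only
```

If exact header matches are wanted with the regexp version, the lines in samples.txt would need anchors such as ^aa$.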

To run it on multiple input files, just list them all at the end of the awk command line; do not write a shell loop to call awk multiple times.
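
A sketch of that with two made-up input files (input1.txt, input2.txt), using a condensed copy of the string-comparison script wrapped in a hypothetical function. Because the column numbers are recomputed on every FNR==1, each file can have its headers in a different order:

```shell
# select_by_name is a hypothetical wrapper: $1 = samples file, remaining args = input files.
select_by_name() {
  awk '
    NR == FNR { colNames[$0]; next }
    FNR == 1 {                       # runs again at the header of every input file
      numCols = 0
      for (i = 1; i <= NF; i++)
        if ($i in colNames) colNrs[++numCols] = i
    }
    { for (i = 1; i <= numCols; i++) printf "%s%s", $(colNrs[i]), (i < numCols ? OFS : ORS) }
  ' "$@"
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' aa bb > "$workdir/samples.txt"
printf '%s\n' 's aa v' '1 2 3' > "$workdir/input1.txt"
printf '%s\n' 'bb x aa' '7 8 9' > "$workdir/input2.txt"

# One awk invocation handles both inputs; no shell loop needed.
select_by_name "$workdir/samples.txt" "$workdir/input1.txt" "$workdir/input2.txt"
```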