Evaluating condition of if statement in awk using a second file

288 views Asked by At

I am trying to write a bash script that manipulates two files:

file1

Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
2q33.2  2:204611195..204817281  rs3096851   204763882   A>C Alopecia Areata 3.58E-08    1.32    802
2q33.2  2:204611195..204817281  rs1024161   204721752   G>A Alopecia Areata 3.55E-13    1.44    802
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682241   123523875   C>A Alopecia Areata 4.27E-08    1.34    803
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

file2

CHR_A         BP_A       SNP_A  CHR_B         BP_B       SNP_B           R2 
 2    204721752   rs1024161      2    204732714    rs231775     0.849535 
 2    204721752   rs1024161      2    204763882   rs3096851      0.68029 
 2    204732714    rs231775      2    204763882   rs3096851     0.739633 
 4    123523875   rs7682241      4    123524026   rs7682481            1

I want to read file1, and if column 3 (RsId) value is not present in either column 3 (SNP_A) or column 4 (SNP_B) of file2, write that row to output. I tried the following:

#this executable file is called filter_file.sh

#!/bin/bash
file1=$1
file2=$2
outfile=$3

while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do 
cat $file2 | awk "(\$3!~/$SNP_A|$SNP_B/) {print}"
done < $file1 > $outfile

./filter_file.sh file1 file2 out

The awk statement worked when I tested it on its own but when I added it to the bash while loop it printed all of file1, including the header, four times. What is wrong with my code for this step?

Once this is working, if file1 column 3 (RsId) value is present in column 3 (SNP_A) or column 4 (SNP_B) of file2, I want to write the row to output that has the lowest value for file1 column 7 (PValue).

I am not sure how to start this second part of the task. From reading other awk questions I was thinking I could try an if statement set up something like this:

#!/bin/bash
file=$1
file2=$2
outfile=$3

while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do 
cat $file2 | awk "{
if ((\$3!~/$SNP_A|$SNP_B/))
    print $0;
else
    #Statement that prints only the row with the lowest value for column 7 
}"
done < $file1 > $outfile

What are some approaches I can use to do this step?

If people can point to some tutorials that may be helpful for these types of problems, that is greatly appreciated.

The desired outputfile will look like this (order does not matter):

Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872
2

There are 2 answers

3
Tom Fenech On BEST ANSWER

There's no need to loop through the file using bash, you can do the whole thing in awk:

$ awk 'NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; next}
       FNR > 1 && !($3 in snp_a || $3 in snp_b)' file2 file1
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

The first block applies to the first file (file2) and sets keys in the arrays corresponding to the two columns of interest. next skips any further commands, so the remainder of the script applies only to the second file (file1). Lines are printed when the condition is true, i.e. the line number in the file (FNR) is greater than 1 and the key cannot be found in either of the two arrays.


For the second part of your problem, things get a little bit more complicated...hopefully the comments explain things:

$ cat script.awk
# first file, save individual columns and pairs
NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; pair[$3,$6]; next}

# second file, print first line
NR != FNR && FNR == 1

# second file, rest of lines
FNR > 1 {
    # print lines which aren't in either array
    if (!($3 in snp_a || $3 in snp_b)) {print}
    # save other lines and corresponding p-value
    else {s[$3] = $0; p[$3] = $8}
}
END {
    # loop through all lines
    for (i in s) {
        # empty the array f
        split("", f)
        # set initial line and min
        line = s[i]
        min = p[i]

        # locate associated lines
        if (i in snp_a) {
            for (j in snp_b) {
                # SUBSEP is a special variable used when combining keys
                # as in first block pair[$3,$6]
                if (i SUBSEP j in pair && j in s && p[j] < min) {
                    min = p[j]
                    line = s[j]
                }
            }
        }
        else if (i in snp_b) {
            for (j in snp_a) {
                if (j SUBSEP i in pair && j in s && p[j] < min) {
                    min = p[j]
                    line = s[j]
                }
            }
        }
        out[line]
    }
    for (i in out) print i
}
$ awk -f script.awk file2 file1
Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803

I'm pretty certain that the logic in the END block can be simplified but I couldn't think of a better way to determine the "pairs". Either way, it achieves your desired output.

4
hek2mgl On

You can use awk for that:

awk 'NR==FNR&&NR>1{r[NR]=$3$6;next}{p=1;for(i in r){if(r[i]~$3){p=0;break}}}p' \
  file2 file1

It is simpler to understand in a multiline version including comments:

# NR (row number) and FNR (current input file's row number) are equal
# only as long as we are processing the first input file.
# Means the following block only runs on file2 which is getting passed first.
# NR > 1 makes sure that we skip the header line.
NR==FNR && NR>1 {
    # Push $3 concatenated with $6 to an array call r (rows)
    r[NR]=$3$6
    # Stop processing this line and use the next. This makes sure
    # that the following block will only operate on file1
    next
}

# The following block runs on every line of file1
{
    # Init p flag
    p=1 
    # Iterate to the previously stored values
    for(i in r) {
        # If the entry of r is not matching field 3
        # Print the current line.
        if(r[i]~$3){
            # If a line matches we reset the p flag
            # and break the loop
            p=0
            break
        }
    }   
}

# Print if p flag isset (print is the default action in awk)
p

Output:

CHR_A         BP_A       SNP_A  CHR_B         BP_B       SNP_B           R2 
Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872