I am trying to write a bash script that manipulates two files:
file1
Region Coords. RsId Position Alleles Disease PValue OddsRatio RegionID
1p13.2 1:113839149..114551845 rs2476601 114377568 G>A Alopecia Areata 8.90E-08 1.34 869
2q13 2:111444884..111809030 rs3789129 111698040 A>C Alopecia Areata 1.50E-08 0.76 871
2q33.2 2:204611195..204817281 rs3096851 204763882 A>C Alopecia Areata 3.58E-08 1.32 802
2q33.2 2:204611195..204817281 rs1024161 204721752 G>A Alopecia Areata 3.55E-13 1.44 802
2q33.2 2:204611195..204817281 rs231775 204732714 A>G Alopecia Areata 2.20E-20 1.39 802
4q27 4:122982314..123605528 rs7682241 123523875 C>A Alopecia Areata 4.27E-08 1.34 803
4q27 4:122982314..123605528 rs7682481 123524026 G>C Alopecia Areata 4.80E-09 1.23 803
5q31.1 5:131783213..132135372 rs848 131996500 C>A Alopecia Areata 4.80E-09 1.27 872
file2
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
2 204721752 rs1024161 2 204732714 rs231775 0.849535
2 204721752 rs1024161 2 204763882 rs3096851 0.68029
2 204732714 rs231775 2 204763882 rs3096851 0.739633
4 123523875 rs7682241 4 123524026 rs7682481 1
I want to read file1, and if column 3 (RsId) value is not present in either column 3 (SNP_A) or column 4 (SNP_B) of file2, write that row to output. I tried the following:
#this executable file is called filter_file.sh
#!/bin/bash
file1=$1
file2=$2
outfile=$3
while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do
cat $file2 | awk "(\$3!~/$SNP_A|$SNP_B/) {print}"
done < $file1 > $outfile
./filter_file.sh file1 file2 out
The awk statement worked when I tested it on its own but when I added it to the bash while loop it printed all of file1, including the header, four times. What is wrong with my code for this step?
Once this is working, if file1 column 3 (RsId) value is present in column 3 (SNP_A) or column 4 (SNP_B) of file2, I want to write the row to output that has the lowest value for file1 column 7 (PValue).
I am not sure how to start this second part of the task. From reading other awk questions I was thinking I could try an if statement set up something like this:
#!/bin/bash
file=$1
file2=$2
outfile=$3
while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do
cat $file2 | awk "{
if ((\$3!~/$SNP_A|$SNP_B/))
print $0;
else
#Statement that prints only the row with the lowest value for column 7
}"
done < $file1 > $outfile
What are some approaches I can use to do this step?
If people can point to some tutorials that may be helpful for these types of problems, that is greatly appreciated.
The desired outputfile will look like this (order does not matter):
Region Coords. RsId Position Alleles Disease PValue OddsRatio RegionID
1p13.2 1:113839149..114551845 rs2476601 114377568 G>A Alopecia Areata 8.90E-08 1.34 869
2q13 2:111444884..111809030 rs3789129 111698040 A>C Alopecia Areata 1.50E-08 0.76 871
2q33.2 2:204611195..204817281 rs231775 204732714 A>G Alopecia Areata 2.20E-20 1.39 802
4q27 4:122982314..123605528 rs7682481 123524026 G>C Alopecia Areata 4.80E-09 1.23 803
5q31.1 5:131783213..132135372 rs848 131996500 C>A Alopecia Areata 4.80E-09 1.27 872
There's no need to loop through the file using bash, you can do the whole thing in awk:
The first block applies to the first file (
file2
) and sets keys in the arrays corresponding to the two columns of interest.next
skips any further commands, so the remainder of the script applies only to the second file (file1
). Lines are printed when the condition is true, i.e. the line number in the file (FNR
) is greater than 1 and the key cannot be found in either of the two arrays.For the second part of your problem, things get a little bit more complicated...hopefully the comments explain things:
I'm pretty certain that the logic in the
END
block can be simplified but I couldn't think of a better way to determine the "pairs". Either way, it achieves your desired output.