I want to extract the information from a large file based on multiple conditions (from the same file) as well as pattern searching from other small file, Following is the script I used:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{a[$0]++;next}$1 in a {print $2,$4,$5}' file2.txt file1.txt >output.txt
Now, I want to use the condition in the same awk script that ONLY print the line where the element of 4th column (any one character amongst the ATGC) matches the element of 5th column (any one character amongst the ATGC); both the column is in file 1.
Hence, in a way, I want to merge the following script with the script mentioned above:
awk '$4 " "==$5{print $2,$4,$5}' file1.txt
Following is the representation of file1.txt:
SNP Name Sample ID GC Score Allele1 - Forward Allele2 - Forward
ARS-BFGL-BAC-10172 834269752 0.9374 A G
ARS-BFGL-BAC-1020 834269752 0.9568 A A
ARS-BFGL-BAC-10245 834269752 0.7996 C C
ARS-BFGL-BAC-10345 834269752 0.9604 A C
ARS-BFGL-BAC-10365 834269752 0.5296 G G
ARS-BFGL-BAC-10591 834269752 0.4384 A A
ARS-BFGL-BAC-10793 834269752 0.9549 C C
ARS-BFGL-BAC-10867 834269752 0.9400 G G
ARS-BFGL-BAC-10951 834269752 0.5453 T T
enter code here
Following is the representation of file2.txt
ARS-BFGL-BAC-10172
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10245
ARS-BFGL-BAC-10345
ARS-BFGL-BAC-10365
ARS-BFGL-BAC-10591
ARS-BFGL-BAC-10793
ARS-BFGL-BAC-10867
ARS-BFGL-BAC-10951
Output should be:
834269752 A A
834269752 C C
834269752 G G
834269752 A A
834269752 C C
834269752 G G
834269752 T T
You can simply use boolean logic, and from your input file it seems you can get away with "normal" input field splitting, which will allow you to get rid of that space in the comparison:
As an example, here is my test
file2.txt
:And here is the result of the command above: