How to access multiple values of a key in dictionary in awk?

109 views Asked by At

So I am writing code in awk/bash. I have two files, the first of which is in the following format:

chr1_2376428_A_T
chr1_5465765_T_A
chr1_8958392_C_G
....
chrM_237426_C_G

This file spans all chromosomes and has multiple samples per chromosome. Now my second file is a gene annotation file, which has the following format:

chr1 HAVANA Exon 13642 13859 ....
chr1 HAVANA START_CODON 13642 13859 ....
....

I want to search whether values from file 1 fall in between any regions that are depicted by the start ($4) and end ($5) coordinates in the annotation file, given that the chromosomes match.

Sample Text Files:

echo -e chr1_2376428_A_T"/n"chr1_5465765_T_A"/n"chr1_8958392_C_G"/n"chr1_9542312_T_G > coordinates.txt

echo -e chr1"\t"HAVANA"\t"Exon"\t"13642"\t"13859"\n"chr1"\t"HAVANA"\t"START_CODON"\t"13642"\t"13859"\n"chr1"\t"HAVANA"\t"Exon"\t"5465750"\t"5465810"\n"chr1"\t"HAVANA"\t"gene"\t"6465750"\t"6465810"\n"chr1"\t"HAVANA"\t"Exon"\t"8958350"\t"8958461"\n" | gzip -c > gencode.v34.annotation.gtf.gz

I used the following code:

    zcat gencode.v34.annotation.gtf.gz | awk -v File=/pathto/Coordinates.txt 'BEGIN{comm="cat "File; while(comm|getline){split($1,a,"_");dict[a[1]]=a[2];}}{if ($1 in dict){if (dict[$1]>$4 && dict[1]<$5){print $1"_"dict[$1]"_"$3}}}' 

The problem here is that the results I get for each chromosome are the last possible sample per chromosome, which made me think that only the last value of the key (chromosome number) is being compared and/or only being recorded. How can I test this for all values for a key?

Current outcome:

None

Expected outcome:

chr1_5465765_Exon
chr1_8958392_Exon

Thank you


EDIT: here is the above awk code pretty-printed by gawk -o- to be legible:

BEGIN {
        comm = "cat " File
        while (comm | getline) {
                split($1, a, "_")
                dict[a[1]] = a[2]
        }
}

{
        if ($1 in dict) {
                if (dict[$1] > $4 && dict[1] < $5) {
                        print $1 "_" dict[$1] "_" $3
                }
        }
}
1

There are 1 answers

0
jhnc On
  • array[key]=value only allows a single value per key - it is overwritten whenever a new value is set (ie. each dict[a[1]] = a[2])
  • dict[1] seems to be a typo later in the script

Using a bioinformatics-specific tool such as bedtools, as mentioned by @timur-shtatland in the comments, is to be preferred. However, a naive SQL solution with SQLite could be:

$ cat >sqlite-commands <<'EOD'
    create table a (chr, v2, tag, start, end);
.separator \t
.import annotations_file a
    create table b (chr, idx, v3, v4);
.separator _
.import samples_file b
    select b.chr, b.idx, a.tag
        from a join b on a.chr = b.chr
        where a.start < b.idx and b.idx < a.end;
EOD
$ cat -A annotations_file
chr1^IHAVANA^IExon^I13642^I13859$
chr1^IHAVANA^ISTART_CODON^I13642^I13859$
chr1^IHAVANA^IExon^I5465750^I5465810$
chr1^IHAVANA^Igene^I6465750^I6465810$
chr1^IHAVANA^IExon^I8958350^I8958461$
$ cat samples_file
chr1_2376428_A_T
chr1_5465765_T_A
chr1_8958392_C_G
chr1_9542312_T_G
$

Then:

$ sqlite3 '' <sqlite-commands
chr1_5465765_Exon
chr1_8958392_Exon
$

I understand passing '' as the database uses a temporary file and allows processing data that won't fit in memory.


To load data from a compressed file without decompressing to disk, you could use a fifo:

$ mkfifo annotations_fifo
$ zcat annotations_file.gz >annotations_fifo &

and use the name of the fifo (ie. annotations_fifo) in the sqlite import meta-command. (Remember to delete the fifo afterwards.)

Or if using bash and linux, perhaps use process redirection:

$ sqlite3 '' <sqlite-commands 4< <(zcat annotations_file.gz)

and use /proc/self/fd/4 as the filename in the import meta-command: