I am opening a new question because some parameters have changed dramatically since my first question about optimising my bash script (Optimising my script which lookups into a big compressed file).
In short: I want to look up and extract all the lines where the first column of file 1 (a BAM file) matches the first column of a text file (file 2). For bioinformaticians, this means extracting the matching read IDs from the two files. File 1 is a binary compressed 130 GB file; file 2 is a TSV file with 1 billion lines.
Recently a user came up with a very elegant one-liner that combines the decompression of the file with the awk lookup, and it worked very well. With files of this size, however, the lookup has now been running for more than 200 hours (multithreaded).
- Does this "problem" have a name in algorithmics?
- What could be a good way to tackle this challenge? (If possible with simple solutions such as sed, awk, bash...)
Thank you very much.
Edit: Sorry for leaving out the code; since it was behind the link, I thought it would be a duplicate. Here is the one-liner used:
#!/bin/bash
# $1 is the TSV of read IDs: awk loads it into a hash in the BEGIN block,
# then prints every SAM line whose first column (the read ID) is in that hash,
# appending the matching TSV value and the tag "wh_genome".
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
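For context, the TSV of read IDs is passed as the script's first argument and the matches go to standard output, so a run looks roughly like this (script and file names are illustrative):

./lookup.sh /data/read_ids.tsv > matched_reads.tsv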
Think of this as a long comment rather than an answer. The 'merge' method (essentially a sort-merge join, which requires both files to be sorted on the key) can be summarised as: if the two current records don't match, advance one record in the file whose key is smaller; if they do match, record the match and advance one record in the big file.
In pseudocode, this looks something like:
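open BIG (the decompressed BAM stream) and SMALL (the TSV), both sorted on column 1
read one record from each
while neither input is exhausted:
    if BIG.key < SMALL.key:      read the next record from BIG
    else if BIG.key > SMALL.key: read the next record from SMALL
    else:                        # keys match
        output the match
        read the next record from BIG   # BIG may hold several records per read ID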
Quite how you translate that into awk or bash is beyond my meagre knowledge of either.
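For what it's worth, GNU coreutils already implements such a merge join as join(1). A rough sketch of how that could look here, assuming both inputs can be sorted on the read ID under the same collation (paths and file names are illustrative):

export LC_ALL=C          # one consistent byte-wise collation for sort and join

# Sort the billion-line TSV by read ID once (GNU sort spills to temporary
# files; point -T at a disk with enough free space).
sort -t $'\t' -k1,1 -T /scratch/tmp lookup.tsv > lookup.sorted.tsv

# Stream the BAM as SAM, sort it by read ID too, then merge-join on column 1.
samtools view -@ 2 bamfile.bam \
  | sort -t $'\t' -k1,1 -T /scratch/tmp \
  | join -t $'\t' - lookup.sorted.tsv \
  > matched_reads.tsv

Note that samtools' own name sort may not use the same ordering that join expects, which is why the SAM stream is sorted with GNU sort here.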