Looking up and extracting a line from a big file matching the lines of another big file

145 views Asked by At

I permitted myself to create a new question as some parameters changed dramatically compared to my first question in my bash script optimising (Optimising my script which lookups into a big compressed file)

In short : I want to lookup, and extract all the lines where the variable of the first column of a file(1) (a bam file) matches the first column of a text file (2). For bioinformaticians, it's actually extracting the matching reads id from two files. File 1 is a binary compressed 130GB file File 2 is a tsv file of 1 billion lines

Recently a user came with a very elegant one liner combining the decompression of the file and the lookup with awk and it worked very well. With the size of the files it is now looking up for more than 200 hours (multithreaded).

  1. Does this "problem" have a name in algorithmics ?
  2. What could be a good way to tackle this challenge ? (If possible with simple solutions such as sed, awk, bash .. )

Thank you a lot

Edit : Sorry for the code, as it was on the link I though it would be a "doublon". Here is the one liner used :

#!/bin/bash

samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}' 
1

There are 1 answers

5
rossum On

Think of this as a long comment rather than an answer. The 'merge sort' method can be summarised as: If two records don't match, advance one record in the file with the smaller record. If they do match then record the match and advance one record in the big file.

In pseudocode, this looks something like:

currentSmall <- readFirstRecord(smallFile)
currentLarge <- readFirstRecord(largeFile)
searching <- true
while (searching)
  if (currentLarge < currentSmall)
    currentLarge <- readNextRecord(largeFile)
  else if (currentLarge = currentSmall)
    //Bingo!
    saveMatchData(currentLarge, currentSmall)
    currentLarge <- readNextRecord(largeFile)
  else if (currentLarge > currentsmall)
    currentSmall <- readNextRecord(smallFile)
  endif

  if (largeFile.EOF or smallFile.EOF)
    searching <- false
  endif
endwhile

Quite how you translate that into awk or bash is beyond my meagre knowledge of either.