Bash script using gzip and bcftools running out of memory with large files

434 views Asked by At

This bash script is meant to be part of a pipeline that processes zipped .vcf file that contain genomes from multiple patients (which means the files are huge even when zipped, like 3-5GB).

My problem is that I keep running out of memory when running this script. It is being run in a GCP high mem VM.

I am hoping there is a way to optimize the memory usage so that this doesn't fail. I looked into it but found nothing.

#!/bin/bash

for filename in ./*.vcf.gz; do
    [ -e "$filename" ] || continue 
    name=${filename##*/}
    base=${name%.vcf.gz}
    bcftools query -l "$filename" >> ${base}_list.txt
    for line in `cat ${base}_list.txt`; do 
        bcftools view -s "$line" "$filename" -o ${line}.vcf.gz
        gzip ${line}.vcf 
    done
done
2

There are 2 answers

0
Socowi On BEST ANSWER

If you run out of memory when using bcftools query/view or gzip look for options in the manual that might reduce the memory footprint. In case of gzip you might also switch to an alternative implementation. You could even consider switching the compression algorithm altogether (zstd is pretty good).

However, I have a feeling that the problem could be for line in `cat ${base}_list.txt`;. The whole file ..._list.txt is loaded into memory before the loop even starts. Also, reading lines that way has all kinds of problems, like splitting lines at whitespace, expanding globs like * and so on. Use this instead:

while read -r line; do 
    bcftools view -s "$line" "$filename" -o "$line.vcf.gz"
    gzip "$line.vcf"
done < "${base}_list.txt"

By the way: Are you sure you want bcftools query -l "$filename" >> ${base}_list.txt to append. The file ${base}_list.txt will keep growing each time the script is executed. Consider overwriting the file using > instead of >>.
However, in that case you might not need the file at all as you could use this instead:

bcftools query -l "$filename" |
while read -r line; do 
    bcftools view -s "$line" "$filename" -o "$line.vcf.gz"
    gzip "$line.vcf"
done
0
AudioBubble On

You can try to use split on each file (into a constant size) and then gzip the file splits.

https://man7.org/linux/man-pages/man1/split.1.html