Bash script using gzip and bcftools running out of memory with large files

Question

470 views Asked by NHellmann At 06 January 2025 at 12:38

This bash script is meant to be part of a pipeline that processes gzipped .vcf files containing genomes from multiple patients (which means the files are huge even when compressed, around 3-5 GB each).

My problem is that I keep running out of memory when running this script. It runs on a high-memory GCP VM.

I am hoping there is a way to optimize the memory usage so that this doesn't fail. I looked into it but found nothing.

#!/bin/bash

for filename in ./*.vcf.gz; do
    [ -e "$filename" ] || continue 
    name=${filename##*/}
    base=${name%.vcf.gz}
    bcftools query -l "$filename" >> ${base}_list.txt
    for line in `cat ${base}_list.txt`; do 
        bcftools view -s "$line" "$filename" -o ${line}.vcf.gz
        gzip ${line}.vcf 
    done
done

There are 2 answers

AudioBubble On 21 January 2021 at 22:40

You can try to use split on each file (into a constant size) and then gzip the file splits.

https://man7.org/linux/man-pages/man1/split.1.html
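
As a rough sketch of what that suggestion could look like, the snippet below cuts a stand-in file into fixed-size pieces with split, gzips each piece separately, and checks that the pieces can be put back together. The file names, the generated test data, and the tiny chunk size are all assumptions made for the demo, not part of the original answer. Note that cutting by byte count can land in the middle of a VCF record and only the first piece carries the header, so the pieces are not valid VCF files on their own.

```shell
#!/bin/bash
# Sketch only: paths, test data, and chunk size are illustrative assumptions.
workdir=$(mktemp -d)
input="$workdir/demo.txt"

# Stand-in for a large uncompressed file.
seq 1 20000 > "$input"

# Cut into fixed-size pieces (in bytes), named chunk_aa, chunk_ab, ...
split -b 32768 "$input" "$workdir/chunk_"

# Compress each piece on its own; gzip replaces chunk_aa with chunk_aa.gz.
for piece in "$workdir"/chunk_*; do
    gzip "$piece"
done

# Concatenated gzip members decompress back to the original byte stream.
cat "$workdir"/chunk_*.gz | gzip -dc > "$workdir/restored.txt"
if cmp -s "$input" "$workdir/restored.txt"; then
    roundtrip=ok
else
    roundtrip=mismatch
fi
echo "round trip: $roundtrip"
```

For real data the chunk size would be far larger, and splitting on line boundaries (split -l) may be a better fit for a line-oriented format such as VCF.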

Socowi (accepted answer) On 21 January 2021 at 22:14

If you run out of memory when using bcftools query/view or gzip, look for options in the manual that might reduce the memory footprint. In the case of gzip, you might also switch to an alternative implementation. You could even consider switching the compression algorithm altogether (zstd is pretty good).

However, I suspect the problem could be the loop header for line in `cat ${base}_list.txt`;. The whole file ${base}_list.txt is loaded into memory before the loop even starts. Reading lines that way also has other problems: lines are split at whitespace, glob characters such as * are expanded, and so on. Use this instead:

while read -r line; do 
    bcftools view -s "$line" "$filename" -o "$line.vcf.gz"
    gzip "$line.vcf"
done < "${base}_list.txt"
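
To make the whitespace and glob issue concrete, here is a small self-contained sketch that needs neither bcftools nor real data. The sample names are invented for illustration. It counts how many iterations each loop style performs over the same three-line list.

```shell
#!/bin/bash
# Hypothetical sample list: one name contains a space, one is a bare glob.
list=$(mktemp)
printf '%s\n' 'sample A' '*' 'sample_B' > "$list"

# Unquoted command substitution: the text is split on whitespace and each
# word is then subject to glob expansion in the current directory.
for_count=0
for line in $(cat "$list"); do
    for_count=$((for_count + 1))
done

# read -r: exactly one iteration per line, content left untouched.
read_count=0
last_line=
while IFS= read -r line; do
    read_count=$((read_count + 1))
    last_line=$line
done < "$list"

echo "for loop iterations:   $for_count"
echo "while loop iterations: $read_count"
rm -f "$list"
```

The while loop always runs three times here. The for loop runs at least four times, and more if the current directory contains files that * can match. The added IFS= keeps leading and trailing whitespace in each line, which plain read -r would strip.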

By the way: are you sure you want bcftools query -l "$filename" >> ${base}_list.txt to append? The file ${base}_list.txt will keep growing each time the script is executed, so every sample is processed again on each rerun. Consider overwriting the file using > instead of >>.
However, in that case you might not need the file at all, as you could use this instead:

bcftools query -l "$filename" |
while read -r line; do 
    bcftools view -s "$line" "$filename" -o "$line.vcf.gz"
    gzip "$line.vcf"
done
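
One detail in the loops above is worth a second look: bcftools view writes to "$line.vcf.gz", so the following gzip "$line.vcf" has no file with that name to compress. A possible variant, offered as a sketch rather than a tested fix, asks bcftools for compressed output directly with -Oz (its output-type option for compressed VCF) and drops the gzip step. The helper name split_samples and the per_sample output directory are assumptions introduced here. Writing to a separate directory keeps the per-sample files from being picked up by the ./*.vcf.gz glob on a later run.

```shell
#!/bin/bash
# Sketch only: split_samples and the per_sample/ layout are hypothetical.
split_samples() {
    local filename=$1 outdir=$2
    mkdir -p "$outdir"
    # Stream sample names straight from bcftools; no intermediate list file.
    bcftools query -l "$filename" |
    while IFS= read -r sample; do
        # -Oz requests compressed VCF output, so no separate gzip call is needed.
        bcftools view -s "$sample" -Oz -o "$outdir/${sample}.vcf.gz" "$filename"
    done
}

for filename in ./*.vcf.gz; do
    [ -e "$filename" ] || continue
    name=${filename##*/}
    split_samples "$filename" "./per_sample/${name%.vcf.gz}"
done
```

This still reads the full input once per sample, so it does not by itself guarantee lower memory use. It only removes the list file and the failing gzip call from the picture.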
