Hello StackOverflow community,
I'm dealing with a large VCF file containing default columns and phased genotypes for multiple samples. After processing it with another tool, I end up with a subset of variants and additional metadata for each sample. My goal is to expand the second file to include all variants from the first file, using metadata from the closest variant position.
Additional info for non-bioinformaticians:
- Each row is a variant, with two position details (CHROM and POS).
- When we expand the file, I basically intend to identify the variant from larger file and find which variant it is closest to in the processed file and then copy the metadata to output file.
- Both (original VCF file and output VCF file) are sorted files.
Original VCF file (excerpt):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr1 1001 rs123 A G 20.5 PASS DP=30 GT 0/1
chr1 2001 . T C 15.2 PASS DP=25 GT 1/1
chr1 3001 rs456 C T 18.7 PASS DP=28 GT 0/0
chr1 4001 . G A 22.0 PASS DP=35 GT 1/0
chr2 5001 rs789 A T 25.5 PASS DP=40 GT 1/1
chr2 6001 . C G 30.1 PASS DP=32 GT 0/1
chr2 7001 rs987 T A 17.3 PASS DP=27 GT 1/1
chr3 8001 . G C 19.8 PASS DP=22 GT 0/0
chr3 9001 rs654 A G 21.4 PASS DP=26 GT 1/0
chr3 10001 . T C 23.2 PASS DP=29 GT 0/1
Processed VCF (excerpt):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr1 1001 rs123 A G 20.5 PASS DP=30 GT:X:Y:Z 0/1:0:0:1
chr1 3001 rs456 C T 18.7 PASS DP=28 GT:X:Y:Z 0/0:1:3:1
chr2 6001 . C G 30.1 PASS DP=32 GT:X:Y:Z 0/1:2:1:1
chr3 9001 . T C 23.2 PASS DP=29 GT:X:Y:Z 0/1:0:1:0
Desired Final Output:
chr1 1001 rs123 A G 20.5 PASS DP=30 GT:X:Y:Z 0/1:0:0:1
chr1 2001 . T C 15.2 PASS DP=25 GT:X:Y:Z 1/1:0:0:1
chr1 3001 rs456 C T 18.7 PASS DP=28 GT:X:Y:Z 0/0:1:3:1
chr1 4001 . G A 22.0 PASS DP=35 GT:X:Y:Z 1/0:1:3:1
chr2 5001 rs789 A T 25.5 PASS DP=40 GT:X:Y:Z 1/1:2:1:1
chr2 6001 . C G 30.1 PASS DP=32 GT:X:Y:Z 0/1:2:1:1
chr2 7001 rs987 T A 17.3 PASS DP=27 GT:X:Y:Z 1/1:2:1:1
chr3 8001 . G C 19.8 PASS DP=22 GT:X:Y:Z 0/0:0:1:0
chr3 9001 rs654 A G 21.4 PASS DP=26 GT:X:Y:Z 1/0:0:1:0
chr3 10001 . T C 23.2 PASS DP=29 GT:X:Y:Z 0/1:0:1:0
A few points to consider:
- In case of a tie in length between two variants, randomly pick one for metadata.
- Extend metadata for variants after the last output variant to the last output variant.
Here I use only one sample as an example, but with multiple samples, the metadata needs to be carried within sample only (i.e. don't use metadata from other samples to expand the file).
Given the large dataset (millions of variants, 1000s of samples), I'm looking for the most optimized and memory-efficient approach. Any suggestions or code snippets would be greatly appreciated.
A bcftools approach would be most appreciated, but I wasn't sure how to go about it.
I considered extracting all INFO columns using:
bcftools query -f '%CHROM %POS[\t%GT\t%X\t%Y\t%Z]\n' file.vcf > file_meta.txt
And I was considering processing it using Python from there, but couldn't figure out a code for that.
I appreciate any and all help. Thank you!