Hello StackOverflow community,
I'm dealing with a large VCF file containing default columns and phased genotypes for multiple samples. After processing it with another tool, I end up with a subset of variants and additional metadata for each sample. My goal is to expand the second file to include all variants from the first file, using metadata from the closest variant position.
Additional info for non-bioinformaticians:
- Each row is a variant, with two position details (CHROM and POS).
- When we expand the file, I basically intend to identify the variant from larger file and find which variant it is closest to in the processed file and then copy the metadata to output file.
- Both (original VCF file and output VCF file) are sorted files.
Original VCF file (excerpt):
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
chr1    1001    rs123   A       G       20.5    PASS    DP=30   GT      0/1
chr1    2001    .       T       C       15.2    PASS    DP=25   GT      1/1
chr1    3001    rs456   C       T       18.7    PASS    DP=28   GT      0/0
chr1    4001    .       G       A       22.0    PASS    DP=35   GT      1/0
chr2    5001    rs789   A       T       25.5    PASS    DP=40   GT      1/1
chr2    6001    .       C       G       30.1    PASS    DP=32   GT      0/1
chr2    7001    rs987   T       A       17.3    PASS    DP=27   GT      1/1
chr3    8001    .       G       C       19.8    PASS    DP=22   GT      0/0
chr3    9001    rs654   A       G       21.4    PASS    DP=26   GT      1/0
chr3    10001   .       T       C       23.2    PASS    DP=29   GT      0/1
Processed VCF (excerpt):
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT    Sample1
chr1    1001    rs123   A       G       20.5    PASS    DP=30   GT:X:Y:Z  0/1:0:0:1
chr1    3001    rs456   C       T       18.7    PASS    DP=28   GT:X:Y:Z  0/0:1:3:1
chr2    6001    .       C       G       30.1    PASS    DP=32   GT:X:Y:Z  0/1:2:1:1
chr3    9001    .       T       C       23.2    PASS    DP=29   GT:X:Y:Z  0/1:0:1:0
Desired Final Output:
chr1    1001    rs123   A       G       20.5    PASS    DP=30   GT:X:Y:Z  0/1:0:0:1
chr1    2001    .       T       C       15.2    PASS    DP=25   GT:X:Y:Z  1/1:0:0:1
chr1    3001    rs456   C       T       18.7    PASS    DP=28   GT:X:Y:Z  0/0:1:3:1
chr1    4001    .       G       A       22.0    PASS    DP=35   GT:X:Y:Z  1/0:1:3:1
chr2    5001    rs789   A       T       25.5    PASS    DP=40   GT:X:Y:Z  1/1:2:1:1
chr2    6001    .       C       G       30.1    PASS    DP=32   GT:X:Y:Z  0/1:2:1:1
chr2    7001    rs987   T       A       17.3    PASS    DP=27   GT:X:Y:Z  1/1:2:1:1
chr3    8001    .       G       C       19.8    PASS    DP=22   GT:X:Y:Z  0/0:0:1:0
chr3    9001    rs654   A       G       21.4    PASS    DP=26   GT:X:Y:Z  1/0:0:1:0
chr3    10001   .       T       C       23.2    PASS    DP=29   GT:X:Y:Z  0/1:0:1:0
A few points to consider:
- In case of a tie in length between two variants, randomly pick one for metadata.
- Extend metadata for variants after the last output variant to the last output variant.
Here I use only one sample as an example, but with multiple samples, the metadata needs to be carried within sample only (i.e. don't use metadata from other samples to expand the file).
Given the large dataset (millions of variants, 1000s of samples), I'm looking for the most optimized and memory-efficient approach. Any suggestions or code snippets would be greatly appreciated.
A bcftools approach would be most appreciated, but I wasn't sure how to go about it.
I considered extracting all INFO columns using:
bcftools query -f '%CHROM %POS[\t%GT\t%X\t%Y\t%Z]\n' file.vcf > file_meta.txt
And I was considering processing it using Python from there, but couldn't figure out a code for that.
I appreciate any and all help. Thank you!