I am trying to compress & index a VCF file and am facing several issues.
- When I use bgzip/tabix, it throws an error saying it cannot be indexed due to some unsorted values.
# code used to bgzip and tabix
bgzip -c fn.vcf > fn.vcf.gz
tabix -p vcf fn.vcf.gz
# below is the error returnd
[E::hts_idx_push] Unsorted positions on sequence #1: 115352924 followed by 115352606
tbx_index_build failed: fn.vcf.gz
- When I use
bcftools sort
to sort this VCF to tackle #1, it throws an error due to invalid entries.
# code used to sort
bcftools sort -O z --output-file fn.vcf.gz fn.vcf
# below is the error returned
Writing to /tmp/bcftools-sort.YSrhjT
[W::vcf_parse_format] Extreme FORMAT/AD value encountered and set to missing at chr12:115350908
[E::vcf_parse_format] Invalid character '\x0F' in 'GT' FORMAT field at chr12:115352482
Error encountered while parsing the input
Cleaning
- I've tried sorting using linux commands to get around #2. However, when I run the below code, the size of
fout.vcf
is almost half offin.vcf
, indicating something might be going wrong.
grep "^#" fin.vcf > fout.vcf
grep -v "^#" fin.vcf | sort -k1,1V -k2,2n >> fout.vcf
Please let me know if you have any advice regarding:
- How I could sort/fix the problematic inputs in my VCF in a safe & feasible way. (The file is 340G so I cannot simply open the file and edit.)
- Why my linux
sort
might be behaving in an odd way. (i.e. returning file much smaller than the original.)
Any comments or suggestions are appreciated!
Try this
you can index your file after sorting your file