Cannot sort VCF with bcftools due to invalid input


I am trying to compress & index a VCF file and am facing several issues.

  1. When I use bgzip/tabix, it throws an error saying it cannot be indexed due to some unsorted values.
# code used to bgzip and tabix
bgzip -c fn.vcf > fn.vcf.gz
tabix -p vcf fn.vcf.gz

# below is the error returned
[E::hts_idx_push] Unsorted positions on sequence #1: 115352924 followed by 115352606
tbx_index_build failed: fn.vcf.gz
  2. When I use bcftools sort to sort this VCF to tackle #1, it throws an error due to invalid entries.
# code used to sort 
bcftools sort -O z --output-file fn.vcf.gz fn.vcf

# below is the error returned
Writing to /tmp/bcftools-sort.YSrhjT
[W::vcf_parse_format] Extreme FORMAT/AD value encountered and set to missing at chr12:115350908
[E::vcf_parse_format] Invalid character '\x0F' in 'GT' FORMAT field at chr12:115352482
Error encountered while parsing the input
Cleaning
  3. I've tried sorting with Linux commands to get around #2. However, when I run the code below, fout.vcf is almost half the size of fin.vcf, which suggests something is going wrong.
grep "^#" fin.vcf > fout.vcf
grep -v "^#" fin.vcf | sort -k1,1V -k2,2n >> fout.vcf

Please let me know if you have any advice regarding:

  • How I could sort/fix the problematic inputs in my VCF in a safe & feasible way. (The file is 340G, so I cannot simply open it and edit it.)
  • Why my Linux sort might be behaving oddly (i.e., returning a file much smaller than the original).

Any comments or suggestions are appreciated!


There is 1 answer

Briana Loredana On

Try this:

mkdir -p tmp   # 1. create a tmp folder in your working directory
tmp=./tmp      # 2. assign the tmp folder (use /yourpath/tmp if it lives elsewhere)
bcftools sort -T "$tmp" -Oz -o file.vcf.gz file.vcf   # 3. sort, keeping temporary files in $tmp
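
A separate temp folder helps when the default /tmp is too small, but it is unlikely to get past the `Invalid character '\x0F'` parse error shown in the question, since that comes from reading the records rather than from sorting them. A minimal sketch for locating such records without opening the 340G file, assuming the damage shows up as control bytes other than tab (`find_bad_lines` is a hypothetical helper name, not part of bcftools):

```shell
# Hypothetical helper: find_bad_lines FILE
# Prints "line_number:CHROM:POS" for every non-header line that contains
# a control byte other than tab (octal 001-010, 013-037, or 177).
find_bad_lines() {
    # LC_ALL=C makes awk compare raw bytes instead of locale characters.
    LC_ALL=C awk -F'\t' '
        !/^#/ && /[\001-\010\013-\037\177]/ { print NR ":" $1 ":" $2 }
    ' "$1"
}
```

Call it as `find_bad_lines fn.vcf`. It streams the file once, so it should be usable on very large inputs. Note that it also flags carriage returns, so a file with Windows line endings would report every record.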

You can index the file once it is sorted:

bcftools index file.vcf.gz
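
On the Linux sort returning a much smaller file: I can't tell from the commands alone, but one possible cause is that GNU grep treats input containing NUL bytes or invalid encoding as binary and suppresses the remaining output lines once it detects that; `-a` forces text mode. A sketch under that assumption, with a line-count check at the end (`sort_vcf_text` and its three arguments are hypothetical names, and `-k1,1V` assumes GNU sort as in the question):

```shell
# Hypothetical helper: sort_vcf_text IN OUT TMPDIR
# Writes the header lines unchanged, then the records sorted by
# chromosome (version order) and position, and checks no line was lost.
sort_vcf_text() {
    in=$1; out=$2; tmpdir=$3

    # -a: treat the input as text even if it contains binary bytes.
    LC_ALL=C grep -a '^#' "$in" > "$out"
    LC_ALL=C grep -a -v '^#' "$in" |
        LC_ALL=C sort -T "$tmpdir" -k1,1V -k2,2n >> "$out"

    # Sorting only reorders lines, so both files should have the same count.
    n_in=$(( $(wc -l < "$in") ))
    n_out=$(( $(wc -l < "$out") ))
    if [ "$n_in" -ne "$n_out" ]; then
        echo "line count mismatch: $n_in in, $n_out out" >&2
        return 1
    fi
}
```

For example, `sort_vcf_text fin.vcf fout.vcf ./tmp`. This only reorders lines: any damaged records are still in the output, so bcftools or tabix may keep rejecting the file until those are dealt with.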