I have a list of VCF files from specific ethnicity such as American Indian, Chinese, European, etc
Under each ethnicity, I have around 100+ files.
Currently, I computed the VARIANT QC
metrics such as call_rate
, n_het
etc for one file as shown in the hail tutorial (refer image below)
However, now I would like to have one file for each ethnicity and then compute VARIANT_QC
metrics.
I already referred to this post and this post but don't think this addresses my query
How can I do this across all files under a specific ethnicity?
Can help me with this?
Is there any hail/python/R/other tools
way to do this?
You could use Variant Transforms to achieve this goal. Variant Transforms is a tool for parsing and importing VCF files into BigQuery. It also can perform the reverse transform: export variants stored in BigQuery tables to VCF file. So basically you need to:
multiple VCF files -> BigQuery -> Single VCF file
Variant Transforms can easily handle multiple input files. It also can perform more complex logic to merge same variants across multiple files into the same record. After your variants are all loaded into BigQuery you could export them to VCF file.
Note that Variant Transforms creates a separate table for each chromosome to optimize query costs. You can easily create a VCF file for each chromosome and then merge them together to create a single one.
You can reach out to Variant Transforms team if you need help with this task.