I'm trying to run DeepVariant on my BAM file to produce a VCF. I have the following questions:
1 - The alignment is in GRCh38; which model should I use? Can I use the standard whole exome sequencing model? ('gs://deepvariant/models/DeepVariant/0.7.0/DeepVariant-inception_v3-0.7.0+data-wes_standard')
2 - Which BED file should I use to specify the exome regions? Is there a standard one? I found one here that I am using now ("CDS-canonical.bed"): https://github.com/AstraZeneca-NGS/reference_data/tree/master/hg38/bed
3 - I'm using the Verily GRCh38 genome. Is there a standard GRCh38 reference available on Google Genomics? This is the one I have: --ref gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa
I've set up my script as follows; please let me know if it makes sense:
#!/bin/bash
set -euo pipefail
# Set common settings.
PROJECT_ID=valis-194104
OUTPUT_BUCKET=gs://canis/CNR-data
STAGING_FOLDER_NAME=deep_variant_files
OUTPUT_FILE_NAME=TLE_a_001.vcf
# Model for calling whole exome sequencing data.
MODEL=gs://deepvariant/models/DeepVariant/0.7.0/DeepVariant-inception_v3-0.7.0+data-wes_standard
IMAGE_VERSION=0.7.0
DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
--project ${PROJECT_ID} \
--zones us-west1-b \
--docker_image ${DOCKER_IMAGE} \
--outfile ${OUTPUT_BUCKET}/${OUTPUT_FILE_NAME} \
--staging ${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME} \
--model ${MODEL} \
--regions gs://canis/CNR-data/CDS-canonical.bed \
--bam gs://canis/CNR-data/TLE_a_001_R_2014_09_17_16_35_30_user_WAL-19-TLE_17_09_2014_Auto_user_WAL-19-TLE_17_09_2014_57.bam \
--ref gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa \
--gcsfuse"
# Run the pipeline.
gcloud alpha genomics pipelines run \
--project "${PROJECT_ID}" \
--service-account-scopes="https://www.googleapis.com/auth/cloud-platform" \
--logging "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
--zones us-west1-b \
--docker-image gcr.io/deepvariant-docker/deepvariant_runner:"${IMAGE_VERSION}" \
--command-line "${COMMAND}"
EDIT:
I added a .bam.bai file (BAM index) generated with samtools, but I still get an error:
Traceback (most recent call last):
File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 862, in <module>
run()
File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 845, in run
_run_make_examples(pipeline_args)
File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 340, in _run_make_examples
_wait_for_results(threads, results)
File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 352, in _wait_for_results
result.get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
RuntimeError: Job failed with error "run": operation "projects/valis-194104/operations/13939489157244551677" failed: executing pipeline: Execution failed: action 5: unexpected exit status 1 was not ignored (reason: FAILED_PRECONDITION)
details:
1- The model works on any version of the reference genome. You do need to make sure your BAM file matches the reference genome you provide, that is, the BAM was aligned to that same reference.
2- It depends on where your exome BAM file comes from and what the corresponding capture region BED is. Sometimes running
samtools view -H
on the BAM file will tell you which capture region was used to generate it.
3- I just took a quick look through this: it should work. There are a few common failure modes that we're hoping to handle more robustly in the future: for example, I think currently there's an assumption that you need to have a corresponding BAI index file named *.bam.bai under the same directory. The safest thing is to provide a
--bai
flag pointing to your BAI file (like the example in https://cloud.google.com/genomics/docs/tutorials/deepvariant). Similarly, this pipeline will fail if it can't find an index file for the FASTA file. It seems like gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa.fai exists, so that one should be covered.
Let us know if you end up encountering any issues. We hope to improve the usability of both DeepVariant and the Google Cloud runner, so your feedback is very valuable to us.
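As a rough sketch of the index checks mentioned in point 3, the bash helpers below derive the conventional index path for a BAM or FASTA file and test whether it exists before the pipeline is launched. The function names (index_path_for, object_exists, check_index) are made up for illustration and are not part of DeepVariant or the runner. The sketch assumes the indexes sit next to the data files and use the .bam.bai / .fa.fai naming; gs:// paths are checked with gsutil -q stat, anything else is treated as a local file.

```shell
#!/bin/bash
# Hypothetical helpers (not part of DeepVariant): pre-flight checks for index files.
# Assumption: indexes live next to the data files as <file>.bam.bai / <file>.fa.fai.

# Print the conventional index path for a BAM or FASTA path.
index_path_for() {
  case "$1" in
    *.bam) printf '%s.bai\n' "$1" ;;
    *.fa|*.fasta) printf '%s.fai\n' "$1" ;;
    *) echo "unrecognized file type: $1" >&2; return 1 ;;
  esac
}

# Succeed if the object exists: gsutil for gs:// paths, a file test otherwise.
object_exists() {
  case "$1" in
    gs://*) gsutil -q stat "$1" ;;
    *) [ -f "$1" ] ;;
  esac
}

# Report whether the index for the given BAM/FASTA is present.
# Returns non-zero if the index is missing or the file type is unknown.
check_index() {
  local idx
  idx=$(index_path_for "$1") || return 1
  if object_exists "$idx"; then
    echo "found index: $idx"
  else
    echo "missing index: $idx" >&2
    return 1
  fi
}
```

One way to use this is to call check_index on the --bam and --ref paths at the top of the launch script and stop if either call fails. If the BAM index is present, its path (the output of index_path_for) can then be passed explicitly through the --bai flag in COMMAND, as suggested above.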
In the future, also feel free to use our GitHub issues page for any questions or discussions. Our team closely monitors all issues there: https://github.com/google/deepvariant/issues
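On point 1 (the BAM has to match the reference), one possible way to check this is to compare the contig names and lengths in the BAM header against the FASTA index. The sketch below is only an illustration: contigs_missing_from_ref is a made-up helper name, and it assumes you have already saved the header text to a file (for example with samtools view -H) and have a local copy of the .fai file.

```shell
#!/bin/bash
# Hypothetical helper (not part of DeepVariant or samtools).
# Prints "name<TAB>length" for every @SQ contig in the BAM header that is
# absent from the FASTA index or has a different length there.
#   $1: file containing the BAM header text (e.g. saved from samtools view -H)
#   $2: FASTA index file (.fai); column 1 is the contig name, column 2 its length
contigs_missing_from_ref() {
  awk -F'\t' '
    # First file on the command line (.fai): remember each contig length.
    FILENAME == ARGV[1] { ref[$1] = $2; next }
    # Second file (header): inspect the SN and LN tags of each @SQ line.
    $1 == "@SQ" {
      name = ""; len = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SN:/) name = substr($i, 4)
        if ($i ~ /^LN:/) len = substr($i, 4)
      }
      if (!(name in ref) || ref[name] != len) print name "\t" len
    }
  ' "$2" "$1"
}
```

If the function prints nothing, every contig in the BAM header has a same-named, same-length counterpart in the reference. Any output (for instance "1" versus "chr1" style names, or differing lengths) suggests the BAM was aligned to a different reference build than the one passed to --ref.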