Running DeepVariant on GRCh38 Whole Exome Sequence


I'm trying to run DeepVariant on my BAM file to produce a VCF. I have the following questions:

1 - The alignment is in GRCh38. Which model should I use? Can I use the standard whole exome sequencing model ('gs://deepvariant/models/DeepVariant/0.7.0/DeepVariant-inception_v3-0.7.0+data-wes_standard')?

2 - Which BED file should I use to specify the exome regions? Is there a standard one? I found one here that I am using now ("CDS-canonical.bed"): https://github.com/AstraZeneca-NGS/reference_data/tree/master/hg38/bed

3 - I'm using the Verily GRCh38 genome. Is there a standard GRCh38 reference available on Google Genomics? This is the one I have: --ref gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa

I've set up my script as follows. Please let me know if it makes sense:

#!/bin/bash
set -euo pipefail
# Set common settings.
PROJECT_ID=valis-194104
OUTPUT_BUCKET=gs://canis/CNR-data
STAGING_FOLDER_NAME=deep_variant_files
OUTPUT_FILE_NAME=TLE_a_001.vcf
# Model for calling whole exome sequencing data.
MODEL=gs://deepvariant/models/DeepVariant/0.7.0/DeepVariant-inception_v3-0.7.0+data-wes_standard
IMAGE_VERSION=0.7.0
DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
  --project ${PROJECT_ID} \
  --zones us-west1-b \
  --docker_image ${DOCKER_IMAGE} \
  --outfile ${OUTPUT_BUCKET}/${OUTPUT_FILE_NAME} \
  --staging ${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME} \
  --model ${MODEL} \
  --regions gs://canis/CNR-data/CDS-canonical.bed \
  --bam gs://canis/CNR-data/TLE_a_001_R_2014_09_17_16_35_30_user_WAL-19-TLE_17_09_2014_Auto_user_WAL-19-TLE_17_09_2014_57.bam \
  --ref gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa  \
  --gcsfuse"
# Run the pipeline.
gcloud alpha genomics pipelines run \
    --project "${PROJECT_ID}" \
    --service-account-scopes="https://www.googleapis.com/auth/cloud-platform" \
    --logging "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
    --zones us-west1-b \
    --docker-image gcr.io/deepvariant-docker/deepvariant_runner:"${IMAGE_VERSION}" \
    --command-line "${COMMAND}"

EDIT:

I tried adding a .bam.bai file (BAM index), generated with samtools, next to the BAM.

I still get an error:

  Traceback (most recent call last):
    File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 862, in <module>
      run()
    File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 845, in run
      _run_make_examples(pipeline_args)
    File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 340, in _run_make_examples
      _wait_for_results(threads, results)
    File "/opt/deepvariant_runner/src/gcp_deepvariant_runner.py", line 352, in _wait_for_results
      result.get()
    File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
      raise self._value
  RuntimeError: Job failed with error "run": operation "projects/valis-194104/operations/13939489157244551677" failed: executing pipeline: Execution failed: action 5: unexpected exit status 1 was not ignored (reason: FAILED_PRECONDITION)
details:
1

There is 1 answer

pcchang On

1- The model works with any version of the reference genome. You do need to make sure your BAM file was aligned to the same reference genome you provide with --ref.

2- It depends on where your exome BAM file comes from and what the corresponding capture region BED is. Sometimes running samtools view -H on the BAM file will tell you which capture region was used to generate it.

3- I just took a quick look through this: it should work. There are a few common failure modes that we're hoping to make more robust in the future. For example, I think the runner currently assumes that a BAI index file named *.bam.bai sits in the same directory as the BAM. The safest thing is to provide a --bai flag pointing to your BAI file (like the example in https://cloud.google.com/genomics/docs/tutorials/deepvariant). Similarly, this pipeline will fail if it can't find an index file for the FASTA file. It seems like gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa.fai exists, so that one should be covered.
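
The three checks above can be sketched as small shell helpers. This is only an illustrative sketch, not part of DeepVariant or the Cloud runner: the function names (sq_from_header, capture_hints, expected_indexes) are made up for this example, the keyword list in capture_hints is a guess at what a capture kit mention might look like, and the usage lines assume samtools and gsutil are installed and that sample.bam and reference.fa.fai are local copies with placeholder names.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names come from DeepVariant itself.

# Point 1: print "name<TAB>length" for each @SQ line of a SAM header on stdin.
sq_from_header() {
  awk -F'\t' '$1 == "@SQ" {
    name = ""; len = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)
      if ($i ~ /^LN:/) len = substr($i, 4)
    }
    print name "\t" len
  }'
}

# Point 2: print @RG/@PG/@CO header lines that mention a likely capture keyword.
# The keyword list is an assumption; adjust it to your sequencing provider.
capture_hints() {
  grep -E '^@(RG|PG|CO)' | grep -iE 'bed|target|capture|panel|amplicon' || true
}

# Point 3: print the index paths expected next to a BAM ($1) and a FASTA ($2).
expected_indexes() {
  printf '%s\n' "$1.bai" "$2.fai"
}

# Usage (placeholder file and bucket names):
#   samtools view -H sample.bam | sq_from_header | sort > bam_contigs.txt
#   cut -f1,2 reference.fa.fai | sort > ref_contigs.txt
#   comm -23 bam_contigs.txt ref_contigs.txt   # BAM contigs absent from the reference
#   samtools view -H sample.bam | capture_hints
#   expected_indexes gs://my-bucket/sample.bam gs://my-bucket/reference.fa | xargs -n1 gsutil ls
```

If comm prints anything, the BAM contains contigs (or contig lengths) that the reference does not, which usually means it was aligned against a different build. If gsutil ls fails for the .bam.bai path, upload the index and pass it explicitly with --bai as described above.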

Let us know if you end up encountering any issues. We hope to improve the usability for both DeepVariant and the Google Cloud runner, so your feedback is very valuable to us.

In the future, also feel free to use our GitHub issues page for any questions or discussions. Our team closely monitors all issues there: https://github.com/google/deepvariant/issues