I am trying to run gcloud beta lifesciences because the Genomics API is deprecated. A lot has changed between the Genomics API and the Life Sciences API.
I ran one of my analysis steps on Google Cloud using beta lifesciences. Here is what I found: (1) wildcards do not work in the command-line options; (2) it is not easy to set the target directory in a command-line option, so I used --env-vars to control where files are copied.
I am now trying to convert the command-line options into a JSON pipeline file, but the Google Cloud help page is hard to understand. Do you have any idea how to convert the following options into a JSON file, so that I can run it with simpler options?
I used a YAML-formatted pipeline file with the Genomics API, but beta lifesciences is totally different.
$ more step03_bwa_mem_genome1.run
#SMALL=
SMALL=chr21.
LIFESCIENCESPATH=/gcloud-shared
#LIFESCIENCESPATH=/mnt
SCRIPTFILENAME=step03_bwa_mem_genome.sh
COHORTID=2_C_222
gcloud beta lifesciences pipelines run \
--logging gs://${BUCKETID}/ExomeSeq/hResults/step03_bwa_mem_genome.${COHORTID}.log \
--regions=asia-northeast1,asia-northeast2,asia-northeast3,asia-east1,asia-east2,asia-south1 \
--boot-disk-size 20 \
--preemptible \
--machine-type n1-standard-1 \
--disk-size "gcloud-shared:10" \
--docker-image asia.gcr.io/thermal-shuttle-199104/centos8-essential-software-genomics-custom-python3:0.4 \
--inputs REFERENCE1=gs://${BUCKETID}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.amb \
--inputs REFERENCE2=gs://${BUCKETID}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.ann \
--inputs REFERENCE3=gs://${BUCKETID}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.bwt \
--inputs REFERENCE4=gs://${BUCKETID}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.fai \
--inputs REFERENCE5=gs://${BUCKETID}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.intervals \
--inputs REFERENCE6=gs://${BUCKETID}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.pac \
--inputs REFERENCE7=gs://${BUCKETID}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.sa \
--inputs SCRIPTFILE=gs://${BUCKETID}/ExomeSeq/${SCRIPTFILENAME} \
--inputs COHORTID=${COHORTID} \
--inputs SAMPLELIST=gs://${BUCKETID}/ExomeSeq/SAMPLELIST.${COHORTID}.lst \
--inputs INPUTFILE1=gs://${BUCKETID}/ExomeSeq/hReads/${COHORTID}_01_1.chr21.fastq.gz \
--inputs INPUTFILE2=gs://${BUCKETID}/ExomeSeq/hReads/${COHORTID}_01_2.chr21.fastq.gz \
--inputs INPUTFILE3=gs://${BUCKETID}/ExomeSeq/hReads/${COHORTID}_02_1.chr21.fastq.gz \
--inputs INPUTFILE4=gs://${BUCKETID}/ExomeSeq/hReads/${COHORTID}_02_2.chr21.fastq.gz \
--inputs INPUTFILE5=gs://${BUCKETID}/ExomeSeq/hReads/${COHORTID}_03_1.chr21.fastq.gz \
--inputs INPUTFILE6=gs://${BUCKETID}/ExomeSeq/hReads/${COHORTID}_03_2.chr21.fastq.gz \
--outputs OUTPUTFILE1=gs://${BUCKETID}/ExomeSeq/hResults/${COHORTID}_01.bam \
--outputs OUTPUTFILE2=gs://${BUCKETID}/ExomeSeq/hResults/${COHORTID}_02.bam \
--outputs OUTPUTFILE3=gs://${BUCKETID}/ExomeSeq/hResults/${COHORTID}_03.bam \
--env-vars REFERENCE1=${LIFESCIENCESPATH}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.amb,\
REFERENCE2=${LIFESCIENCESPATH}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.ann,\
REFERENCE3=${LIFESCIENCESPATH}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.bwt,\
REFERENCE4=${LIFESCIENCESPATH}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.fai,\
REFERENCE5=${LIFESCIENCESPATH}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.intervals,\
REFERENCE6=${LIFESCIENCESPATH}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.pac,\
REFERENCE7=${LIFESCIENCESPATH}/ExomeSeq/hReference/GRCh38.primary_assembly.genome.${SMALL}fa.sa,\
SCRIPTFILE=${LIFESCIENCESPATH}/ExomeSeq/${SCRIPTFILENAME},\
SAMPLELIST=${LIFESCIENCESPATH}/ExomeSeq/SAMPLELIST.${COHORTID}.lst,\
INPUTFILE1=${LIFESCIENCESPATH}/ExomeSeq/hReads/${COHORTID}_01_1.chr21.fastq.gz,\
INPUTFILE2=${LIFESCIENCESPATH}/ExomeSeq/hReads/${COHORTID}_01_2.chr21.fastq.gz,\
INPUTFILE3=${LIFESCIENCESPATH}/ExomeSeq/hReads/${COHORTID}_02_1.chr21.fastq.gz,\
INPUTFILE4=${LIFESCIENCESPATH}/ExomeSeq/hReads/${COHORTID}_02_2.chr21.fastq.gz,\
INPUTFILE5=${LIFESCIENCESPATH}/ExomeSeq/hReads/${COHORTID}_03_1.chr21.fastq.gz,\
INPUTFILE6=${LIFESCIENCESPATH}/ExomeSeq/hReads/${COHORTID}_03_2.chr21.fastq.gz,\
OUTPUTFILE1=${LIFESCIENCESPATH}/ExomeSeq/hResults/${COHORTID}_01.bam,\
OUTPUTFILE2=${LIFESCIENCESPATH}/ExomeSeq/hResults/${COHORTID}_02.bam,\
OUTPUTFILE3=${LIFESCIENCESPATH}/ExomeSeq/hResults/${COHORTID}_03.bam \
--command-line="find ${LIFESCIENCESPATH}; /bin/bash ${LIFESCIENCESPATH}/ExomeSeq/${SCRIPTFILENAME} ${COHORTID} 4"
I would generally recommend using something like Cromwell, Nextflow, or Snakemake rather than using either API directly. They have more built-in functionality for these types of tasks.
However, the output from
gcloud beta lifesciences operations describe <operation name>
will include the pipeline definition that gcloud created, which could be used as a starting point. One thing you'll notice there is that --inputs and --outputs automatically create environment variables, so the LIFESCIENCESPATH variable and the --env-vars parameter are unnecessary, which will simplify the command line significantly.
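As a rough illustration of that simplification, here is a sketch of the same submission with --env-vars and LIFESCIENCESPATH dropped. It assumes that --inputs and --outputs export REFERENCE1, SCRIPTFILE, OUTPUTFILE1 and so on inside the container, as described above. The my-bucket default is a placeholder, the loops are only my way of shortening the repeated flags, and the script prints the arguments unless RUN=1 is set. I have not verified where gcloud places the localized files, so treat this as a starting point rather than a tested command.

```shell
#!/bin/sh
# Sketch only: builds the argument list for "gcloud beta lifesciences pipelines run"
# without --env-vars. File names and flag values are taken from the question.
set -eu

BUCKETID="${BUCKETID:-my-bucket}"  # placeholder default; export your real bucket name
SMALL=chr21.
COHORTID=2_C_222
SCRIPTFILENAME=step03_bwa_mem_genome.sh
BASE="gs://${BUCKETID}/ExomeSeq"
REF="${BASE}/hReference/GRCh38.primary_assembly.genome.${SMALL}fa"

# Collect the arguments in "$@" so each value stays a single quoted word.
set -- \
  --logging "${BASE}/hResults/step03_bwa_mem_genome.${COHORTID}.log" \
  --regions=asia-northeast1,asia-northeast2,asia-northeast3,asia-east1,asia-east2,asia-south1 \
  --boot-disk-size 20 \
  --preemptible \
  --machine-type n1-standard-1 \
  --disk-size "gcloud-shared:10" \
  --docker-image asia.gcr.io/thermal-shuttle-199104/centos8-essential-software-genomics-custom-python3:0.4 \
  --inputs "SCRIPTFILE=${BASE}/${SCRIPTFILENAME}" \
  --inputs "COHORTID=${COHORTID}" \
  --inputs "SAMPLELIST=${BASE}/SAMPLELIST.${COHORTID}.lst"

# Reference index files become REFERENCE1..REFERENCE7.
i=1
for ext in amb ann bwt fai intervals pac sa; do
  set -- "$@" --inputs "REFERENCE${i}=${REF}.${ext}"
  i=$((i + 1))
done

# Paired reads become INPUTFILE1..INPUTFILE6; one BAM per sample becomes OUTPUTFILE1..3.
n=1
for s in 01 02 03; do
  for r in 1 2; do
    set -- "$@" --inputs "INPUTFILE${n}=${BASE}/hReads/${COHORTID}_${s}_${r}.chr21.fastq.gz"
    n=$((n + 1))
  done
  set -- "$@" --outputs "OUTPUTFILE${s#0}=${BASE}/hResults/${COHORTID}_${s}.bam"
done

# Single quotes: ${SCRIPTFILE} and ${COHORTID} must expand inside the container,
# not in the local shell that submits the job.
set -- "$@" --command-line='/bin/bash "${SCRIPTFILE}" "${COHORTID}" 4'

# One argument per line, for inspection before submitting.
PREVIEW=$(printf '%s\n' gcloud beta lifesciences pipelines run "$@")

if [ "${RUN:-0}" = 1 ]; then
  gcloud beta lifesciences pipelines run "$@"
else
  printf '%s\n' "$PREVIEW"
fi
```

Note that step03_bwa_mem_genome.sh would then have to read its paths from these variables rather than from a fixed directory. bwa in particular expects its index files to sit side by side under one shared prefix, so the script may need to link or copy the REFERENCE files into one directory before aligning.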