I have one task, `SeqrMTToESTask`, that depends on another one called `SeqrVCFToMTTask`. You can see the full code here:

Now, I ran the first task separately in the terminal and it generated its output file, `sample.mt`. When I launch the second task, `SeqrMTToESTask`, I would expect it to check for the output of the first task, `sample.mt`, and, if it is present, pick up the file and proceed. But that is not what happens: instead, I get an error indicating that certain parameters of the first task are missing, e.g.:
luigi.parameter.MissingParameterException: SeqrVCFToMTTask[args=(), kwargs={}]: requires the 'source_paths' parameter to be set
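The dependency is wired up roughly like this (a trimmed sketch, not the pipeline's actual code; the real tasks define more parameters and write to HDFS rather than local targets):

import luigi


class SeqrVCFToMTTask(luigi.Task):
    # Required parameter: the one named in the MissingParameterException.
    source_paths = luigi.Parameter()
    dest_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(self.dest_path)


class SeqrMTToESTask(luigi.Task):
    source_path = luigi.Parameter()
    dest_file = luigi.Parameter()

    def requires(self):
        # Luigi instantiates the upstream task here just to build the
        # dependency graph and call its complete()/output() methods.
        # SeqrVCFToMTTask() is constructed with no arguments, so its
        # required parameters must come from the command line or a
        # config file, even when sample.mt already exists.
        return SeqrVCFToMTTask()

In other words, the exception is raised while constructing `SeqrVCFToMTTask` for the dependency graph, before Luigi ever gets to checking whether its output already exists.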
The full command that I use to run the second task is:
python -u gcloud_dataproc/submit.py --cpu-limit 4 --num-executors 1 --hail-version 0.2 \
    --run-locally luigi_pipeline/seqr_loading.py SeqrMTToESTask --local-scheduler \
    --dest-file hdfs://.../seqr-loading-test/_SUCCESS_TO_ES \
    --source-path hdfs://.../seqr-loading-test/sample.mt \
    --spark-home $SPARK_HOME --es-host cp-nodedev1 --es-port 7890 --es-index sample_luigi
So, my question is the following: how should I run a Luigi task with Spark (`gcloud_dataproc/submit.py` just constructs the `spark-submit` command) when that task depends on another task with its own required parameters?
Apparently the right way to go was to just use a Luigi config file (in my case `seqr-loading-local-GRCh37.cfg`) in which all of the parameters for all of the tasks are specified, as sketched below.
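The config gives each task its own section named after the task class, with one entry per parameter. A rough sketch of what mine looks like (the `source_paths` name comes from the error above and the `SeqrMTToESTask` entries mirror my command-line flags; the `dest_path` name and the input VCF path are illustrative assumptions):

# seqr-loading-local-GRCh37.cfg
[SeqrVCFToMTTask]
# The parameter the MissingParameterException asked for. If it is a
# ListParameter in the real code, it needs JSON list syntax instead.
source_paths = hdfs://.../seqr-loading-test/sample.vcf.gz
# Illustrative assumption: the first task's output path.
dest_path = hdfs://.../seqr-loading-test/sample.mt

[SeqrMTToESTask]
source_path = hdfs://.../seqr-loading-test/sample.mt
dest_file = hdfs://.../seqr-loading-test/_SUCCESS_TO_ES
es_host = cp-nodedev1
es_port = 7890
es_index = sample_luigi

Luigi picks the file up when the LUIGI_CONFIG_PATH environment variable points to it (by default it also looks for a luigi.cfg in the working directory).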
So, after specifying all of the parameters for the tasks, I was able to run it in the following way: