Run Luigi task that depends on another task


I have one task, SeqrMTToESTask, that depends on another one called SeqrVCFToMTTask.

I ran the first task separately in the terminal and it generated its output file. When I launch the second task, SeqrMTToESTask, I would expect it to check for the output of the first task and, if it is present, take the file and proceed. That is not what happens. Instead, I get an error saying that certain parameters of the first task are missing, e.g.:

luigi.parameter.MissingParameterException: SeqrVCFToMTTask[args=(), kwargs={}]: requires the 'source_paths' parameter to be set

The full command that I use to run the second task is:

python -u gcloud_dataproc/ --cpu-limit 4 --num-executors 1 --hail-version 0.2 
--run-locally luigi_pipeline/ SeqrMTToESTask --local-scheduler 
--dest-file hdfs://.../seqr-loading-test/_SUCCESS_TO_ES --source-path hdfs://.../seqr-loading-test/ 
--spark-home $SPARK_HOME --es-host cp-nodedev1 --es-port 7890 --es-index sample_luigi

So my question is: how should I run a Luigi task with Spark (gcloud_dataproc/ just constructs a command that uses spark-submit) when it depends on another task that has its own required parameters?


1 Answer

Nikita Vlasenko

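Apparently the right way to go was to use a Luigi config file (in my case seqr-loading-local-GRCh37.cfg) that specifies all of the parameters for all of the tasks. Luigi has to instantiate SeqrVCFToMTTask before it can check that task's output, so its required parameters must be available even when the output file already exists. After specifying the parameters for every task in the config file, I was able to run the pipeline in the following way: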

LUIGI_CONFIG_PATH=luigi_pipeline/configs/seqr-loading-local-GRCh37.cfg python 
-u gcloud_dataproc/ --cpu-limit 4 --num-executors 1 --hail-version 0.2 
--run-locally luigi_pipeline/ SeqrMTToESTask --local-scheduler --spark-home $SPARK_HOME
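
For illustration, here is a minimal sketch of how such a config file could be laid out, generated with Python's standard configparser module. In a Luigi config file, each section is named after a task class and each key is named after one of that task's parameters. The parameter names for SeqrMTToESTask are taken from the command-line flags in the question, and source_paths comes from the error message. The value /path/to/input.vcf is a hypothetical placeholder, and the real config file in the repository may contain more sections and keys than shown here.

```python
import configparser
import io

# Section names mirror the Luigi task class names;
# keys mirror the parameter names of each task.
config = configparser.ConfigParser()

# Upstream task: 'source_paths' is the parameter named in the error message.
# The value below is a hypothetical placeholder, not a real path.
config["SeqrVCFToMTTask"] = {
    "source_paths": "/path/to/input.vcf",
}

# Downstream task: values copied from the command line in the question
# (the elided "..." in the HDFS paths is kept as-is).
config["SeqrMTToESTask"] = {
    "source_path": "hdfs://.../seqr-loading-test/",
    "dest_file": "hdfs://.../seqr-loading-test/_SUCCESS_TO_ES",
    "es_host": "cp-nodedev1",
    "es_port": "7890",
    "es_index": "sample_luigi",
}

# Render the config as text; in practice this would be saved to a .cfg file
# and referenced through the LUIGI_CONFIG_PATH environment variable.
buffer = io.StringIO()
config.write(buffer)
cfg_text = buffer.getvalue()
print(cfg_text)
```

With the parameters defined this way, they no longer need to be passed on the command line, which is why the final command above only keeps the task name and the scheduler and Spark options.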