Airflow - BashOperator task in GCP Composer


I am using a BashOperator to run a shell script which launches a Dataproc streaming job. This is a never-ending job. The problem is that the BashOperator task automatically goes into FAILED status after 15-16 hours, but when I check the Dataproc job it is still running. Why is the BashOperator task becoming FAILED? Any resolutions or suggestions would help.

What I tried: from my Airflow DAG I am invoking a BashOperator task like this:

sparkstreaming = BashOperator(
    task_id='sparkstreaming',
    retries=0,
    bash_command='gsutil cp gs://bkt-gcp-spark/start-spark.sh . && bash start-spark.sh',
    dag=dag
)

What I expected to happen: the Airflow UI should always show the status of this task as GREEN.

What actually resulted: the status of the BashOperator task automatically became RED (FAILED) after 15-16 hours, but when I check the Spark job it is still running.


There are 2 answers

Dagang Wei

The behavior you're experiencing with the Airflow BashOperator task failing after a certain period, despite the Dataproc job continuing to run, is likely due to a timeout or resource constraint within the Airflow environment or the specific configuration of the BashOperator.

Airflow tasks can be given a timeout after which they are marked as failed if they haven't completed. This is controlled by the execution_timeout parameter in the task definition. If your BashOperator doesn't set this parameter explicitly, it may be inheriting a timeout from the DAG's default_args or from Airflow's configuration, causing it to fail after a fixed period.

Resolution: Increase the execution_timeout parameter of your BashOperator to a value that exceeds the expected runtime of your Dataproc streaming job, or set it to None if you want to disable the timeout entirely (though this is not recommended for production environments).

from datetime import timedelta

from airflow.operators.bash import BashOperator

sparkstreaming = BashOperator(
    task_id='sparkstreaming',
    retries=0,
    bash_command='gsutil cp gs://bkt-gcp-spark/start-spark.sh . && bash start-spark.sh',
    execution_timeout=timedelta(days=2),  # Example: 2-day timeout
    dag=dag
)
Fahed Sabellioglu

Airflow is built for developing, scheduling, and monitoring batch-oriented workflows. Streaming jobs run indefinitely, so you cannot have an always-running Airflow task, and setting a high timeout is not the solution: it only ties up your Airflow resources for nothing.
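
If the job still has to be launched from Airflow, submit it fire-and-forget so the task finishes as soon as the job is handed off to Dataproc. A minimal sketch using the Google provider's DataprocSubmitJobOperator with asynchronous=True; the project, region, cluster name, and script URI below are placeholders, not values from the question:

from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Fire-and-forget: the task succeeds once the job is submitted to Dataproc;
# it does not wait for the (never-ending) streaming job to finish.
submit_streaming_job = DataprocSubmitJobOperator(
    task_id='submit_streaming_job',
    project_id='my-gcp-project',        # placeholder
    region='us-central1',               # placeholder
    job={
        'reference': {'project_id': 'my-gcp-project'},
        'placement': {'cluster_name': 'my-dataproc-cluster'},  # placeholder
        'pyspark_job': {'main_python_file_uri': 'gs://bkt-gcp-spark/streaming_job.py'},  # placeholder
    },
    asynchronous=True,   # do not poll the job for completion
    dag=dag,
)

Monitoring the job's health then happens on the Dataproc and Cloud Logging side, which is where the log-based alerting mentioned below comes in.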

Instead, use Cloud Build to deploy your workload to the dev, stage, and prod environments, and add appropriate logging so you can set up log-based alerting policies that notify you whenever an issue happens or the job fails.