I'm evaluating Airflow 1.9.0 for our distributed orchestration needs (using CeleryExecutor and RabbitMQ), and I am seeing something strange.
I made a DAG that has three stages:
- start
- fan out and run N tasks concurrently
- finish
N can be large, maybe up to 10K. I would expect to see all N tasks get dumped onto the RabbitMQ queue when stage 2 begins. Instead I am seeing only a few hundred added at a time; as the workers process tasks and the queue shrinks, more get added to Celery/Rabbit. Eventually it does finish, but I would really prefer that it dump ALL the work (all 10K tasks) into Celery immediately, for two reasons:
1. The current way makes the scheduler long-lived and stateful. The scheduler might die after only 5K tasks have completed, in which case the remaining 5K would never get added (I verified this).
2. I want to use the size of the Rabbit queue as a metric to trigger autoscaling events that add more workers, so I need a true picture of how much outstanding work remains (10K, not a few hundred).
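To make the autoscaling idea concrete, here is a minimal sketch of what I have in mind: read the ready-message count from the RabbitMQ management HTTP API and map it to a worker count. The host, port, queue name (`default`), credentials, and the `tasks_per_worker` ratio are all assumptions for illustration, not anything Airflow provides.

```python
# Sketch: scale Celery workers from RabbitMQ queue depth (all names/values
# here are illustrative assumptions, not part of Airflow itself).
import base64
import json
import math
import urllib.request


def queue_depth(host='localhost', port=15672, vhost='%2F', queue='default',
                user='guest', password='guest'):
    """Return the number of ready messages via the RabbitMQ management API."""
    url = 'http://%s:%d/api/queues/%s/%s' % (host, port, vhost, queue)
    req = urllib.request.Request(url)
    token = base64.b64encode(('%s:%s' % (user, password)).encode()).decode()
    req.add_header('Authorization', 'Basic ' + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['messages_ready']


def desired_workers(depth, tasks_per_worker=50, min_workers=2, max_workers=100):
    """Scale worker count linearly with outstanding work, within bounds."""
    return max(min_workers, min(max_workers, math.ceil(depth / tasks_per_worker)))
```

The point is that this only works if `depth` reflects the true backlog (10K), not the few hundred the scheduler has trickled in so far.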
I assume the scheduler has some kind of throttle that keeps it from dumping all 10K messages simultaneously? If so, is this configurable?
FYI I have already set “parallelism” to 10K in the airflow.cfg
Here is my test DAG:
# This DAG tests how well Airflow fans out.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('fan_out', default_args=default_args, schedule_interval=None)

num_tasks = 10000

starting = BashOperator(
    task_id='starting',
    bash_command='echo starting',
    dag=dag)

all_done = BashOperator(
    task_id='all_done',
    bash_command='echo all done',
    dag=dag)

# Fan out: every say_hello task runs after 'starting' and before 'all_done'.
for i in range(num_tasks):
    task = BashOperator(
        task_id='say_hello_' + str(i),
        bash_command='echo hello world',
        dag=dag)
    task.set_upstream(starting)
    task.set_downstream(all_done)
Thanks to those who suggested other concurrency settings. Through trial and error, I learned that I need to set all three of these:
With only these two enabled, I can get to 10K, but it is very slow, adding only 100 new tasks in bursts every 30 seconds, in a stair-step fashion:
If I enable only these two instead, I see the same stair-step pattern, with 128 added every 30 seconds:
But if I set all three, it does add 10K to the queue in one shot.
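For reference, assuming the three knobs in question are the standard concurrency settings in Airflow 1.9's `[core]` section, the airflow.cfg fragment would look something like this (values illustrative, comments mine):

```ini
[core]
# Max task instances running concurrently across the whole installation
parallelism = 10000
# Max task instances running concurrently within a single DAG
dag_concurrency = 10000
# Max task instances that may run without being assigned to an explicit pool
non_pooled_task_slot_count = 10000
```

The scheduler queues up to the smallest of these limits each loop, which would explain the stair-step bursts when only some of them are raised.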