Apache Airflow Distributed Processing


I'm confused about the architecture of Apache Airflow.

As far as I know, when you execute an HQL or Sqoop statement in Oozie, Oozie directs the request to the data nodes.

I want to achieve the same thing in Apache Airflow. I want to execute a shell script, an HQL statement, or a Sqoop command, and I want to be sure that my command is executed in a distributed fashion on the data nodes. Airflow has different executor types. What should I do to run commands on different data nodes concurrently?


There are 2 answers

x97Core (BEST ANSWER)

It seems you want to execute your tasks on distributed workers. In that case, consider using CeleryExecutor.

CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, …), change your airflow.cfg to point the executor parameter to CeleryExecutor, and provide the related Celery settings.
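For reference, here is a minimal sketch of the relevant airflow.cfg entries. The exact key names vary between Airflow versions, and the Redis and Postgres URLs below are placeholders for whatever broker and result backend you actually run:

    [core]
    executor = CeleryExecutor

    [celery]
    # Broker that distributes task messages to the workers (Redis assumed here)
    broker_url = redis://my-redis-host:6379/0
    # Backend where Celery stores task state (a database is typical)
    result_backend = db+postgresql://airflow:airflow@my-pg-host/airflow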

See: https://airflow.apache.org/configuration.html#scaling-out-with-celery

Amit Kumar

Oozie is tightly coupled to the Hadoop nodes, and all scripts need to be uploaded to HDFS, whereas Airflow with the CeleryExecutor has a more flexible architecture. With the Celery executor, the same shell script or HQL can be executed concurrently on multiple nodes, or on specific nodes, by routing tasks to the right queues and having particular workers listen on those queues, as the sketch below shows.
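As an illustration, here is a minimal DAG sketch (Airflow 1.x-style imports; the queue names, commands, and paths are hypothetical) that pins each task to a named Celery queue:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # DAG definition; start_date is required, scheduling is manual-only here
    dag = DAG(
        dag_id="distributed_example",
        start_date=datetime(2018, 1, 1),
        schedule_interval=None,
    )

    # Runs only on workers subscribed to "hive_queue",
    # e.g. started with: airflow worker -q hive_queue
    run_hql = BashOperator(
        task_id="run_hql",
        bash_command="hive -f /path/to/script.hql",
        queue="hive_queue",
        dag=dag,
    )

    # Runs only on workers subscribed to "sqoop_queue"
    run_sqoop = BashOperator(
        task_id="run_sqoop",
        bash_command="sqoop import --connect jdbc:mysql://db-host/mydb --table mytable",
        queue="sqoop_queue",
        dag=dag,
    )

    # Run the Sqoop import after the Hive script completes
    run_hql >> run_sqoop

A worker started with "airflow worker -q hive_queue" (in Airflow 2.x, "airflow celery worker -q hive_queue") will only pick up tasks assigned to that queue, which is how you target specific data nodes.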