Just getting started with Airflow and wondering what best practices are for structuring large DAGs. For our ETL, we have a lots of tasks that fall into logical groupings, yet the groups are dependent on each other. Which of the following would be considered best practice?
- One large DAG file with all tasks in that file
- Splitting the DAG definition across multiple files (How to do this?)
- Define multiple DAGs, one for each group of tasks, and set dependencies between them using ExternalTaskSensor
Also open to other suggestions.
DAGs are just python files. So you could split a single dag definition into multiple files. The different files should just have methods that take in a dag object and create tasks using that dag object.
Note though, you should just a single dag object in the global scope. Airflow picks up all dag objects in the global scope as separate dags.
It is often considered good practice to keep each dag as concise as possible. However if you need to set up such dependencies you could either consider using subdags. More about this here: https://airflow.incubator.apache.org/concepts.html?highlight=subdag#scope
You could also use ExternalTaskSensor but beware that as the number of dags grow, it might get harder to handle external dependencies between tasks. I think subdags might be the way to go for your use case.