AWS MWAA Airflow task fails with no log and nonsensical duration


I have an MWAA environment running on AWS. Log levels are set to "ERROR" for Tasks, WebServer, Worker, Scheduler and DAG Processor. At random times, tasks simply fail for no apparent reason and with no visible logs. We then see the attached screenshot showing an immensely long duration, but in reality the task only runs for a few seconds or minutes, which is also reflected in the start time (in the Run ID) and the end time. The configured task timeout, by the way, is 2 hours.

Screenshot: failed task information with an implausibly long duration.

What I've tried: Since some of the DAGs contain quite a few tasks, I've reduced their concurrency as far as possible and scheduled them accordingly, because other people have identified high concurrency as a potential source of this error. It is mainly the DAGs with many tasks that fail like this.
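For reference, this is roughly how I'm limiting concurrency, as a minimal sketch with hypothetical DAG and task names (the real DAGs do actual work instead of the placeholder callable):

# Minimal sketch of throttling a task-heavy DAG in Airflow 2.4.x so that the
# mw1.small workers aren't saturated. DAG id, task ids and the callable are
# hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_partition(partition: str) -> None:
    # Placeholder for the real work; the point is to keep per-task memory small.
    print(f"processing {partition}")


with DAG(
    dag_id="example_many_tasks",              # hypothetical name
    start_date=datetime(2023, 11, 1),
    schedule="@daily",
    catchup=False,
    max_active_runs=1,                        # only one run of this DAG at a time
    max_active_tasks=2,                       # at most 2 tasks of this DAG in parallel
) as dag:
    for partition in ["a", "b", "c", "d"]:
        PythonOperator(
            task_id=f"extract_{partition}",
            python_callable=extract_partition,
            op_args=[partition],
            execution_timeout=timedelta(hours=2),  # the 2-hour timeout mentioned above
        )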

There are no task errors, but I can find [WARNING] Worker with pid 13924 was terminated due to signal 15 in the WebServer logs, and

[2023-11-02 10:16:30,563: ERROR/MainProcess] Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-44, started daemon)>
[2023-11-02 10:16:30,591: ERROR/MainProcess] Process 'ForkPoolWorker-44' pid:1492 exited with 'signal 9 (SIGKILL)'

The timestamps of these log entries do not clearly correspond to each other, nor clearly to the failed DAG/task, but they occur roughly within the same hour.

Does anyone know what could cause this problem, how it can be remedied, and in particular how the immensely long duration can come about? I do not want to increase the instance size on AWS for cost reasons. I appreciate any hints.

We have about 40 DAGs which run at least daily, some more often. The DAG runs themselves, when they don't fail, take no longer than a few minutes (< 10 mins).

We are running Airflow 2.4.3 on the mw1.small environment class with a scheduler count of 2.


1 Answer

Answer by it's-yer-boy-chet:

Do you think you're running out of memory on the worker node? I've run into this issue in the past when trying to hold too much data in memory; check the cluster metrics in CloudWatch. In my case I found the issue by looking at the Maximum statistic of the BaseWorker MemoryUtilization metric with a period of 1 minute.
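If you'd rather pull that metric programmatically than click through the console, here's a hedged sketch. The boto3 call itself is standard, but the namespace ("AWS/MWAA"), the dimension names ("Environment Name", "Cluster") and their values are my assumptions about the MWAA container metrics; confirm the exact names in the CloudWatch console for your environment before relying on them.

# Hedged sketch: fetch the Maximum of the worker MemoryUtilization metric for
# the last 3 hours at 1-minute resolution. Namespace and dimensions are
# assumptions -- verify them in the CloudWatch console.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")  # your region

end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/MWAA",                                       # assumed namespace
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "Environment Name", "Value": "my-mwaa-env"},   # hypothetical environment name
        {"Name": "Cluster", "Value": "BaseWorker"},             # assumed dimension/value
    ],
    StartTime=end - timedelta(hours=3),
    EndTime=end,
    Period=60,                                                  # 1-minute period, as above
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])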

The workers in the mw1.small class have 2 GB of memory and 10 GB of disk. As far as I know there's no way to check disk space usage in the metrics, but I think running out of disk could cause a SIGKILL of the worker as well.

If you find you're running out of memory or disk, it's probably because you're trying to do more compute inside Airflow than it's designed for. Maybe try to offload some of the compute to a data warehouse or another AWS service triggered by Airflow, such as EMR, and make sure you're streaming data to S3 rather than caching it locally.
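As a minimal sketch of the "stream to S3 rather than caching locally" idea: process the source in chunks and write each chunk to its own S3 object, so the worker never holds the full dataset in memory or on its 10 GB disk. The bucket, key prefix and file path below are hypothetical.

# Hedged sketch: chunked export to S3 instead of building one big in-memory
# object or local file. Bucket, prefix and source path are placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"          # hypothetical bucket
PREFIX = "exports/2023-11-02"      # hypothetical key prefix


def export_in_chunks(source_csv: str, chunk_rows: int = 50_000) -> None:
    """Read source_csv in fixed-size chunks and upload each chunk as its own object."""
    for i, chunk in enumerate(pd.read_csv(source_csv, chunksize=chunk_rows)):
        body = chunk.to_csv(index=False).encode("utf-8")
        s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/part-{i:05d}.csv", Body=body)

The key point is that the per-task memory footprint stays bounded by the chunk size rather than by the full dataset.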

Sorry, I don't have an explanation for the weird duration metric. Have you looked at the Gantt view to see whether there's an issue with the end timestamp?