This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
[AWS-MWAA] Sending SIGTERM due to status change detection after task completion #21497
Closed
1 of 2 tasks
Apache Airflow version
2.0.2
What happened
We are running
2.0.2
version of Airflow in AWS MWAA.After a point in time, our MWAA setup started producing errors on ALL tasks asynchronously.
It is not related to the type of task since it is happening in ALL tasks (Plain Python operators and ECS Operators) in ALL Dags.
More specifically:
1. Initially, the task completes successfully and pipeline follows to next tasks as seen in the lines:
2. However after 5 seconds an extra read-state functionality is triggered on the task that somehow thinks that the task's state was marked externally and starts trying to terminate it:
The task though is already terminated so the SIGTERM signal sent times out and an arbitrary ERROR is returned as -9 stating that this could be an out of memory error. This is not the case however since this is just assigned as an explicit code when the process cannot be found as seen here, so it finally errors out with:
I tried removing all custom configuration entries in MWAA as well as at the same time revert DAGs to a previous version that did not produce these errors but without any result.
UPDATE
I found exactly the same issue in #13824 as well as the root cause in #16227 and its PR-Solution #16289
Since MWAA on AWS is stuck on updates is there a suggested way to tackle this except for setting the
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
variable toFalse
which will affect performance?Also strange that this was not happening before and it started now... I also saw that it could be related to Database load... Could this be the case?
What you expected to happen
I would expect the tasks to terminate gracefully after successful completion as shown in:
and not moving on with the state change situation that is marked for termination..
Complete log below:
How to reproduce
This happens for every task even for small Python ShortCircuitOperators like:
Operating System
AWS MWAA - class: mw1.large
Versions of Apache Airflow Providers
As taken from MWAA doc page and requirements link (althought it might be outdated):
Deployment
MWAA
Deployment details
Some custom options have been tried but also reverted to make sure that they are not responsible for the error:
Anything else
It occurs everytime
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: