-
Apache Airflow version2.0.2 What happenedWe are running After a point in time, our MWAA setup started producing errors on ALL tasks asynchronously. More specifically: 1. Initially, the task completes successfully and pipeline follows to next tasks as seen in the lines:
2. However after 5 seconds an extra read-state functionality is triggered on the task that somehow thinks that the task's state was marked externally and starts trying to terminate it:
The task though is already terminated so the SIGTERM signal sent times out and an arbitrary ERROR is returned as -9 stating that this could be an out of memory error. This is not the case however since this is just assigned as an explicit code when the process cannot be found as seen here, so it finally errors out with:
I tried removing all custom configuration entries in MWAA as well as at the same time revert DAGs to a previous version that did not produce these errors but without any result. UPDATEI found exactly the same issue in #13824 as well as the root cause in #16227 and its PR-Solution #16289 Since MWAA on AWS is stuck on updates is there a suggested way to tackle this except for setting the Also strange that this was not happening before and it started now... I also saw that it could be related to Database load... Could this be the case? What you expected to happenI would expect the tasks to terminate gracefully after successful completion as shown in:
and not moving on with the state change situation that is marked for termination.. Complete log below:
How to reproduceThis happens for every task even for small Python ShortCircuitOperators like: def validate_rate_config(bucket: str, key: str) -> bool:
s3 = S3Hook(aws_conn_id="aws_default")
return s3.check_for_key(key, bucket)
rate_config_validator = ShortCircuitOperator(
dag=dag_instance,
task_id=f"rate_config_validator_i{index}",
op_kwargs={"bucket": rates_s3_bucket, "key": inputs_key_prefix + f"/input_{index}.json"},
python_callable=validate_rate_config,
retries=2,
) Operating SystemAWS MWAA - class: mw1.large Versions of Apache Airflow ProvidersAs taken from MWAA doc page and requirements link (althought it might be outdated):
DeploymentMWAA Deployment detailsSome custom options have been tried but also reverted to make sure that they are not responsible for the error:
Anything elseIt occurs everytime Are you willing to submit PR?
Code of Conduct
|
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 9 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template! |
Beta Was this translation helpful? Give feedback.
-
Also MWAA supports 2.2. line of Airflow (with no easy migration though) so I'd suggest this one also as a possibilitty - Airflow 2.2 has quite a number of improvements and fixes comparing to 2.0, not mentioning some new features - and I'd say trying to implement any kind of other workaround might be more costly than migration to 2.2. So if you are considering some investment - I propose to migrate to 2.2. Moving it into discussion as it is not an issue any more. |
Beta Was this translation helpful? Give feedback.
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
variable to False affect performance just a little (it basically makes sure that there are as small gaps between finishing the task and starting the depending tasks). It does not affect "overall" performance of scheduling, just latency of specific cases. So I'd say it's a reasonable workaround.Also MWAA supports 2.2. line of Airflow (with no easy migration though) so I'd suggest this one also as a possibilitty - Airflow 2.2 has quite a number of improvements and fixes comparing to 2.0, not mentioning some new features - and I'd say trying to implement any kind of other workaround might be more costly than migration to 2.2. So if you are c…