Triggerer intermittent failure when running many triggerers #32091
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
PR #32092 submitted.
It should not happen even if you have multiple triggerers, because each one takes a share of the unassigned triggers (based on the capacity arguments) and locks the rows in the DB until it updates them to add its ID. I think your problem is with this line: airflow/airflow/jobs/triggerer_job_runner.py Line 685 in a1ba155
where your custom trigger doesn't set the task_instance parameter and leaves it as None.
Will check, but wouldn't all triggers fail if that were the case? There are many (>100) identical triggers working and only one that fails (all using the same custom trigger). I notice that if the trigger heartbeat is delayed, it can push the triggers to other triggerers. Could that open the window for the problem above under high load? The custom triggers make one HTTP request using an async call and then parse the JSON response. I don't think they are blocking, but the triggerer sometimes complains of blocking when many are running at once.
I think the task instance is set just below, on line 691.
I will check that and try to reproduce your problem.
Thanks! I appreciate your attention to this! I notice that submit_event also breaks the link between trigger and task, so this might leave a slightly larger window. Could the removal of the link be left to Trigger.clean_unused?
Apache Airflow version
2.6.2
What happened
We are running a DAG with many deferrable tasks using a custom trigger that waits for an Azure Batch task to complete. When many tasks have been deferred, we see an intermittent error in the triggerer. The logged error message is the following:
After this error occurs, the trigger still reports as healthy, but no events are triggered. Restarting the triggerer fixes the problem.
What you think should happen instead
The specific error in the trigger should be addressed to prevent the triggerer async thread from crashing.
The triggerer should not perform heartbeat updates when the async triggerer thread has crashed.
How to reproduce
This occurs intermittently and seems to be the result of running more than one triggerer. Running many deferred tasks eventually leads to this error.
Operating System
linux (standard airflow slim images extended with custom code running on kubernetes)
Versions of Apache Airflow Providers
postgres,celery,redis,ssh,statsd,papermill,pandas,github_enterprise
Deployment
Official Apache Airflow Helm Chart
Deployment details
Azure Kubernetes and helm chart 1.9.0.
2 replicas of both triggerer and scheduler.
Anything else
It seems that as triggers fire, the link between the trigger row and the associated task_instance for the trigger is removed before the trigger row is removed. This leaves a small amount of time where the trigger exists without an associated task_instance. The database updates are performed in a synchronous loop inside the triggerer, so with one triggerer, this is not a problem. However, it can be a problem with more than one triggerer.
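The window described above can be illustrated with a minimal, self-contained sketch. It does not use Airflow's models: `TriggerRow`, `TaskInstanceRow`, and `select_runnable` are hypothetical stand-ins, and the assumption is simply that a second triggerer may read a trigger row after its task instance link has been removed but before the row itself is deleted. Skipping such rows is one possible guard, not necessarily what the linked PR does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskInstanceRow:
    """Hypothetical stand-in for a task instance record."""
    task_id: str


@dataclass
class TriggerRow:
    """Hypothetical stand-in for a trigger record."""
    id: int
    task_instance: Optional[TaskInstanceRow] = None


def select_runnable(rows):
    """Split trigger rows into those safe to start and those to skip.

    A row whose task_instance is None is assumed to be in the window
    described above: the link is gone but the row is not yet deleted.
    Skipping it avoids dereferencing None inside the async loop.
    """
    runnable, skipped = [], []
    for row in rows:
        if row.task_instance is None:
            skipped.append(row.id)
        else:
            runnable.append(row)
    return runnable, skipped


rows = [
    TriggerRow(1, TaskInstanceRow("wait_for_batch_a")),
    TriggerRow(2),  # link already removed by another triggerer
    TriggerRow(3, TaskInstanceRow("wait_for_batch_c")),
]
runnable, skipped = select_runnable(rows)
print([r.id for r in runnable], skipped)
```

With a single triggerer the orphaned row would never be observed, because the unlink and the delete happen in the same synchronous loop.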
Also, once the triggerer async loop (that handles the trigger code) fails, the triggers no longer fire. However, the heartbeat is handled by the synchronous loop so the job still reports as healthy.
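A rough sketch of the second point, using only the standard library: the synchronous loop checks that the thread running trigger code is still alive before sending a heartbeat. `AsyncRunner` and `supervise` are hypothetical names chosen for illustration and are not Airflow classes; the crash is simulated by letting the thread's `run` method return early.

```python
import threading
import time


class AsyncRunner(threading.Thread):
    """Hypothetical stand-in for the thread that runs trigger code."""

    def __init__(self, fail: bool):
        super().__init__(daemon=True)
        self.fail = fail
        self.stop_event = threading.Event()

    def run(self):
        if self.fail:
            return  # simulates the loop exiting after an unhandled error
        self.stop_event.wait()  # keep running until asked to stop


def supervise(runner, heartbeat, max_beats=3, interval=0.01):
    """Synchronous loop: send a heartbeat only while the runner is alive."""
    beats = 0
    for _ in range(max_beats):
        if not runner.is_alive():
            return beats, "runner dead, heartbeat stopped"
        heartbeat()
        beats += 1
        time.sleep(interval)
    return beats, "ok"


beats = []

crashed = AsyncRunner(fail=True)
crashed.start()
crashed.join()  # the simulated crash has already happened
result_crashed = supervise(crashed, lambda: beats.append("crashed"))

healthy = AsyncRunner(fail=False)
healthy.start()
result_healthy = supervise(healthy, lambda: beats.append("healthy"))
healthy.stop_event.set()
healthy.join()

print(result_crashed, result_healthy)
```

In this sketch the job would stop reporting as healthy as soon as the async thread dies, so an external liveness check could restart the triggerer instead of leaving it idle.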
I have included an associated PR to resolve these issues.