-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle partially updated trigger/task_instance data in triggerr jobs #32092
Handle partially updated trigger/task_instance data in triggerr jobs #32092
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add new unit tests for these two changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the case where the task_instance is None, I think this could be fixed/improved in a separate PR, and for the blocked triggers, I think we don't need to continue the loop if the thread is dead (regardless the reason), and also we can add an try/catch to stop the Triggerer when it fails:
diff --git a/airflow/jobs/triggerer_job_runner.py b/airflow/jobs/triggerer_job_runner.py
index 79b5e953dc..7b2670340c 100644
--- a/airflow/jobs/triggerer_job_runner.py
+++ b/airflow/jobs/triggerer_job_runner.py
@@ -351,6 +351,9 @@ class TriggererJobRunner(BaseJobRunner["Job | JobPydantic"], LoggingMixin):
This runs synchronously and handles all database reads/writes.
"""
while not self.trigger_runner.stop:
+ if not self.trigger_runner.is_alive():
+ self.log.error("Trigger runner thread has died! Exiting.")
+ break
# Clean out unused triggers
Trigger.clean_unused()
# Load/delete triggers
@@ -467,17 +470,21 @@ class TriggerRunner(threading.Thread, LoggingMixin):
watchdog = asyncio.create_task(self.block_watchdog())
last_status = time.time()
while not self.stop:
- # Run core logic
- await self.create_triggers()
- await self.cancel_triggers()
- await self.cleanup_finished_triggers()
- # Sleep for a bit
- await asyncio.sleep(1)
- # Every minute, log status
- if time.time() - last_status >= 60:
- count = len(self.triggers)
- self.log.info("%i triggers currently running", count)
- last_status = time.time()
+ try:
+ # Run core logic
+ await self.create_triggers()
+ await self.cancel_triggers()
+ await self.cleanup_finished_triggers()
+ # Sleep for a bit
+ await asyncio.sleep(1)
+ # Every minute, log status
+ if time.time() - last_status >= 60:
+ count = len(self.triggers)
+ self.log.info("%i triggers currently running", count)
+ last_status = time.time()
+ except Exception:
+ self.stop = True
+ raise
# Wait for watchdog to complete
await watchdog
WDYT?
LGTM. A more complete approach to any such errors. Only change I would suggest would be to log any exceptions raised in async thread. Either in the async thread try block, or should we perhaps await the async thread after it dies to pick up any exceptions re-raised? |
Ah, sorry. I guess they're logged when the async thread dies. All good. |
17fe372
to
038e19f
Compare
038e19f
to
39296b7
Compare
I've updated the PR as per suggestions, added a test, and rebased to main. Could I request a review? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM . @hussein-awala ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
@tomrutter Congrats on your first merged PR! |
…2092) Co-authored-by: tom.rutter <[email protected]> (cherry picked from commit e585b58)
closes: #32091
Updates to triggerer job to improve robustness.
Context:
Triggerer runs two threads, one synchronous and one asynchronous. The synchronous thread handles db updates, and the asynchronous thread handles custom trigger code. The heartbeat update for the job happens in the synchronous thread.
Details (two updates):
Added a check for missing associated task_instance when adding a trigger. This prevents a crash in the triggerer async thread when updating with partially updated data. The trigger concerned is ignored, and will later deleted by Trigger.clean_unused() in the synchronous thread.
Added a check that the async thread is still running before updating triggerer heartbeat. This results in the triggerer being reported as unhealthy (and subsequently restarted) when the async thread crashes.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.