-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move Zombie detection to SchedulerJob #21181
Conversation
cc: @potiuk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one looks good (and a good start for AIP-43).
One to add is that we are changing a default value for detection interval from 10 to 30 and there are two questions:
- can it have some unexpected consequences (I think only positive - i.e. less load, 30s is pretty fast for zombie detection IMHO)
- if we agree to change it to 30 s. it should be added to UPDATING.md as a "change" for the upcoming 2.3.0
Since this is change in the core/scheduler I will need second commiter review here. @ashb @uranusjr @ephraimbuddy I think you are the "closest" to scheduler_job and you can think if there are some onforeseen consequences. I thin
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
@@ -89,7 +89,6 @@ class DagBag(LoggingMixin): | |||
""" | |||
|
|||
DAGBAG_IMPORT_TIMEOUT = conf.getfloat('core', 'DAGBAG_IMPORT_TIMEOUT') | |||
SCHEDULER_ZOMBIE_TASK_THRESHOLD = conf.getint('scheduler', 'scheduler_zombie_task_threshold') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I wonder if we can remove this. It’s not used anywhere (even before this PR) and is probably kept for backward compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I didn't find any reference to that in the codebase, so I just removed as unused code.
But if you believe it should be left there, then of course I can revert this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let’s revert to be safe (and add a comment saying this is not actually used anywhere and can be removed in Airflow 3).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be for removing it. There is one other place (local_task_job) where the threshold is used (but directly) so I think this is a leftover from separating this on out. I do not imagine any use of the constant by the users :). Anyone who wants this value should do conf.getint
.
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease. |
Logic makes sense to me although I don’t have much real-world experience to comment on the configuration issue. |
If you believe it would be safer to keep 10s then we may use it. Although it sounds little to often for me. |
I think it's ok to change (I see no serious effects of such change) but note in UPDATING.md will be needed. @ashb - was there a rationale behind the 10s initially ? Or was it just arbitrary chosen without much of a "reasoning" ? |
No idea the source of the 10s -- it was from 2018 in #3873 |
Ah good one. I should have checked before! @KevinYang21 - maybe you remember / have some insights if the 10s were chosen for a good reason? |
If you aren't familiar with "git log pickaxe" it's amazing for things like this http://www.philandstuff.com/2014/02/09/git-pickaxe.html
|
TIL! thanks @ashb ! |
I reverted the interval time to 10 seconds to make it fully backward compatible. |
f198e4b
to
24d3697
Compare
Moves zombie detection method from DagProcessor to SchedulerJob.
More context:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation