Move Zombie detection to SchedulerJob #21181

Merged
potiuk merged 2 commits into apache:main from the move_zombie branch
Feb 10, 2022

Conversation

mhenc
Contributor

@mhenc mhenc commented Jan 28, 2022

Moves zombie detection method from DagProcessor to SchedulerJob.

More context:
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jan 28, 2022
@mhenc mhenc marked this pull request as ready for review January 28, 2022 11:12
@mhenc
Contributor Author

mhenc commented Jan 28, 2022

cc: @potiuk

Member

@potiuk potiuk left a comment

This one looks good (and a good start for AIP-43).

One thing to add: we are changing the default value of the detection interval from 10 to 30 seconds, which raises two questions:

  • can it have some unexpected consequences? (I think only positive ones, i.e. less load; 30s is pretty fast for zombie detection IMHO)
  • if we agree to change it to 30s, it should be added to UPDATING.md as a "change" for the upcoming 2.3.0
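
To put rough numbers on the "less load" point, here is a back-of-the-envelope sketch. It assumes exactly one detection pass per interval per scheduler, which is a simplification, and the function name is purely illustrative rather than anything from Airflow.

```python
def checks_per_hour(interval_seconds: float) -> float:
    """Detection passes in one hour, assuming one pass per interval."""
    return 3600 / interval_seconds


# Compare the existing 10s interval with the proposed 30s one.
for interval in (10, 30):
    print(
        f"{interval}s interval: {checks_per_hour(interval):.0f} checks/hour, "
        f"up to {interval}s of extra delay before a zombie is noticed"
    )
```

Under that assumption, the longer interval cuts the number of passes to a third, at the cost of a somewhat longer worst-case wait before a zombie is picked up.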

Since this is a change in the core/scheduler, I will need a second committer review here. @ashb @uranusjr @ephraimbuddy I think you are the "closest" to scheduler_job and you can think about whether there are some unforeseen consequences. I thin

@github-actions

github-actions bot commented Feb 1, 2022

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Feb 1, 2022
@potiuk potiuk added AIP-43 DAG processor separation AIP-43 and removed full tests needed We need to run full set of tests for this PR to merge labels Feb 1, 2022
@@ -89,7 +89,6 @@ class DagBag(LoggingMixin):
"""

DAGBAG_IMPORT_TIMEOUT = conf.getfloat('core', 'DAGBAG_IMPORT_TIMEOUT')
SCHEDULER_ZOMBIE_TASK_THRESHOLD = conf.getint('scheduler', 'scheduler_zombie_task_threshold')
Member

Hmm, I wonder if we can remove this. It’s not used anywhere (even before this PR) and is probably kept for backward compatibility?

Contributor Author

Right, I didn't find any reference to it in the codebase, so I just removed it as unused code.

But if you believe it should be left there, then of course I can revert this change.

Member

Let’s revert to be safe (and add a comment saying this is not actually used anywhere and can be removed in Airflow 3).

Member

@potiuk potiuk Feb 3, 2022

I would be for removing it. There is one other place (local_task_job) where the threshold is used (but read directly), so I think this is a leftover from separating this one out. I do not imagine any use of the constant by the users :). Anyone who wants this value should do conf.getint.
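
A minimal sketch of the "read it where you need it" approach follows. To stay self-contained it uses the standard-library ConfigParser as a stand-in for Airflow's configuration object, and the value 300 is an illustrative assumption. In Airflow itself the equivalent lookup is the call shown in the diff above, conf.getint('scheduler', 'scheduler_zombie_task_threshold').

```python
from configparser import ConfigParser

# Stand-in for airflow.cfg; the value 300 is only an example.
conf = ConfigParser()
conf.read_string(
    """
[scheduler]
scheduler_zombie_task_threshold = 300
"""
)

# Look the option up at the point of use instead of relying on a
# class-level constant such as DagBag.SCHEDULER_ZOMBIE_TASK_THRESHOLD.
threshold = conf.getint("scheduler", "scheduler_zombie_task_threshold")
print(threshold)
```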

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Feb 3, 2022
@github-actions

github-actions bot commented Feb 3, 2022

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@uranusjr
Member

uranusjr commented Feb 3, 2022

Logic makes sense to me although I don’t have much real-world experience to comment on the configuration issue.

@mhenc
Contributor Author

mhenc commented Feb 3, 2022

If you believe it would be safer to keep 10s, then we may use it, although it sounds a little too often to me.
Nonetheless it's now a configuration option, so users are able to set it to the value they like.

@potiuk
Member

potiuk commented Feb 3, 2022

> If you believe it would be safer to keep 10s, then we may use it, although it sounds a little too often to me. Nonetheless it's now a configuration option, so users are able to set it to the value they like.

I think it's OK to change (I see no serious effects of such a change), but a note in UPDATING.md will be needed. @ashb - was there a rationale behind the 10s initially? Or was it just chosen arbitrarily, without much "reasoning"?

@ashb
Member

ashb commented Feb 3, 2022

No idea the source of the 10s -- it was from 2018 in #3873

@potiuk
Member

potiuk commented Feb 3, 2022

> No idea the source of the 10s -- it was from 2018 in #3873

Ah good one. I should have checked before! @KevinYang21 - maybe you remember / have some insights if the 10s were chosen for a good reason?

@ashb
Member

ashb commented Feb 3, 2022

If you aren't familiar with "git log pickaxe" it's amazing for things like this http://www.philandstuff.com/2014/02/09/git-pickaxe.html

airflow ❯ git log -S _zombie_query_interval --oneline 
897960736 Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie detection (#5511)"
2bdb053db [AIRFLOW-4797] Improve performance and behaviour of zombie detection (#5511)
75e2288a3 [Airflow-2760] Decouple DAG parsing loop from scheduler loop (#3873)

@potiuk
Member

potiuk commented Feb 3, 2022

TIL! thanks @ashb !

@mhenc
Contributor Author

mhenc commented Feb 4, 2022

I reverted the interval time to 10 seconds to make it fully backward compatible.
Since it's a configuration option now, users may decide to bump it up to save some CPU/DB resources.
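
For readers unfamiliar with the mechanism, the sketch below shows the general shape of interval-gated zombie detection. It is not Airflow's implementation: RunningTask, find_zombies, and PeriodicZombieCheck are hypothetical names, and it assumes a zombie is simply a running task whose latest heartbeat is older than a threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class RunningTask:
    """Hypothetical record of a task that is marked as running."""
    task_id: str
    latest_heartbeat: datetime


def find_zombies(
    tasks: List[RunningTask], now: datetime, threshold_seconds: float
) -> List[str]:
    """Return ids of tasks whose last heartbeat is older than the threshold."""
    limit = now - timedelta(seconds=threshold_seconds)
    return [t.task_id for t in tasks if t.latest_heartbeat < limit]


class PeriodicZombieCheck:
    """Runs find_zombies at most once per configured interval."""

    def __init__(self, interval_seconds: float) -> None:
        self.interval = timedelta(seconds=interval_seconds)
        self.last_run: Optional[datetime] = None

    def maybe_run(
        self, tasks: List[RunningTask], now: datetime, threshold_seconds: float
    ) -> Optional[List[str]]:
        # Skip the (potentially expensive) check if the interval has not elapsed.
        if self.last_run is not None and now - self.last_run < self.interval:
            return None
        self.last_run = now
        return find_zombies(tasks, now, threshold_seconds)


start = datetime(2022, 2, 4, 12, 0, 0)
tasks = [
    RunningTask("healthy", start - timedelta(seconds=5)),
    RunningTask("stale", start - timedelta(seconds=600)),
]
check = PeriodicZombieCheck(interval_seconds=10)
first = check.maybe_run(tasks, start, threshold_seconds=300)
skipped = check.maybe_run(tasks, start + timedelta(seconds=3), threshold_seconds=300)
later = check.maybe_run(tasks, start + timedelta(seconds=10), threshold_seconds=300)
print(first, skipped, later)
```

Raising the interval in this sketch only changes how often maybe_run does real work; the heartbeat threshold that defines a zombie stays a separate setting.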

@uranusjr uranusjr changed the title Movie Zombie detection to SchedulerJob Move Zombie detection to SchedulerJob Feb 4, 2022
@mhenc mhenc force-pushed the move_zombie branch 2 times, most recently from f198e4b to 24d3697 Compare February 9, 2022 14:39
@potiuk potiuk merged commit 0abee18 into apache:main Feb 10, 2022
ferruzzi pushed a commit to ferruzzi/airflow that referenced this pull request Feb 11, 2022
@jedcunningham jedcunningham added the type:improvement Changelog: Improvements label Feb 28, 2022
@mhenc mhenc deleted the move_zombie branch March 8, 2022 14:38
Labels
AIP-43 DAG processor separation AIP-43 area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge type:improvement Changelog: Improvements
6 participants