[AIRFLOW-4883] Kill hung file process managers #5605
Travis has some strange Docker issues, which might be due to the recent testing overhaul. I'll merge this myself after I get an LGTM from another committer, but will wait for the CI issues to be resolved first to be safe.
lgtm
This looks like a Docker Hub problem: https://status.docker.com/pages/533c6539221ae15e3f000031 Not related to the overhaul :)
I will restart the job when it is resolved.
Failures are now test failures, not Docker issues.
"killing it.",
processor.file_path, processor.pid, processor.start_time.isoformat())
Stats.incr('dag_file_processor_timeouts', 1, 1)
processor.kill()
This line seems to cause the scheduler to crash. There is no .kill(). Did you mean .terminate()?
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/airflow/airflow/utils/dag_processing.py", line 581, in helper
    processor_manager.start()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 825, in start
    self.start_in_sync()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 891, in start_in_sync
    simple_dags = self.heartbeat()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 1153, in heartbeat
    self._kill_timed_out_processors()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 1259, in _kill_timed_out_processors
    processor.kill()
AttributeError: 'DagFileProcessor' object has no attribute 'kill'
This was a rebase issue: a lot of the code got refactored, and kill is a new method. I'll fix it and add an integration test to make sure that this code path is actually run.
Fix is here: #5639
travis retest this
Previous PR (#5605) was missing some code after a rebase. This adds the code and adds unit tests.
@tooptoop4 that's probably a better question for the release manager (it should be included along with the rebase fix here: 5d6d029). Not sure this PR warrants disrupting the current release attempt though.
Previous PR (apache#5605) was missing some code after a rebase. This adds the code and adds unit tests
(cherry picked from commit d2b35e8)
Previous PR (apache#5605) was missing some code after a rebase. This adds the code and adds unit tests (cherry picked from commit 5d6d029)
Since we had to restart to fix another bug anyway, I did pull this (and the fix commits) into the release branch.
@ashb should the Jira issue https://issues.apache.org/jira/browse/AIRFLOW-4883 have fixversion=1.10.4?
@tooptoop4 yeah, I hadn't gone through and updated fix versions for everything I'd cherry-picked lately. I have now.
PRs apache#5615 and apache#5605 fought a bit over this change, and this is hard (but not impossible) to test, so we didn't notice.
(cherry picked from commit d2b35e8)
PRs apache#5615 and apache#5605 fought a bit over this change, and this is hard (but not impossible) to test, so we didn't notice. (cherry picked from commit 3e3c0cd)
Make sure you have checked all steps below.
Jira
Description
DAG processing processes are now timed out externally, using the DAG parsing timeout, which is more robust.
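For readers unfamiliar with the pattern, here is a minimal sketch of what "timed out externally" can look like: a supervisor tracks each child's start time and kills any child that outlives the timeout, instead of relying on the child to time itself out. This is not the Airflow implementation. TrackedProcess, kill_timed_out, and start_sleeper are hypothetical names, and a plain subprocess stands in for a DAG file processor.

```python
import subprocess
import sys
import time
from dataclasses import dataclass, field


@dataclass
class TrackedProcess:
    """Hypothetical stand-in for a DAG file processor (not an Airflow class)."""
    file_path: str
    popen: subprocess.Popen
    start_time: float = field(default_factory=time.monotonic)


def kill_timed_out(processes, timeout_seconds, now=None):
    """Kill every still-running process older than timeout_seconds.

    `now` can be injected so callers (and tests) do not have to sleep.
    Returns the file paths whose processes were killed.
    """
    now = time.monotonic() if now is None else now
    killed = []
    for proc in processes:
        still_running = proc.popen.poll() is None
        if still_running and now - proc.start_time > timeout_seconds:
            proc.popen.kill()
            proc.popen.wait()  # reap the child so it does not linger as a zombie
            killed.append(proc.file_path)
    return killed


def start_sleeper(file_path, seconds):
    """Start a child that just sleeps, simulating a hung file processor."""
    popen = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )
    return TrackedProcess(file_path=file_path, popen=popen)
```

A supervisor loop would call kill_timed_out on every heartbeat and then start a replacement for each path it returns.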
Tests
Updated existing tests. Also tested locally with a DAG that timed out and made sure the process got restarted and the log line appeared. It's a bit hard to make a good unit test for this since sleeping is a testing smell, and I don't think there is a trivial way to mock this in the Airflow tests.
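One possible way to avoid sleeping in such a test, sketched here under assumptions rather than taken from the Airflow test suite, is to pass the current time in explicitly and use mock objects for the processors. kill_timed_out_processors below is a hypothetical free function written for this illustration; it only assumes each processor exposes start_time and kill().

```python
from datetime import datetime, timedelta
from unittest import mock


def kill_timed_out_processors(processors, timeout, now):
    """Hypothetical helper: kill every processor older than `timeout`."""
    for processor in processors:
        if now - processor.start_time > timeout:
            processor.kill()


def test_hung_processor_is_killed():
    start = datetime(2019, 7, 1, 12, 0, 0)
    # One processor started 60s before "now", one started 10s before "now".
    hung = mock.Mock(start_time=start)
    fresh = mock.Mock(start_time=start + timedelta(seconds=50))

    kill_timed_out_processors(
        [hung, fresh],
        timeout=timedelta(seconds=30),
        now=start + timedelta(seconds=60),
    )

    hung.kill.assert_called_once_with()
    fresh.kill.assert_not_called()
```

Because the clock is an argument, the test runs instantly and exercises the kill code path directly.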
Commits
Documentation
Code Quality
flake8