Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-4883] Kill hung file process managers #5605

Merged
merged 1 commit into from
Jul 19, 2019

Conversation

aoen
Copy link
Contributor

@aoen aoen commented Jul 18, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-4883
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Dag processing processes are now timed out externally which is more robust using the DAG parsing timeout.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Updated existing tests. Also tested locally with a DAG that timed out and made sure the process got restarted and the logline appeared. It's a bit hard to make a good unit test for this since sleeping is a a testing smell, and I don't think there is a trivial way to mock this in the Airflow tests.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@aoen aoen force-pushed the ddavydov/timeout_dag_parsing branch from dbf50dd to 0a23a4b Compare July 18, 2019 16:29
@aoen
Copy link
Contributor Author

aoen commented Jul 18, 2019

Travis has some strange docker issues, might be due to the recent testing overhaul. I'll merge this myself after I get a LGTM from another committer, but will wait for the CI issues to be resolved first to be safe.

@msumit
Copy link
Contributor

msumit commented Jul 18, 2019

lgtm

@potiuk
Copy link
Member

potiuk commented Jul 18, 2019

This looks like a dockerhub problem: https://status.docker.com/pages/533c6539221ae15e3f000031

Not related to the overhaul :)

@potiuk
Copy link
Member

potiuk commented Jul 18, 2019

I will restart the job when it is resolved .

@ashb
Copy link
Member

ashb commented Jul 18, 2019

Failures are now test failures, not docker issues.

@aoen aoen force-pushed the ddavydov/timeout_dag_parsing branch 2 times, most recently from 383279f to 9aac2e9 Compare July 19, 2019 11:42
@aoen aoen force-pushed the ddavydov/timeout_dag_parsing branch from 9aac2e9 to ff0878c Compare July 19, 2019 12:34
@msumit msumit merged commit d2b35e8 into apache:master Jul 19, 2019
"killing it.",
processor.file_path, processor.pid, processor.start_time.isoformat())
Stats.incr('dag_file_processor_timeouts', 1, 1)
processor.kill()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line seems to cause scheduler to crash. There is no .kill(). Do you mean .terminate() ?

  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/airflow/airflow/utils/dag_processing.py", line 581, in helper
    processor_manager.start()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 825, in start
    self.start_in_sync()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 891, in start_in_sync
    simple_dags = self.heartbeat()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 1153, in heartbeat
    self._kill_timed_out_processors()
  File "/opt/airflow/airflow/utils/dag_processing.py", line 1259, in _kill_timed_out_processors
    processor.kill()
AttributeError: 'DagFileProcessor' object has no attribute 'kill'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a rebase issue since a lot of the code got refactored, kill is a new method. I'll fix and add an integration test to make sure that this code path is actually run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix is here: #5639

@tooptoop4
Copy link
Contributor

travis retest this

ashb pushed a commit that referenced this pull request Jul 24, 2019
Previous PR (#5605) was missing some code after a rebase. This adds the code
and adds unit tests
@tooptoop4
Copy link
Contributor

@aoen @yuqian90 any chance of getting this in 1.10.4?

@aoen
Copy link
Contributor Author

aoen commented Jul 24, 2019

@tooptoop4 probably a better question for the release manager (should be included along with the rebase fix here: 5d6d029). Not sure this PR warrants disrupting the current release attempt though.

serkef pushed a commit to serkef/airflow that referenced this pull request Jul 24, 2019
Previous PR (apache#5605) was missing some code after a rebase. This adds the code
and adds unit tests
wmorris75 pushed a commit to modmed/incubator-airflow that referenced this pull request Jul 29, 2019
wmorris75 pushed a commit to modmed/incubator-airflow that referenced this pull request Jul 29, 2019
Previous PR (apache#5605) was missing some code after a rebase. This adds the code
and adds unit tests
ashb pushed a commit to ashb/airflow that referenced this pull request Jul 30, 2019
ashb pushed a commit to ashb/airflow that referenced this pull request Jul 30, 2019
Previous PR (apache#5605) was missing some code after a rebase. This adds the code
and adds unit tests

(cherry picked from commit 5d6d029)
@ashb
Copy link
Member

ashb commented Jul 30, 2019

Since we had to restart fix another bug anyway I did pull this (and the fix commits) in to the release branch.

@tooptoop4
Copy link
Contributor

@ashb should https://issues.apache.org/jira/browse/AIRFLOW-4883 jira have fixversion=1.10.4 ?

@ashb
Copy link
Member

ashb commented Jul 31, 2019

@tooptoop4 yeah, I hadn't cone through and updated fix versions for everything I'd cherry picked lately. I have now

ashb added a commit to ashb/airflow that referenced this pull request Aug 1, 2019
PRs apache#5615 and apache#5605 and fought a bit over this change, and this is
hard (but not impossible) to test so we didn't notice.
ashb added a commit that referenced this pull request Aug 1, 2019
PRs #5615 and #5605 and fought a bit over this change, and this is
hard (but not impossible) to test so we didn't notice.
ashb added a commit that referenced this pull request Aug 1, 2019
PRs #5615 and #5605 and fought a bit over this change, and this is
hard (but not impossible) to test so we didn't notice.

(cherry picked from commit 3e3c0cd)
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019
Previous PR (apache#5605) was missing some code after a rebase. This adds the code
and adds unit tests

(cherry picked from commit 5d6d029)
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019
PRs apache#5615 and apache#5605 and fought a bit over this change, and this is
hard (but not impossible) to test so we didn't notice.

(cherry picked from commit 3e3c0cd)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants