Add retry logic in the scheduler for updating trigger timeouts in case of deadlocks. #41429

Merged


@TakawaAkirayo TakawaAkirayo commented Aug 13, 2024

related: #41428

The scheduler job raises an exception on a database deadlock and exits.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/airflow/jobs/scheduler_job_runner.py", line 845, in _execute
    self._run_scheduler_loop()
  File "/usr/local/lib/python3.8/dist-packages/airflow/jobs/scheduler_job_runner.py", line 991, in _run_scheduler_loop
    next_event = timers.run(blocking=False)
  File "/usr/lib/python3.8/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/airflow/utils/event_scheduler.py", line 37, in repeat
    action(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/airflow/utils/session.py", line 77, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/airflow/jobs/scheduler_job_runner.py", line 1680, in check_trigger_timeouts
    num_timed_out_tasks = session.execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 1717, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1953, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 179, in execute
    res = self._query(mogrified_query)
  File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 330, in _query
    db.query(q)
  File "/usr/local/lib/python3.8/dist-packages/MySQLdb/connections.py", line 255, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE task_instance SET state=%s, updated_at=%s, trigger_id=%s, next_method=%s, next_kwargs=%s WHERE task_instance.state = %s AND task_instance.trigger_timeout < %s]
[parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, datetime.datetime(2024, 8, 2, 13, 14, 22, 215659), None, '__fail__', '{"__var": {"error": "Trigger/execution timeout"}, "__type": "dict"}', <TaskInstanceState.DEFERRED: 'deferred'>, datetime.datetime(2024, 8, 2, 13, 14, 22, 202306, tzinfo=Timezone('UTC')))]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2024-08-02T06:14:22.258-0700] {kubernetes_executor.py:706} INFO - Shutting down Kubernetes executor

Based on analysis of the MySQL query log, this appears to occur when the scheduler and the trigger compete for a row lock. Since the trigger already includes a retry mechanism on update (Trigger.clean_unused), we should add a retry mechanism here as well.
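
As a rough illustration of the retry idea (not the code in this PR), the sketch below retries a database operation when it fails with a deadlock. `DeadlockError`, `run_with_retries`, `flaky_update`, and `fake_rollback` are hypothetical names invented for this example; a real implementation would catch the driver's `OperationalError` and roll back the actual session.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver deadlock error such as MySQL error 1213."""


def run_with_retries(operation, rollback, max_attempts=3, base_delay=0.0):
    """Call operation(); on DeadlockError, roll back and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DeadlockError:
            rollback()  # discard the failed transaction before trying again
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the deadlock
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated UPDATE that deadlocks twice and then succeeds (assumption for the demo).
calls = {"update": 0, "rollback": 0}


def flaky_update():
    calls["update"] += 1
    if calls["update"] < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return 5  # pretend five rows were updated


def fake_rollback():
    calls["rollback"] += 1


num_timed_out_tasks = run_with_retries(flaky_update, fake_rollback)
```

Rolling back before each retry matters because the database has already aborted the deadlocked transaction; re-raising on the final attempt keeps a persistent deadlock visible instead of hiding it.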



@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Aug 13, 2024
@TakawaAkirayo
Contributor Author

TakawaAkirayo commented Sep 10, 2024

From our observations in production, the deadlock issue has not occurred since we applied our own patch to Airflow. It seems that retrying lets the scheduler tolerate the issue to some extent. However, to eliminate it completely, the competing processes need to access the data in the same order. There is still room for further optimization.
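
To illustrate what "the same data access order" means, here is a minimal in-process sketch using threads and locks rather than database rows. The lock keys, `with_rows_locked`, and `worker` are hypothetical and only model the idea: two workers request the same pair of locks in opposite orders, but both acquire them in sorted order, so neither can hold one lock while waiting on the other.

```python
import threading

# Hypothetical "row locks", keyed by a sortable identifier.
row_locks = {
    "task_instance:1": threading.Lock(),
    "trigger:7": threading.Lock(),
}
counter = {"n": 0}


def with_rows_locked(keys, work):
    """Acquire the locks for `keys` in one global (sorted) order, then run work()."""
    ordered = sorted(keys)
    for key in ordered:
        row_locks[key].acquire()
    try:
        return work()
    finally:
        for key in reversed(ordered):
            row_locks[key].release()


def bump():
    counter["n"] += 1  # safe: both locks are held while this runs


def worker(keys, rounds=200):
    for _ in range(rounds):
        with_rows_locked(keys, bump)


# The two workers ask for the locks in opposite orders.
threads = [
    threading.Thread(target=worker, args=(["task_instance:1", "trigger:7"],)),
    threading.Thread(target=worker, args=(["trigger:7", "task_instance:1"],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=10)
```

Without the `sorted` call, each worker could take its first lock and block forever on the second, which is the same circular wait the database reports as a deadlock.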

Please review this when you have time and let me know whether this is the right fix, or if you have any suggestions. @kaxil @ashb @XD-DENG @shahar1

Contributor

@shahar1 shahar1 left a comment


Apologies for the delay, I accidentally overlooked this PR after your fixes - it looks OK, I'd be happy for a second opinion though.
@TakawaAkirayo Could you please add a newsfragment?

@shahar1 shahar1 added this to the Airflow 2.10.3 milestone Sep 29, 2024
@TakawaAkirayo
Contributor Author

Apologies for the delay, I accidentally overlooked this PR after your fixes - it looks OK, I'd be happy for a second opinion though. @TakawaAkirayo Could you please add a newsfragment?

@shahar1 Many thanks for the review. I just added a newsfragment for this (see https://github.com/apache/airflow/blob/main/contributing-docs/16_contribution_workflow.rst). Please take a look.

Contributor

@jscheffl jscheffl left a comment


Approved, but static checks need to be fixed.

@TakawaAkirayo
Contributor Author

Approved, but static checks need to be fixed.

@jscheffl Sure, I've already fixed the static check, and the checks have passed now.

@shahar1 shahar1 merged commit 00589cf into apache:main Oct 2, 2024
6 checks passed

boring-cyborg bot commented Oct 2, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

shahar1 pushed a commit to shahar1/airflow that referenced this pull request Oct 2, 2024
…e of deadlocks. (apache#41429)

* Add retry in update trigger timeout

* add ut for these cases

* use OperationalError in ut to describe deadlock scenarios

* [MINOR] add newsfragment for this PR

* [MINOR] refactor UT for mypy check

(cherry picked from commit 00589cf)
shahar1 added a commit that referenced this pull request Oct 2, 2024
…outs in case of deadlocks. (#41429) (#42651)

* Add retry logic in the scheduler for updating trigger timeouts in case of deadlocks. (#41429)

* Add retry in update trigger timeout

* add ut for these cases

* use OperationalError in ut to describe deadlock scenarios

* [MINOR] add newsfragment for this PR

* [MINOR] refactor UT for mypy check

(cherry picked from commit 00589cf)

* Fix type-ignore comment for typing changes (#42656)

---------

Co-authored-by: TakawaAkirayo <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
joaopamaral pushed a commit to joaopamaral/airflow that referenced this pull request Oct 21, 2024
…e of deadlocks. (apache#41429)
@utkarsharma2 utkarsharma2 added the type:bug-fix Changelog: Bug Fixes label Oct 23, 2024
utkarsharma2 pushed a commit that referenced this pull request Oct 23, 2024
…outs in case of deadlocks. (#41429) (#42651)
utkarsharma2 pushed a commit that referenced this pull request Oct 24, 2024
…outs in case of deadlocks. (#41429) (#42651)
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…e of deadlocks. (apache#41429)
@klacire

klacire commented Nov 19, 2024

After upgrading from 2.9.0 to 2.10.3, the deadlock still exists. It's even worse: now it makes the sensor (in reschedule mode) fail after a random number of pokes.

@potiuk
Member

potiuk commented Nov 19, 2024

After upgrading from 2.9.0 to 2.10.3, the deadlock still exists. It's even worse: now it makes the sensor (in reschedule mode) fail after a random number of pokes.

Please open a new issue, provide all the information about your case, and add a reference to this one as "similar to #41429". It might be the same issue or a different one manifesting in the same way, and the more information you provide, the higher the chance that someone will look at it and try to diagnose and fix it.

When you just comment on a closed issue that "the issue is still not solved", with a very vague description and without details explaining what you mean, the chances that someone will look at it are very slim, almost none. You increase your chances by creating a new issue with as detailed an explanation of your circumstances as possible. It is up to you whether you want to increase your chances of getting help.

@TakawaAkirayo
Contributor Author

@klacire This change primarily makes the scheduler tolerate the failure caused by the deadlock, rather than eliminating the deadlock itself. As noted in the previous comments, we still have work to do to eliminate the deadlock entirely.

Could you please open a new issue and provide your error stack trace? Currently I don't see a definitive connection between your issue and this change.

@klacire

klacire commented Nov 20, 2024

I can already find an issue mentioning the same problem: #41428

@TakawaAkirayo
Contributor Author

@klacire OK, what about this issue:
'now it makes the sensor (in reschedule mode) fail after a random number of poke'
What is its stack trace? What is the direct cause of the sensor's failure?

@klacire

klacire commented Nov 21, 2024

Same root cause, in my opinion. No need to open a duplicate for now.

Labels
area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes