Scheduler job exit on database dead lock #41428

Open
2 tasks done
TakawaAkirayo opened this issue Aug 13, 2024 · 0 comments
Labels
affected_version:2.7 Issues Reported for 2.7 area:core area:MetaDB Meta Database related issues. area:Scheduler including HA (high availability) scheduler kind:bug This is a clearly a bug

TakawaAkirayo commented Aug 13, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.7.2

What happened?

The scheduler job raises an exception on a database deadlock and exits.

Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/airflow/jobs/scheduler_job_runner.py", line 845, in _execute
self._run_scheduler_loop()
File "/usr/local/lib/python3.8/dist-packages/airflow/jobs/scheduler_job_runner.py", line 991, in _run_scheduler_loop
next_event = timers.run(blocking=False)
File "/usr/lib/python3.8/sched.py", line 151, in run
action(*argument, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/airflow/utils/event_scheduler.py", line 37, in repeat
action(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/airflow/utils/session.py", line 77, in wrapper
return func(*args, session=session, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/airflow/jobs/scheduler_job_runner.py", line 1680, in check_trigger_timeouts
num_timed_out_tasks = session.execute(
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 1717, in execute
result = conn._execute_20(statement, params or {}, execution_options)
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
return meth(self, args_10style, kwargs_10style, execution_options)
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
return connection._execute_clauseelement(
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
ret = self._execute_context(
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1953, in _execute_context
self._handle_dbapi_exception(
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
util.raise_(
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
self.dialect.do_execute(
File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 179, in execute
res = self._query(mogrified_query)
File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 330, in _query
db.query(q)
File "/usr/local/lib/python3.8/dist-packages/MySQLdb/connections.py", line 255, in query
_mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE task_instance SET state=%s, updated_at=%s, trigger_id=%s, next_method=%s, next_kwargs=%s WHERE task_instance.state = %s AND task_instance.trigger_timeout < %s]
[parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, datetime.datetime(2024, 8, 2, 13, 14, 22, 215659), None, 'fail', '{"__var": {"error": "Trigger/execution timeout"}, "__type": "dict"}', <TaskInstanceState.DEFERRED: 'deferred'>, datetime.datetime(2024, 8, 2, 13, 14, 22, 202306, tzinfo=Timezone('UTC')))]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2024-08-02T06:14:22.258-0700] {kubernetes_executor.py:706} INFO - Shutting down Kubernetes executor

What you think should happen instead?

The scheduler should retry this UPDATE when it hits a deadlock instead of exiting, as the trigger already includes a retry on update. Both code paths update the task_instance rows associated with a trigger, so both should handle the deadlock the same way.
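
To illustrate the behaviour I have in mind, here is a minimal sketch of a retry-on-deadlock wrapper. It is not Airflow's code: `FakeOperationalError`, `is_deadlock`, `retry_on_deadlock` and `fake_check_trigger_timeouts` are hypothetical names made up for this example, and the fake exception only mimics the way SQLAlchemy exposes the DBAPI error on `.orig`. An actual fix would more likely reuse whatever retry helper the trigger update path already uses.

```python
import functools
import time

MYSQL_DEADLOCK = 1213  # error code reported in the traceback above


class FakeOperationalError(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.OperationalError."""

    def __init__(self, code, message):
        super().__init__(code, message)
        # SQLAlchemy keeps the underlying DBAPI exception on `.orig`; mimic that.
        self.orig = Exception(code, message)


def is_deadlock(exc):
    """Return True if exc wraps a DBAPI error whose first argument is 1213."""
    orig = getattr(exc, "orig", None)
    return orig is not None and bool(orig.args) and orig.args[0] == MYSQL_DEADLOCK


def retry_on_deadlock(max_attempts=3, delay=0.0, on_retry=None):
    """Retry the wrapped callable when it fails with a deadlock error.

    on_retry is an optional hook called before each retry; with a real
    session it would be something like session.rollback (assumption).
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Re-raise anything that is not a deadlock, and give up
                    # once the last attempt has failed.
                    if not is_deadlock(exc) or attempt == max_attempts:
                        raise
                    if on_retry is not None:
                        on_retry()
                    time.sleep(delay * attempt)  # simple linear backoff

        return wrapper

    return decorator


# Demo: a fake periodic check that deadlocks twice and then succeeds.
calls = {"n": 0}


@retry_on_deadlock(max_attempts=3)
def fake_check_trigger_timeouts():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeOperationalError(
            MYSQL_DEADLOCK,
            "Deadlock found when trying to get lock; try restarting transaction",
        )
    return 4  # pretend four task instances were marked as timed out


result = fake_check_trigger_timeouts()
```

With this wrapper the first two simulated deadlocks are absorbed and the third attempt returns normally, so the scheduler loop would keep running. Errors that are not deadlocks are re-raised immediately.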

How to reproduce

  1. Run at least 2 scheduler instances and at least 2 Triggerer instances
  2. Trigger 20 DAG runs, each with at least one triggered job that runs for more than 1 minute

Operating System

Ubuntu

Versions of Apache Airflow Providers

No response

Deployment

Other

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@TakawaAkirayo TakawaAkirayo added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Aug 13, 2024
@dosubot dosubot bot added area:Scheduler including HA (high availability) scheduler deadlock labels Aug 13, 2024
@eladkal eladkal added area:MetaDB Meta Database related issues. affected_version:2.7 Issues Reported for 2.7 and removed needs-triage label for new issues that we didn't triage yet deadlock labels Aug 20, 2024