
Avoid littering postgres server logs with "could not obtain lock" with HA schedulers #19842

Merged
ashb merged 7 commits into apache:main from remove-postgres-lock-failed-error on Nov 29, 2021

Conversation

@ashb ashb (Member) commented Nov 26, 2021

If you are running multiple schedulers on PostgreSQL, it is likely that sooner or later you will have one scheduler fail the race to enter the critical section (which is fine, and expected).

However this can end up spamming the DB logs with errors like this:

```
Nov 26 14:08:48 sinope postgres[709953]: 2021-11-26 14:08:48.672 GMT [709953] ERROR:  could not obtain lock on row in relation "slot_pool"
Nov 26 14:08:48 sinope postgres[709953]: 2021-11-26 14:08:48.672 GMT [709953] STATEMENT:  SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
Nov 26 14:08:48 sinope postgres[709953]:         FROM slot_pool FOR UPDATE NOWAIT
Nov 26 14:08:49 sinope postgres[709954]: 2021-11-26 14:08:49.730 GMT [709954] ERROR:  could not obtain lock on row in relation "slot_pool"
Nov 26 14:08:49 sinope postgres[709954]: 2021-11-26 14:08:49.730 GMT [709954] STATEMENT:  SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
Nov 26 14:08:49 sinope postgres[709954]:         FROM slot_pool FOR UPDATE NOWAIT
```

If you are really unlucky, that can end up happening over and over and over again.

So to avoid this error, for PostgreSQL only, we first try to acquire an "advisory lock" (advisory because it's up to the application to respect it), and if we cannot, we raise an error like the one that would have happened from the `FOR UPDATE NOWAIT`.

(We still obtain the exclusive lock on the pool rows so that the rows are locked.)
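
For illustration, the shape of that guard is roughly the sketch below. This is a hedged simplification, not the code in the PR: the lock id, the helper name, the exception type, and the use of the transaction-scoped `pg_try_advisory_xact_lock` are assumptions.

```python
# Hedged sketch of the PostgreSQL-only guard described above. The lock id, helper
# name, exception type, and the transaction-scoped advisory lock variant are
# illustrative assumptions, not the exact code from this PR.
from sqlalchemy import text
from sqlalchemy.orm import Session

SCHEDULER_CRITICAL_SECTION_LOCK_ID = 2  # assumed value; must not clash with other global lock ids


def enter_critical_section(session: Session) -> None:
    # Non-blocking attempt: Postgres returns a boolean here instead of writing an
    # ERROR line into the server log.
    acquired = session.execute(
        text("SELECT pg_try_advisory_xact_lock(:id)"),
        {"id": SCHEDULER_CRITICAL_SECTION_LOCK_ID},
    ).scalar()
    if not acquired:
        # Behave like FOR UPDATE NOWAIT would have, but without the server-side ERROR.
        raise RuntimeError("could not obtain critical-section lock")  # placeholder exception type
    # The pool rows are still locked exclusively for the rest of the transaction.
    session.execute(text("SELECT pool, slots FROM slot_pool FOR UPDATE"))
```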

This PR is split into two commits: the second obtains the lock, and the first refactors the existing global locks to use an enum instead of magic constants. These lock ids (integers on Postgres) are "global", so we need to be sure the scheduler's lock doesn't clash with the db upgrade lock.

`create_global_lock` has nothing to do with sessions, so I have moved it to utils.db instead.

As part of this I also changed the lock id that `airflow db reset` uses to share one with `airflow db upgrade` -- there's no point blocking an upgrade if a reset is going to clobber it halfway through.
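
The traceback quoted further down shows the pieces this refactor produces in `airflow/utils/db.py` (a `DBLocks` enum, `create_global_lock`, and `SELECT pg_advisory_lock(:id)`). Below is a minimal sketch of that shape, showing only the Postgres branch; the `SCHEDULER_CRITICAL_SECTION` value, the omission of lock-timeout handling, and the exact signature are assumptions.

```python
# Minimal sketch of the refactor: lock ids as an enum, and a context manager in
# utils.db. Only the Postgres branch is shown; MySQL/MSSQL handling, lock timeouts,
# and error handling are omitted, and the SCHEDULER_CRITICAL_SECTION value is assumed.
import enum
from contextlib import contextmanager

from sqlalchemy import text


class DBLocks(enum.IntEnum):
    # These integers are passed to pg_advisory_lock, and advisory lock ids are global
    # to the database, so the scheduler's lock must never clash with the migration lock.
    MIGRATIONS = 1
    SCHEDULER_CRITICAL_SECTION = 2  # illustrative value


@contextmanager
def create_global_lock(session, lock: DBLocks):
    conn = session.connection()
    if conn.dialect.name == "postgresql":
        conn.execute(text("SELECT pg_advisory_lock(:id)"), {"id": lock.value})
    try:
        yield
    finally:
        if conn.dialect.name == "postgresql":
            conn.execute(text("SELECT pg_advisory_unlock(:id)"), {"id": lock.value})


# Usage, e.g. from initdb/resetdb (this matches the call seen in the traceback below):
# with create_global_lock(session=session, lock=DBLocks.MIGRATIONS):
#     ...run migrations...
```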
@ashb ashb requested review from kaxil and XD-DENG as code owners November 26, 2021 14:35
@boring-cyborg boring-cyborg bot added the area:Scheduler (including HA (high availability) scheduler) label Nov 26, 2021
@ashb ashb (Member, Author) commented Nov 26, 2021

/cc @wolfier @collinmcnulty

@ashb ashb added this to the Airflow 2.3.0 milestone Nov 26, 2021
@ashb ashb (Member, Author) commented Nov 26, 2021

This could probably go into 2.2.x, but I've marked it as 2.3 since I haven't tested it extensively yet.

@ashb ashb force-pushed the remove-postgres-lock-failed-error branch from ea278b6 to f09603b on November 26, 2021 14:46
@potiuk potiuk (Member) commented Nov 26, 2021

Oh nice. Will take a close look later :)

ashb and others added 3 commits November 26, 2021 15:23
They do conceptually very similar things, and when one is running the other shouldn't be. (This is unlikely ever to be hit in practice.)
@ashb ashb requested review from kaxil, uranusjr and potiuk November 29, 2021 11:13
kaxil
kaxil previously approved these changes Nov 29, 2021
@github-actions github-actions bot added the full tests needed (we need to run the full set of tests for this PR to merge) label Nov 29, 2021
@github-actions github-actions bot commented:

The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR and push it with --force-with-lease.

@kaxil kaxil (Member) commented Nov 29, 2021

Hmm, tests are failing:

```
 INFO  [alembic.runtime.migration] Running upgrade 7b2661a43ba3 -> be2bfac3da23, Add has_import_errors column to DagModel
  WARNI [airflow.models.crypto] empty cryptography key - values will not be stored encrypted.
  Traceback (most recent call last):
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  psycopg2.errors.LockNotAvailable: canceling statement due to lock timeout
  
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/opt/airflow/airflow/utils/db.py", line 1017, in create_global_lock
      conn.execute(text('SELECT pg_advisory_lock(:id)'), id=lock.value)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
      return meth(self, multiparams, params)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
      return connection._execute_clauseelement(self, multiparams, params)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1130, in _execute_clauseelement
      distilled_params,
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
      e, statement, parameters, cursor, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
      sqlalchemy_exception, with_traceback=exc_info[2], from_=e
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
      raise exception
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  sqlalchemy.exc.OperationalError: (psycopg2.errors.LockNotAvailable) canceling statement due to lock timeout
  
  [SQL: SELECT pg_advisory_lock(%(id)s)]
  [parameters: {'id': 1}]
  (Background on this error at: http://sqlalche.me/e/13/e3q8)
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  psycopg2.errors.InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
  
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/usr/local/bin/airflow", line 33, in <module>
      sys.exit(load_entry_point('apache-airflow', 'console_scripts', 'airflow')())
    File "/opt/airflow/airflow/__main__.py", line 48, in main
      args.func(args)
    File "/opt/airflow/airflow/cli/cli_parser.py", line 48, in command
      return func(*args, **kwargs)
    File "/opt/airflow/airflow/cli/commands/db_command.py", line 39, in resetdb
      db.resetdb()
    File "/opt/airflow/airflow/utils/session.py", line 69, in wrapper
      return func(*args, session=session, **kwargs)
    File "/opt/airflow/airflow/utils/db.py", line 932, in resetdb
      initdb(session=session)
    File "/opt/airflow/airflow/utils/session.py", line 66, in wrapper
      return func(*args, **kwargs)
    File "/opt/airflow/airflow/utils/db.py", line 599, in initdb
      with create_global_lock(session=session, lock=DBLocks.MIGRATIONS):
    File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
      return next(self.gen)
    File "/opt/airflow/airflow/utils/db.py", line 1030, in create_global_lock
      conn.execute('SET LOCK_TIMEOUT TO DEFAULT')
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1003, in execute
      return self._execute_text(object_, multiparams, params)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1178, in _execute_text
      parameters,
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
      e, statement, parameters, cursor, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
      sqlalchemy_exception, with_traceback=exc_info[2], from_=e
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
      raise exception
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  sqlalchemy.exc.InternalError: (psycopg2.errors.InFailedSqlTransaction) current transaction is aborted, commands ignored until end of transaction block
  
  [SQL: SET LOCK_TIMEOUT TO DEFAULT]
  (Background on this error at: http://sqlalche.me/e/13/2j85)
```

@kaxil kaxil self-requested a review November 29, 2021 13:11
@kaxil kaxil dismissed their stale review November 29, 2021 13:11

failing tests

@ashb ashb (Member, Author) commented Nov 29, 2021

Interesting.... it was passing without changing the lock. Let me think.

@ashb ashb (Member, Author) commented Nov 29, 2021

Fixed in 3b0ec62 -- the connection we issued the lock from was closed when the migrations ran.

The fix there is to use the same connection, so that when the migrations code is finished with it the connection doesn't get closed and we still hold the lock.

I suspect this might have been the problem with the MSSQL locking too -- it's just that on Postgres we never noticed as it was only a warning and the return value was ignored!
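
To illustrate the connection issue (an assumption-level sketch, not Airflow's actual code): `pg_advisory_lock` takes a session-level lock that belongs to the connection that acquired it, so taking the lock on one connection while the migrations, and the later unlock, run on another defeats the purpose.

```python
# Sketch of the broken vs. fixed shapes; the DSN is a placeholder and the migration
# work is elided.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow@localhost/airflow")  # placeholder DSN

# Broken shape: the advisory lock belongs to conn_a, so it is released when conn_a is
# closed, and the migrations run on a different connection in the meantime anyway.
with engine.connect() as conn_a:
    conn_a.execute(text("SELECT pg_advisory_lock(1)"))
# ...migrations run here on a different connection, unprotected...

# Fixed shape: take the lock, do the work, and release the lock all on the one
# connection the migration code uses.
with engine.connect() as conn:
    conn.execute(text("SELECT pg_advisory_lock(1)"))
    # ...run migrations on `conn`...
    conn.execute(text("SELECT pg_advisory_unlock(1)"))
```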

@uranusjr uranusjr (Member) left a comment

MySQL and Postgres tests are all passing, so this should be good to go.

@ashb ashb merged commit 9c69f1c into apache:main Nov 29, 2021
@ashb ashb deleted the remove-postgres-lock-failed-error branch November 29, 2021 18:42
dillonjohnson pushed a commit to dillonjohnson/airflow that referenced this pull request Dec 1, 2021
…h HA schedulers (apache#19842)

* Remove magic constants from global DB locks
* Avoid littering Postgres server logs with "could not obtain lock"
* Use same db lock for initdb and upgradedb

Co-authored-by: Kaxil Naik <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
@AhmadBenMaallem commented:

Has this fix been integrated into an official Docker image?

@ashb ashb (Member, Author) commented Jan 18, 2022

No, as this hasn't been included in a release yet -- it's slated for 2.3.

And while it might be annoying, seeing these in the DB logs is not an error as far as Airflow is concerned.

@jedcunningham jedcunningham added the type:improvement (Changelog: Improvements) label Apr 7, 2022
@fbertos fbertos (Contributor) commented May 20, 2024

Dear all,

Looks like this is happening again: we are receiving the following messages after upgrading to Airflow 2.9.1 (we are using PostgreSQL 15.7.0):

```
2024-05-20 09:51:48.336 GMT [122] ERROR: could not obtain lock on row in relation "dag_run"
2024-05-20 09:51:48.336 GMT [122] STATEMENT: SELECT dag_run.state AS dag_run_state, dag_run.id AS dag_run_id, dag_run.dag_id AS dag_run_dag_id, dag_run.queued_at AS dag_run_queued_at, dag_run.execution_date AS dag_run_execution_date, dag_run.start_date AS dag_run_start_date, dag_run.end_date AS dag_run_end_date, dag_run.run_id AS dag_run_run_id, dag_run.creating_job_id AS dag_run_creating_job_id, dag_run.external_trigger AS dag_run_external_trigger, dag_run.run_type AS dag_run_run_type, dag_run.conf AS dag_run_conf, dag_run.data_interval_start AS dag_run_data_interval_start, dag_run.data_interval_end AS dag_run_data_interval_end, dag_run.last_scheduling_decision AS dag_run_last_scheduling_decision, dag_run.dag_hash AS dag_run_dag_hash, dag_run.log_template_id AS dag_run_log_template_id, dag_run.updated_at AS dag_run_updated_at, dag_run.clear_number AS dag_run_clear_number
FROM dag_run
WHERE dag_run.dag_id = '******' AND dag_run.run_id = 'scheduled__2024-05-20T09:47:00+00:00' FOR UPDATE NOWAIT
```

Could anyone else confirm whether this issue is back for them as well?

Thank you in advance!!!

@deepaksood619 deepaksood619 commented May 28, 2024

Yes, this is happening again for us too.

Airflow version: 2.9.1

Postgres version: PostgreSQL 11.22 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

We are currently running 3 schedulers in parallel.

```
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163792559+05:30 2024-05-28 19:42:13.163 GMT [4019479] ERROR:  could not obtain lock on row in relation "dag_run"
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163834190+05:30 2024-05-28 19:42:13.163 GMT [4019479] STATEMENT:  SELECT dag_run.state AS dag_run_state, dag_run.id AS dag_run_id, dag_run.dag_id AS dag_run_dag_id, dag_run.queued_at AS dag_run_queued_at, dag_run.execution_date AS dag_run_execution_date, dag_run.start_date AS dag_run_start_date, dag_run.end_date AS dag_run_end_date, dag_run.run_id AS dag_run_run_id, dag_run.creating_job_id AS dag_run_creating_job_id, dag_run.external_trigger AS dag_run_external_trigger, dag_run.run_type AS dag_run_run_type, dag_run.conf AS dag_run_conf, dag_run.data_interval_start AS dag_run_data_interval_start, dag_run.data_interval_end AS dag_run_data_interval_end, dag_run.last_scheduling_decision AS dag_run_last_scheduling_decision, dag_run.dag_hash AS dag_run_dag_hash, dag_run.log_template_id AS dag_run_log_template_id, dag_run.updated_at AS dag_run_updated_at, dag_run.clear_number AS dag_run_clear_number
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163839590+05:30 	FROM dag_run
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163844440+05:30 	WHERE dag_run.dag_id = 'xxx' AND dag_run.run_id = 'scheduled__2024-05-28T19:41:00+00:00' FOR UPDATE NOWAIT
```

The errors occur intermittently and resolve themselves after some time.

Because of this, our tasks are getting stuck in the queued state and cannot run on the workers.

We have enough worker concurrency for all DAGs to run.

(Screenshot attached: 2024-05-29 at 1:18 AM)

Any help would be deeply appreciated.

Thanks in advance!!
