
Avoid littering postgres server logs with "could not obtain lock" with HA schedulers #19842

Merged
ashb merged 7 commits into apache:main from remove-postgres-lock-failed-error on Nov 29, 2021

Conversation

@ashb ashb (Member) commented Nov 26, 2021

If you are running multiple schedulers on PostgreSQL, it is likely that sooner or later you will have one scheduler fail the race to enter the critical section (which is fine, and expected).

However this can end up spamming the DB logs with errors like this:

```
Nov 26 14:08:48 sinope postgres[709953]: 2021-11-26 14:08:48.672 GMT [709953] ERROR:  could not obtain lock on row in relation "slot_pool"
Nov 26 14:08:48 sinope postgres[709953]: 2021-11-26 14:08:48.672 GMT [709953] STATEMENT:  SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
Nov 26 14:08:48 sinope postgres[709953]:         FROM slot_pool FOR UPDATE NOWAIT
Nov 26 14:08:49 sinope postgres[709954]: 2021-11-26 14:08:49.730 GMT [709954] ERROR:  could not obtain lock on row in relation "slot_pool"
Nov 26 14:08:49 sinope postgres[709954]: 2021-11-26 14:08:49.730 GMT [709954] STATEMENT:  SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
Nov 26 14:08:49 sinope postgres[709954]:         FROM slot_pool FOR UPDATE NOWAIT
```

If you are really unlucky, that can end up happening over and over and over again.

So to avoid this error, for PostgreSQL only, we first try to acquire an "advisory lock" (advisory because it's up to the application to respect it), and if we cannot, we raise an error like the one that would have happened from the `FOR UPDATE NOWAIT`.

(We still obtain the exclusive lock on the pool rows so that the rows are locked.)
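
For illustration, the shape of that guard is roughly the sketch below. This is a hedged simplification, not the code in the PR: the lock id, the helper name, the exception type, and the use of the transaction-scoped `pg_try_advisory_xact_lock` are assumptions.

```python
# Hedged sketch of the PostgreSQL-only guard described above. The lock id, helper
# name, exception type, and the transaction-scoped advisory lock variant are
# illustrative assumptions, not the exact code from this PR.
from sqlalchemy import text
from sqlalchemy.orm import Session

SCHEDULER_CRITICAL_SECTION_LOCK_ID = 2  # assumed value; must not clash with other global lock ids


def enter_critical_section(session: Session) -> None:
    # Non-blocking attempt: Postgres returns a boolean here instead of writing an
    # ERROR line into the server log.
    acquired = session.execute(
        text("SELECT pg_try_advisory_xact_lock(:id)"),
        {"id": SCHEDULER_CRITICAL_SECTION_LOCK_ID},
    ).scalar()
    if not acquired:
        # Behave like FOR UPDATE NOWAIT would have, but without the server-side ERROR.
        raise RuntimeError("could not obtain critical-section lock")  # placeholder exception type
    # The pool rows are still locked exclusively for the rest of the transaction.
    session.execute(text("SELECT pool, slots FROM slot_pool FOR UPDATE"))
```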

This PR is split into two commits: the second obtains the lock, and the first refactors the existing global locks to use an enum instead of magic constants. These lock ids (integers on Postgres) are "global", so we need to be sure the scheduler's lock doesn't clash with the db upgrade lock.

`create_global_lock` has nothing to do with sessions, so I have moved it to utils.db instead.

As part of this I also changed the lock id that `airflow db reset` uses to share one with `airflow db upgrade` -- there's no point blocking an upgrade if a reset is going to clobber it halfway through.
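
The traceback quoted further down shows the pieces this refactor produces in `airflow/utils/db.py` (a `DBLocks` enum, `create_global_lock`, and `SELECT pg_advisory_lock(:id)`). Below is a minimal sketch of that shape, showing only the Postgres branch; the `SCHEDULER_CRITICAL_SECTION` value, the omission of lock-timeout handling, and the exact signature are assumptions.

```python
# Minimal sketch of the refactor: lock ids as an enum, and a context manager in
# utils.db. Only the Postgres branch is shown; MySQL/MSSQL handling, lock timeouts,
# and error handling are omitted, and the SCHEDULER_CRITICAL_SECTION value is assumed.
import enum
from contextlib import contextmanager

from sqlalchemy import text


class DBLocks(enum.IntEnum):
    # These integers are passed to pg_advisory_lock, and advisory lock ids are global
    # to the database, so the scheduler's lock must never clash with the migration lock.
    MIGRATIONS = 1
    SCHEDULER_CRITICAL_SECTION = 2  # illustrative value


@contextmanager
def create_global_lock(session, lock: DBLocks):
    conn = session.connection()
    if conn.dialect.name == "postgresql":
        conn.execute(text("SELECT pg_advisory_lock(:id)"), {"id": lock.value})
    try:
        yield
    finally:
        if conn.dialect.name == "postgresql":
            conn.execute(text("SELECT pg_advisory_unlock(:id)"), {"id": lock.value})


# Usage, e.g. from initdb/resetdb (this matches the call seen in the traceback below):
# with create_global_lock(session=session, lock=DBLocks.MIGRATIONS):
#     ...run migrations...
```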
@ashb ashb requested review from kaxil and XD-DENG as code owners November 26, 2021 14:35
@boring-cyborg boring-cyborg bot added the area:Scheduler (including HA (high availability) scheduler) label Nov 26, 2021
@ashb ashb (Member, Author) commented Nov 26, 2021

/cc @wolfier @collinmcnulty

@ashb ashb added this to the Airflow 2.3.0 milestone Nov 26, 2021
@ashb ashb (Member, Author) commented Nov 26, 2021

This could probably go into 2.2.x, but I've marked it as 2.3 since I haven't tested it extensively yet.

@ashb ashb force-pushed the remove-postgres-lock-failed-error branch from ea278b6 to f09603b on November 26, 2021 14:46
@potiuk potiuk (Member) commented Nov 26, 2021

Oh nice. Will take a close look later :)

ashb and others added 3 commits November 26, 2021 15:23
They do conceptually very similar things, and when one is running the other shouldn't be. (This is unlikely ever to be hit in practice.)
@ashb ashb requested review from kaxil, uranusjr and potiuk November 29, 2021 11:13
kaxil
kaxil previously approved these changes Nov 29, 2021
@github-actions github-actions bot added the full tests needed (we need to run the full set of tests for this PR to merge) label Nov 29, 2021
@github-actions github-actions bot commented:

The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR and push it with --force-with-lease.

@kaxil kaxil (Member) commented Nov 29, 2021

Hmm, tests are failing:

```
 INFO  [alembic.runtime.migration] Running upgrade 7b2661a43ba3 -> be2bfac3da23, Add has_import_errors column to DagModel
  WARNI [airflow.models.crypto] empty cryptography key - values will not be stored encrypted.
  Traceback (most recent call last):
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  psycopg2.errors.LockNotAvailable: canceling statement due to lock timeout
  
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/opt/airflow/airflow/utils/db.py", line 1017, in create_global_lock
      conn.execute(text('SELECT pg_advisory_lock(:id)'), id=lock.value)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
      return meth(self, multiparams, params)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
      return connection._execute_clauseelement(self, multiparams, params)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1130, in _execute_clauseelement
      distilled_params,
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
      e, statement, parameters, cursor, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
      sqlalchemy_exception, with_traceback=exc_info[2], from_=e
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
      raise exception
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  sqlalchemy.exc.OperationalError: (psycopg2.errors.LockNotAvailable) canceling statement due to lock timeout
  
  [SQL: SELECT pg_advisory_lock(%(id)s)]
  [parameters: {'id': 1}]
  (Background on this error at: http://sqlalche.me/e/13/e3q8)
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  psycopg2.errors.InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
  
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/usr/local/bin/airflow", line 33, in <module>
      sys.exit(load_entry_point('apache-airflow', 'console_scripts', 'airflow')())
    File "/opt/airflow/airflow/__main__.py", line 48, in main
      args.func(args)
    File "/opt/airflow/airflow/cli/cli_parser.py", line 48, in command
      return func(*args, **kwargs)
    File "/opt/airflow/airflow/cli/commands/db_command.py", line 39, in resetdb
      db.resetdb()
    File "/opt/airflow/airflow/utils/session.py", line 69, in wrapper
      return func(*args, session=session, **kwargs)
    File "/opt/airflow/airflow/utils/db.py", line 932, in resetdb
      initdb(session=session)
    File "/opt/airflow/airflow/utils/session.py", line 66, in wrapper
      return func(*args, **kwargs)
    File "/opt/airflow/airflow/utils/db.py", line 599, in initdb
      with create_global_lock(session=session, lock=DBLocks.MIGRATIONS):
    File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
      return next(self.gen)
    File "/opt/airflow/airflow/utils/db.py", line 1030, in create_global_lock
      conn.execute('SET LOCK_TIMEOUT TO DEFAULT')
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1003, in execute
      return self._execute_text(object_, multiparams, params)
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1178, in _execute_text
      parameters,
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
      e, statement, parameters, cursor, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
      sqlalchemy_exception, with_traceback=exc_info[2], from_=e
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
      raise exception
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
      cursor, statement, parameters, context
    File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
      cursor.execute(statement, parameters)
  sqlalchemy.exc.InternalError: (psycopg2.errors.InFailedSqlTransaction) current transaction is aborted, commands ignored until end of transaction block
  
  [SQL: SET LOCK_TIMEOUT TO DEFAULT]
  (Background on this error at: http://sqlalche.me/e/13/2j85)
```

@kaxil kaxil self-requested a review November 29, 2021 13:11
@kaxil kaxil dismissed their stale review November 29, 2021 13:11

failing tests

@ashb ashb (Member, Author) commented Nov 29, 2021

Interesting.... it was passing without changing the lock. Let me think.

@ashb ashb (Member, Author) commented Nov 29, 2021

Fixed in 3b0ec62 -- the connection we issued the lock from was closed when the migrations ran.

The fix there is to use the same connection, so that when the migrations code is finished with it the connection doesn't get closed and we still hold the lock.

I suspect this might have been the problem with the MSSQL locking too -- it's just that on Postgres we never noticed as it was only a warning and the return value was ignored!
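
To illustrate the connection issue (an assumption-level sketch, not Airflow's actual code): `pg_advisory_lock` takes a session-level lock that belongs to the connection that acquired it, so taking the lock on one connection while the migrations, and the later unlock, run on another defeats the purpose.

```python
# Sketch of the broken vs. fixed shapes; the DSN is a placeholder and the migration
# work is elided.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow@localhost/airflow")  # placeholder DSN

# Broken shape: the advisory lock belongs to conn_a, so it is released when conn_a is
# closed, and the migrations run on a different connection in the meantime anyway.
with engine.connect() as conn_a:
    conn_a.execute(text("SELECT pg_advisory_lock(1)"))
# ...migrations run here on a different connection, unprotected...

# Fixed shape: take the lock, do the work, and release the lock all on the one
# connection the migration code uses.
with engine.connect() as conn:
    conn.execute(text("SELECT pg_advisory_lock(1)"))
    # ...run migrations on `conn`...
    conn.execute(text("SELECT pg_advisory_unlock(1)"))
```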

@uranusjr uranusjr (Member) left a comment

MySQL and Postgres tests are all passing, so this should be good to go.

@ashb ashb merged commit 9c69f1c into apache:main Nov 29, 2021
@ashb ashb deleted the remove-postgres-lock-failed-error branch November 29, 2021 18:42
dillonjohnson pushed a commit to dillonjohnson/airflow that referenced this pull request Dec 1, 2021
…h HA schedulers (apache#19842)

* Remove magic constants from global DB locks
* Avoid littering Postgres server logs with "could not obtain lock"
* Use same db lock for initdb and upgradedb

Co-authored-by: Kaxil Naik <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
@AhmadBenMaallem commented:

Has this fix been integrated into an official Docker image?

@ashb ashb (Member, Author) commented Jan 18, 2022

No, as this hasn't been included in a release yet -- it's slated for 2.3.

And while it might be annoying, seeing these in the DB logs is not an error as far as Airflow is concerned.

@jedcunningham jedcunningham added the type:improvement (Changelog: Improvements) label Apr 7, 2022
@fbertos fbertos (Contributor) commented May 20, 2024

Dear all,

Looks like this is happening again: we are receiving the following messages after upgrading to Airflow 2.9.1 (we are using PostgreSQL 15.7.0):

```
2024-05-20 09:51:48.336 GMT [122] ERROR: could not obtain lock on row in relation "dag_run"
2024-05-20 09:51:48.336 GMT [122] STATEMENT: SELECT dag_run.state AS dag_run_state, dag_run.id AS dag_run_id, dag_run.dag_id AS dag_run_dag_id, dag_run.queued_at AS dag_run_queued_at, dag_run.execution_date AS dag_run_execution_date, dag_run.start_date AS dag_run_start_date, dag_run.end_date AS dag_run_end_date, dag_run.run_id AS dag_run_run_id, dag_run.creating_job_id AS dag_run_creating_job_id, dag_run.external_trigger AS dag_run_external_trigger, dag_run.run_type AS dag_run_run_type, dag_run.conf AS dag_run_conf, dag_run.data_interval_start AS dag_run_data_interval_start, dag_run.data_interval_end AS dag_run_data_interval_end, dag_run.last_scheduling_decision AS dag_run_last_scheduling_decision, dag_run.dag_hash AS dag_run_dag_hash, dag_run.log_template_id AS dag_run_log_template_id, dag_run.updated_at AS dag_run_updated_at, dag_run.clear_number AS dag_run_clear_number
FROM dag_run
WHERE dag_run.dag_id = '******' AND dag_run.run_id = 'scheduled__2024-05-20T09:47:00+00:00' FOR UPDATE NOWAIT
```

Could anyone else confirm whether this issue is back for them as well?

Thank you in advance!!!

@deepaksood619 deepaksood619 commented May 28, 2024

Yes, this is happening again for us too.

Airflow version: 2.9.1

Postgres version: PostgreSQL 11.22 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

We are currently running 3 schedulers in parallel.

```
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163792559+05:30 2024-05-28 19:42:13.163 GMT [4019479] ERROR:  could not obtain lock on row in relation "dag_run"
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163834190+05:30 2024-05-28 19:42:13.163 GMT [4019479] STATEMENT:  SELECT dag_run.state AS dag_run_state, dag_run.id AS dag_run_id, dag_run.dag_id AS dag_run_dag_id, dag_run.queued_at AS dag_run_queued_at, dag_run.execution_date AS dag_run_execution_date, dag_run.start_date AS dag_run_start_date, dag_run.end_date AS dag_run_end_date, dag_run.run_id AS dag_run_run_id, dag_run.creating_job_id AS dag_run_creating_job_id, dag_run.external_trigger AS dag_run_external_trigger, dag_run.run_type AS dag_run_run_type, dag_run.conf AS dag_run_conf, dag_run.data_interval_start AS dag_run_data_interval_start, dag_run.data_interval_end AS dag_run_data_interval_end, dag_run.last_scheduling_decision AS dag_run_last_scheduling_decision, dag_run.dag_hash AS dag_run_dag_hash, dag_run.log_template_id AS dag_run_log_template_id, dag_run.updated_at AS dag_run_updated_at, dag_run.clear_number AS dag_run_clear_number
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163839590+05:30 	FROM dag_run
airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163844440+05:30 	WHERE dag_run.dag_id = 'xxx' AND dag_run.run_id = 'scheduled__2024-05-28T19:41:00+00:00' FOR UPDATE NOWAIT
```

The errors occur intermittently and resolve themselves after some time.

Because of this, our tasks are getting stuck in the queued state and cannot run on the workers.

We have enough worker concurrency for all DAGs to run.

(Screenshot attached: 2024-05-29 at 1:18 AM)

Any help would be deeply appreciated.

Thanks in advance!!
