Avoid littering postgres server logs with "could not obtain lock" with HA schedulers #19842
Conversation
`create_global_lock` has nothing to do with sessions, so I have moved it to `utils.db` instead. As part of this I changed the lock id that `airflow db reset` uses to share one with `airflow db upgrade` -- there's no point blocking an upgrade if a reset is going to clobber it halfway through.

This could probably go into 2.2.x, but I have marked it as 2.3 since I haven't tested it extensively yet.
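As a rough illustration of the lock-id refactor (a sketch only, with assumed names and values, not the actual `utils.db` source): the magic integer constants become members of an enum, with `db reset` and `db upgrade` sharing one id and the scheduler getting its own.

```python
import enum


class DBLocks(enum.IntEnum):
    """Global database lock ids (illustrative values, not the real ones).

    Postgres advisory locks take an integer key, so each member's value is the
    key used there; other backends could use the stringified name instead.
    """

    # `airflow db upgrade` and `airflow db reset` share this id, since there is
    # no point blocking an upgrade that a reset would clobber anyway.
    MIGRATIONS = 1
    SCHEDULER_CRITICAL_SECTION = 2

    def __str__(self) -> str:
        return f"airflow_{self.name}"
```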
If you are running multiple schedulers on PostgreSQL, it is likely that sooner or later you will have one scheduler fail the race to enter the critical section (which is fine, and expected). However, this can end up spamming the DB logs with errors like this:

```
Nov 26 14:08:48 sinope postgres[709953]: 2021-11-26 14:08:48.672 GMT [709953] ERROR: could not obtain lock on row in relation "slot_pool"
Nov 26 14:08:48 sinope postgres[709953]: 2021-11-26 14:08:48.672 GMT [709953] STATEMENT: SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
Nov 26 14:08:48 sinope postgres[709953]:         FROM slot_pool FOR UPDATE NOWAIT
Nov 26 14:08:49 sinope postgres[709954]: 2021-11-26 14:08:49.730 GMT [709954] ERROR: could not obtain lock on row in relation "slot_pool"
Nov 26 14:08:49 sinope postgres[709954]: 2021-11-26 14:08:49.730 GMT [709954] STATEMENT: SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
Nov 26 14:08:49 sinope postgres[709954]:         FROM slot_pool FOR UPDATE NOWAIT
```

If you are really unlucky that can end up happening over and over and over again.

So to avoid this error, for PostgreSQL only, we first try to acquire an "advisory lock" (advisory because it's up to the application to respect it), and if we cannot, we raise an error _like_ the one that would have happened from the `FOR UPDATE NOWAIT`. (We still obtain the exclusive lock on the pool rows so that the rows are locked.)
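A hedged sketch of the pattern described above, not the actual scheduler code: on Postgres, try a transaction-scoped advisory lock first and fail client-side if it is already held, then take the row locks as before. The helper name and lock key below are assumptions for illustration.

```python
from sqlalchemy import text
from sqlalchemy.orm import Session

SCHEDULER_LOCK_KEY = 2  # illustrative advisory-lock key


def lock_pool_rows(session: Session):
    """Lock the slot_pool rows, avoiding server-side ERROR log spam on Postgres."""
    if session.get_bind().dialect.name == "postgresql":
        # pg_try_advisory_xact_lock returns true/false immediately and, unlike
        # FOR UPDATE NOWAIT, does not write an ERROR line to the server log
        # when it loses the race.
        acquired = session.execute(
            text("SELECT pg_try_advisory_xact_lock(:key)"),
            {"key": SCHEDULER_LOCK_KEY},
        ).scalar()
        if not acquired:
            # Raise an error *like* the one FOR UPDATE NOWAIT would have raised,
            # but originating client-side instead of in the Postgres log.
            raise RuntimeError("Could not enter scheduler critical section")
    # The exclusive row locks are still taken so the pool rows stay locked.
    return session.execute(text("SELECT * FROM slot_pool FOR UPDATE")).fetchall()
```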
Oh nice. Will take a close look later :)
Co-authored-by: Kaxil Naik <[email protected]>
They do conceptually very similar things, and when one is running the other shouldn't be. (This is unlikely to ever be hit in practice.)
The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Hmm, tests are failing.
Interesting... it was passing without changing the lock. Let me think.
Co-authored-by: Tzu-ping Chung <[email protected]>
Fixed in 3b0ec62 -- the connection we issued the lock from was being closed when the migrations ran. The fix is to use the same connection, so that when the migrations code is finished with it, it doesn't get closed and the lock we hold remains valid. I suspect this might have been the problem with the MSSQL locking too -- it's just that on Postgres we never noticed, as it was only a warning and the return value was ignored!
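A minimal sketch of the fix described above (assumed implementation, not the real `utils.db` code): take the advisory lock on the same connection that the migrations will run on, so that closing any other connection cannot drop the lock.

```python
from contextlib import contextmanager

from sqlalchemy import text
from sqlalchemy.engine import Connection

MIGRATIONS_LOCK_KEY = 1  # illustrative key shared by `db upgrade` and `db reset`


@contextmanager
def create_global_lock(connection: Connection, lock_key: int = MIGRATIONS_LOCK_KEY):
    """Hold a session-level Postgres advisory lock for the duration of the block."""
    is_postgres = connection.dialect.name == "postgresql"
    if is_postgres:
        connection.execute(text("SELECT pg_advisory_lock(:key)"), {"key": lock_key})
    try:
        # Run the migrations on *this* connection (e.g. by handing it to Alembic),
        # so the connection holding the advisory lock is never closed underneath us.
        yield connection
    finally:
        if is_postgres:
            connection.execute(text("SELECT pg_advisory_unlock(:key)"), {"key": lock_key})
```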
MySQL and Postgres tests are all passing, so this should be good to go.
Avoid littering postgres server logs with "could not obtain lock" with HA schedulers (apache#19842)

* Remove magic constants from global DB locks
* Avoid littering Postgres server logs with "could not obtain lock"
* Use same db lock for initdb and upgradedb

Co-authored-by: Kaxil Naik <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
Has this fix been integrated into an official Docker image?
No, as this hasn't been included in a release yet -- it's slated for 2.3. And while it might be annoying, seeing these in the DB logs is not an error as far as Airflow is concerned.
Dear all,

Looks like this is happening again, as we are receiving the following messages after upgrading to Airflow 2.9.1 (we are using PostgreSQL 15.7.0):

    2024-05-20 09:51:48.336 GMT [122] ERROR: could not obtain lock on row in relation "dag_run"

Could anyone confirm if this issue is back for anybody else? Thank you in advance!
Yes, this is happening again for us too.

Airflow version: 2.9.1
Postgres version: PostgreSQL 11.22 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

We are currently running 3 schedulers in parallel:

    airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163792559+05:30 2024-05-28 19:42:13.163 GMT [4019479] ERROR: could not obtain lock on row in relation "dag_run"
    airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163834190+05:30 2024-05-28 19:42:13.163 GMT [4019479] STATEMENT: SELECT dag_run.state AS dag_run_state, dag_run.id AS dag_run_id, dag_run.dag_id AS dag_run_dag_id, dag_run.queued_at AS dag_run_queued_at, dag_run.execution_date AS dag_run_execution_date, dag_run.start_date AS dag_run_start_date, dag_run.end_date AS dag_run_end_date, dag_run.run_id AS dag_run_run_id, dag_run.creating_job_id AS dag_run_creating_job_id, dag_run.external_trigger AS dag_run_external_trigger, dag_run.run_type AS dag_run_run_type, dag_run.conf AS dag_run_conf, dag_run.data_interval_start AS dag_run_data_interval_start, dag_run.data_interval_end AS dag_run_data_interval_end, dag_run.last_scheduling_decision AS dag_run_last_scheduling_decision, dag_run.dag_hash AS dag_run_dag_hash, dag_run.log_template_id AS dag_run_log_template_id, dag_run.updated_at AS dag_run_updated_at, dag_run.clear_number AS dag_run_clear_number
    airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163839590+05:30 FROM dag_run
    airflow-postgresql-0 postgresql 2024-05-29T01:12:13.163844440+05:30 WHERE dag_run.dag_id = 'xxx' AND dag_run.run_id = 'scheduled__2024-05-28T19:41:00+00:00' FOR UPDATE NOWAIT

The error resolves automatically after some time and comes back intermittently. Because of this, our tasks are getting stuck in the queued state and are not able to run on a worker, even though we have enough worker concurrency for all DAGs to run. Any help would be deeply appreciated. Thanks in advance!
If you are running multiple schedulers on PostgreSQL, it is likely that sooner or later you will have one scheduler fail the race to enter the critical section (which is fine, and expected).
However, this can end up spamming the DB logs with `could not obtain lock on row in relation "slot_pool"` errors, as shown in the excerpt above.
If you are really unlucky that can end up happening over and over and over again.
So to avoid this error, for PostgreSQL only, we first try to acquire an "advisory lock" (advisory because it's up to the application to respect it), and if we cannot, we raise an error like the one that would have happened from the `FOR UPDATE NOWAIT`. (We still obtain the exclusive lock on the pool rows so that the rows are locked.)
This PR is split into two commits: the second obtains the lock, and the first refactors the existing global locks to use enums, removing magic constants. These lock ids (integers on Postgres) are "global", so we need to be sure the scheduler's lock doesn't clash with the `db upgrade` lock.
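For illustration only (not part of the PR), a small psycopg2 script showing the advisory-lock semantics that make the shared, collision-free key matter: whichever connection grabs the key first wins, and the loser simply gets `False` back rather than a server-logged error. The DSN and key value are placeholders.

```python
import psycopg2  # assumed to be installed; any Postgres driver would do

DSN = "dbname=airflow user=airflow"  # placeholder connection string
LOCK_KEY = 1  # placeholder; in practice this would be one of the global lock ids

conn_a = psycopg2.connect(DSN)
conn_b = psycopg2.connect(DSN)

with conn_a.cursor() as cur:
    cur.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_KEY,))
    print(cur.fetchone()[0])  # True: the first session now holds the lock

with conn_b.cursor() as cur:
    cur.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_KEY,))
    print(cur.fetchone()[0])  # False: the loser is told politely, and no ERROR
                              # line is written to the Postgres server log

with conn_a.cursor() as cur:
    cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))

conn_a.close()
conn_b.close()
```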