Continuous aggregate refresh policy locks out all interaction with the main hypertable if a pg_dump is running #3554

Closed
phiresky opened this issue Sep 6, 2021 · 7 comments · Fixed by #5809

Comments

phiresky commented Sep 6, 2021

Relevant system information:

  • OS: Debian bullseye
  • PostgreSQL version: 12.8
  • TimescaleDB version: 2.4.1
  • Installation method: apt install

Describe the bug
pg_dump was running and, at that moment, dumping a completely different table. pg_dump apparently holds an AccessShareLock on every table it will dump later (see --jobs in the docs), including _timescaledb_catalog.continuous_aggs_invalidation_threshold.

Then the continuous aggregate refresh policy started running. It tries to acquire an AccessExclusiveLock on continuous_aggs_invalidation_threshold, so it started waiting for pg_dump to finish. Why does this need an AccessExclusiveLock?

Then other queries that tried to write to our main table (NOT an aggregate or anything) were blocked behind the waiting refresh policy, since interactions with the main hypertable require an AccessShareLock on the invalidation-threshold table, and PostgreSQL never reorders lock requests, even though two AccessShareLocks could be held at the same time. This kept piling up until site operation had basically stopped, with messages like:

"process 3620065 still waiting for AccessExclusiveLock on relation 2241176 of database 16401 after 1000.130 ms","Process holding the lock: 3594548. Wait queue: 3620065, 3618703, 3619894, 3619893, 3619807, 3619867, 3619759, 3620066, 3620067, 3620068, 3620069."

This took our whole site down yesterday because inserts into a TimescaleDB table failed, which is pretty surprising: I had assumed nothing about the continuous aggregate would affect anything else (nothing was interacting with the aggregate in any way).
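
For reference, a sketch of a diagnostic query (plain PostgreSQL catalogs, 9.6 or later, nothing TimescaleDB-specific) that shows which backends are blocked and by which PIDs:

-- Show blocked backends and the PIDs currently blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;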

To Reproduce
I didn't test this, but unless I missed something, the following should reproduce it (a SQL sketch of these sessions follows the list):

  1. Have one hypertable t and one continuous aggregate a on that table.
  2. Start a transaction and run LOCK TABLE _timescaledb_catalog.continuous_aggs_invalidation_threshold IN ACCESS SHARE MODE (this emulates the long-running pg_dump process).
  3. Start a continuous aggregate refresh on a; this should block behind the lock held in step 2, because the refresh requires an AccessExclusiveLock.
  4. Run any query on t; this should queue behind step 3's waiting AccessExclusiveLock request, because it requires an AccessShareLock.
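
A rough SQL sketch of those steps (t, a, and the column names are placeholders; each session is a separate psql connection):

-- Session 1: emulate the long-running pg_dump.
BEGIN;
LOCK TABLE _timescaledb_catalog.continuous_aggs_invalidation_threshold IN ACCESS SHARE MODE;
-- keep this transaction open

-- Session 2: refresh the continuous aggregate; expected to block behind session 1.
CALL refresh_continuous_aggregate('a', NULL, NULL);

-- Session 3: an ordinary write to the hypertable; expected to queue behind
-- session 2's waiting AccessExclusiveLock request.
INSERT INTO t (time, value) VALUES (now(), 1);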

Expected behavior

The continuous aggregate refresh should either not require an AccessExclusiveLock, or fail as soon as possible when this situation happens (e.g. by setting lock_timeout when acquiring that lock).
In general, I'd think an unused continuous aggregate shouldn't affect operations on the main table in any way.
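
For example (just a sketch of the idea, not something TimescaleDB does today as far as I know), the refresh transaction could bound its own wait:

-- Inside the refresh job's transaction: give up quickly instead of queueing
-- behind a long-running lock holder (the timeout value is illustrative).
SET LOCAL lock_timeout = '1s';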

I'd also expect some documentation about those "hidden" tables and how exactly they interact with each other. I can't find anything about this locking behaviour in the docs, only issue #2308, which may in fact be related.

Actual behavior
Our whole site stopped functioning for over an hour while pg_dump was running.

erimatnor (Contributor) commented:

You could try disabling background workers before dumping the data:

SELECT _timescaledb_internal.stop_background_workers();

NunoFilipeSantos (Contributor) commented:

Hi @phiresky, has @erimatnor's suggestion helped you?


phiresky commented Oct 22, 2021

Disabling updates does circumvent the issue, but in my case a continuous aggregate that wasn't actually used anywhere caused our whole database system to go down during the night due to compounding locks on the source table, so in my opinion that's a fairly large issue. It's not documented anywhere (I think) that updating a continuous aggregate locks the source table, especially not with an AccessExclusiveLock.

I'm also not a database expert, but I would never have expected that automatic aggregate updates would interact with database dumps in a way that causes everything to lock up, and I doubt other people would expect it either.

My solution was to stop using continuous aggregates altogether for now, because I'm too afraid they'll affect other database operations again.


mfundul commented Oct 27, 2021

Quick question: was the continuous aggregate being refreshed by the policy defined on the main hypertable, or on another hypertable?


phiresky commented Oct 27, 2021

The aggregate was on the main hypertable that was being locked, yes; sorry if that wasn't clear. (I currently have only one hypertable; it's basically a log, so it's involved in most interactions with other tables.)

mfundul self-assigned this on Nov 1, 2021
jlewallen added a commit to fieldkit/cloud that referenced this issue Aug 30, 2022
Ran into this issue in prod and it ruined my morning:

timescale/timescaledb#3554

The basic story: if one of the refresh policy jobs runs while a pg_dump is running, the entire database locks up until the pg_dump has finished.
mfundul removed their assignment on Nov 12, 2022

phlegx commented Dec 1, 2022

Dear all,

We are also facing this issue. When we back up with pg_dump, the aggregate tables/hypertable get locked and no data is ingested anymore.

We can also work around this by disabling the background workers with:

SELECT _timescaledb_internal.stop_background_workers();

then running the (long-running) backup with pg_dump, and then re-enabling the workers:

SELECT _timescaledb_internal.start_background_workers();

However, I was wondering: can this be fixed within TimescaleDB itself? I'm not sure whether it can be fixed in the code, but if so, it probably should be. In my opinion, a pg_dump should always work and should not block other DB operations, at least not forever.
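
For reference, the whole workaround as one sequence (the pg_dump step itself runs from the shell; the database name and options below are only examples):

-- 1. Stop the TimescaleDB background workers so no policy jobs get scheduled.
SELECT _timescaledb_internal.stop_background_workers();

-- 2. Run the backup from the shell while the workers are stopped, e.g.:
--      pg_dump -Fc -f backup.dump mydatabase

-- 3. Re-enable the background workers once the dump has finished.
SELECT _timescaledb_internal.start_background_workers();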

estokes-vs commented:

Hello! We are also running into this issue occasionally and are looking for ways to work around it if it can't be adjusted inside TimescaleDB. Our setup uses a primary database with a high-availability replica, and we run pg_dump against the replica.

One question I have: are these background workers actually necessary on a replica, given that it should always reflect the state of the primary server?

fabriziomello added a commit that referenced this issue on Aug 23, 2023

When refreshing a Continuous Aggregate we take a table lock on
_timescaledb_catalog.continuous_aggs_invalidation_threshold while
processing the invalidation logs (the first transaction of the
Continuous Aggregate refresh procedure). This means that even two
different Continuous Aggregates over two different hypertables will
wait for each other in the first phase of the refresh procedure. It
also leads to problems when a pg_dump is running, because pg_dump takes
an AccessShareLock on the tables it will dump, so the Continuous
Aggregate refresh will wait until the pg_dump finishes.

Improve this by relaxing the strong table-level lock to a row-level
lock, so the Continuous Aggregate refresh procedure can now be executed
in multiple sessions with fewer locks.

Fix #3554
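
For illustration, a simplified sketch of the locking change described above (the catalog table and its hypertable_id/watermark columns exist, but the exact statements the extension issues are paraphrased here):

-- Before: the refresh serialized on the whole catalog table.
LOCK TABLE _timescaledb_catalog.continuous_aggs_invalidation_threshold
    IN ACCESS EXCLUSIVE MODE;

-- After: only the row for the hypertable being refreshed is locked, so
-- refreshes of other hypertables (and pg_dump's AccessShareLock on the
-- table) are no longer blocked at the table level.
SELECT watermark
  FROM _timescaledb_catalog.continuous_aggs_invalidation_threshold
 WHERE hypertable_id = 42   -- illustrative id
   FOR UPDATE;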