
Room deletion (shutdown) fail in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294

Closed
PeterCxy opened this issue Jul 1, 2021 · 11 comments · Fixed by #12942
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@PeterCxy

PeterCxy commented Jul 1, 2021

Description

When using the room deletion API to remove a large room (such as Matrix HQ) from the server, the purging process can enter a constant fail-retry loop if it needs more than a few seconds to finish. PostgreSQL reports "unable to serialize access" because, on a running server, the same tables are constantly being read and modified by other transactions.

Steps to reproduce

  • On a moderately busy server (e.g. one that participates in multiple moderately-sized federated rooms), try to purge a large federated room using the delete room API (see the sketch below).
  • Observe that the process gets stuck, with "unable to serialize access" errors reported in the logs. You can also observe the behaviour in PgHero, where the same long-running query reappears at a regular interval, indicating that Synapse keeps retrying it.
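For reference, a minimal sketch of how the purge can be triggered, assuming the v1 Delete Room admin API; the homeserver URL, admin access token, and room ID are placeholders:

from urllib.parse import quote
import requests

BASE_URL = "https://synapse.example.com"   # assumption: your homeserver's admin API URL
ADMIN_TOKEN = "<admin access token>"       # assumption: access token of a server admin
ROOM_ID = "<room id>"                      # the room to shut down and purge

# Kick off the room shutdown + purge. The request blocks until the purge
# completes (or fails), which for a large room can take a very long time.
resp = requests.post(
    f"{BASE_URL}/_synapse/admin/v1/rooms/{quote(ROOM_ID, safe='')}/delete",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={"purge": True, "block": False},
    timeout=None,
)
print(resp.status_code, resp.json())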

Version information

  • Version: 1.37.1

  • Install method: pip

  • Platform: Debian 10 "buster", in a LXC container

Notes

The error goes away if the isolation level for the room-purging transaction is lowered to READ COMMITTED. I am not sure whether this is strictly correct, but I assume it should be fine given that we are simply deleting everything related to the room.

diff --git a/synapse/storage/databases/main/purge_events.py b/synapse/storage/databases/main/purge_events.py
index 7fb7780d0..2619a6602 100644
--- a/synapse/storage/databases/main/purge_events.py
+++ b/synapse/storage/databases/main/purge_events.py
@@ -313,6 +313,7 @@ class PurgeEventsStore(StateGroupWorkerStore, CacheInvalidationWorkerStore):
         )

     def _purge_room_txn(self, txn, room_id: str) -> List[int]:
+        txn.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
         # First we fetch all the state groups that should be deleted, before
         # we delete that information.
         txn.execute(

On a second note, is there a reason why the isolation level is set to REPEATABLE READ by default globally? Does Synapse really need REPEATABLE READ on every transaction?
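(To illustrate the distinction being drawn here, the following is a rough standalone psycopg2 sketch, not Synapse's actual code, of a connection-wide REPEATABLE READ default versus a one-off READ COMMITTED override for a single transaction, as in the patch above. The DSN and room ID are placeholders.)

import psycopg2
from psycopg2 import extensions

conn = psycopg2.connect("dbname=synapse user=synapse_user")  # placeholder DSN
# Connection-wide default, comparable to Synapse's global REPEATABLE READ setting.
conn.set_session(isolation_level=extensions.ISOLATION_LEVEL_REPEATABLE_READ)

room_id = "<room id>"  # placeholder
with conn.cursor() as txn:
    # Per-transaction override; must be issued before any other query in the transaction.
    txn.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
    txn.execute(
        "DELETE FROM event_json WHERE event_id IN"
        " (SELECT event_id FROM events WHERE room_id = %s)",
        (room_id,),
    )
conn.commit()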

@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jul 23, 2021
@maranda

maranda commented Jul 26, 2021

I think I have replicated the reported symptoms even on small, empty rooms, except that in my case the master process stops responding completely and makes Synapse unusable. Postgres also reports deadlocks:

2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse ERROR:  deadlock detected
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse DETAIL:  Process 88361 waits for ShareLock on transaction 199189797; blocked by process 88362.
        Process 88362 waits for ShareLock on transaction 199188675; blocked by process 88361.
        Process 88361: DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!ceAlNsZbKnKwkXshUJ:services.aria-net.org' )
        Process 88362: UPDATE event_json SET json = '{"auth_events":["$uV3PiiusEFaYkHJD9XGuuy3IoOI_35oRXD_KWuBBNBA","$VkJP1G-5_r7H0phbLMD21sExNzilqpkAaDIJ0dPQ4H4","$HkDKzsAiqZ0UjSrY4P30Mpcg6S2B64_W_pUNKE347eQ"],"content":{},"dep>
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse HINT:  See server log for query details.
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse CONTEXT:  while deleting tuple (1356100,2) in relation "event_json"
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse STATEMENT:  DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!ceAlNsZbKnKwkXshUJ:services.aria-net.org' )
2021-07-25 16:05:10.719 UTC [88357] synapse_user@synapse ERROR:  could not serialize access due to concurrent update
2021-07-25 16:05:10.719 UTC [88357] synapse_user@synapse STATEMENT:  DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!yhuhlVYNMYpjlzRXhW:services.aria-net.org' )
2021-07-25 16:05:45.430 UTC [88361] synapse_user@synapse ERROR:  could not serialize access due to concurrent update
2021-07-25 16:05:45.430 UTC [88361] synapse_user@synapse STATEMENT:  DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!ceAlNsZbKnKwkXshUJ:services.aria-net.org' )
2021-07-25 16:06:44.606 UTC [88363] synapse_user@synapse ERROR:  could not serialize access due to concurrent update

This is quite a showstopper atm.

@PeterCxy
Author

@maranda Does the patch I posted in the issue fix it for you?

@maranda

maranda commented Jul 26, 2021

I'm patching Synapse as we speak, I'll be back with results asap.

@maranda

maranda commented Jul 26, 2021

It fixes the deadlocks @PeterCxy, but it doesn't stop the Synapse master process from stalling and then becoming unresponsive. Requests in flight start accumulating and then it dies.

@bradtgmurray
Contributor

Seeing this happen to us as well. We've got some fairly large rooms that we're looking to purge (1 million+ events), and after grinding through the tables for over an hour, one table fails to get cleaned up and the whole transaction has to start over again.

Some thoughts:

Can we somehow avoid the concurrent updates? We've only deleted two of these large rooms so far, and both of them hit a concurrent update in room_stats_state, which is presumably some periodic stat tracking. Can we disable stat tracking for the room before purging it?

Does this all need to be in one transaction? Can we somehow kill the room in one transaction and then do the remaining cleanup in follow-up transactions?

@babolivier
Contributor

babolivier commented Dec 13, 2021

We had a quick chat about this earlier in #synapse-dev today.

It's preferable to have the whole operation run in a single transaction, since otherwise a failure halfway through would likely leave the database in an inconsistent state. The code is written on the assumption that, because all local users have left the room before we start purging it, there shouldn't be any concurrent access; but it fails to account for the fact that other background processes in Synapse (such as updating room stats) can still touch the same rows.

We should try to pinpoint all of the processes that access or update data for a room after all of its local users have left, and make them stop doing that (or at least make them more mindful that the room they're working on might be in the process of being purged from the database), which isn't trivial. Alternatively, for a few of them (such as room stats) we could potentially move the work to a separate transaction, though this would need to be approached on a case-by-case basis.

On why Synapse defaults to REPEATABLE READ (it does): the reason is an abundance of caution, because some of the transactions Synapse runs do require reads to be repeatable. This is very likely not the case for most of them, though, and as we can see it can cause issues, so in the future we'll probably want to stop using it everywhere and default to READ COMMITTED. #11567 tracks this side of the issue.

@richvdh
Member

richvdh commented Jan 27, 2022

one possible cause of this: rows are removed from event_push_actions once they have been processed (or once they are more than a month old); this will conflict with the purge operation.

@richvdh
Member

richvdh commented Jan 27, 2022

another: we purge rows from stream_ordering_to_exterm when they are a month old. Again, this can conflict with purges.

An easy solution for both stream_ordering_to_exterm and event_push_actions is to delete from them at the start of the purge operation rather than at the end, so that there is less chance of a concurrent update happening between the start of the purge transaction and the delete on stream_ordering_to_exterm/event_push_actions. That said, now that #11799 has landed (thanks to @Fizzadar), we can probably just set the isolation level for the purge transaction to READ COMMITTED.
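A rough sketch of what that might look like, assuming the per-transaction isolation_level argument added in #11799 and the IsolationLevel enum; the exact names and call site are from memory and may differ from the eventual fix:

from typing import List

from synapse.storage.engines import IsolationLevel

# Hypothetical method on PurgeEventsStore.
async def purge_room(self, room_id: str) -> List[int]:
    # Run the whole purge in a single READ COMMITTED transaction instead of
    # inheriting the database-wide REPEATABLE READ default, so that concurrent
    # writes by background jobs don't force the purge to restart.
    return await self.db_pool.runInteraction(
        "purge_room",
        self._purge_room_txn,
        room_id,
        isolation_level=IsolationLevel.READ_COMMITTED,
    )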

@richvdh
Member

richvdh commented Jan 27, 2022

In the meantime, a workaround is to do the following directly on the database before running the purge:

delete from stream_ordering_to_exterm where room_id='<room id>';
delete from event_push_actions where room_id='<room id>';

This should be safe provided all local users have left the room.

@bradtgmurray
Contributor

bradtgmurray commented Apr 26, 2022

Just bumping this up with some more data:

Trying to purge a room with ~800k events. Deleting the room from each event-related table (events, event_json, event_edges, etc.) takes ~20 minutes, and the whole cycle takes about 2 hours before we hit "could not serialize access due to concurrent delete 4/5" shortly after removing <room id> from stream_ordering_to_exterm, at which point the 2 hours of database hammering start over again.

@Fizzadar
Contributor

Fizzadar commented Jun 6, 2022

Happy to confirm that the PR above has fixed the issue on our synapse instance 🎉
