
Handle race between persisting an event and un-partial stating a room #13100

Merged Jul 5, 2022 (18 commits)

Conversation

@squahtx (Contributor) commented Jun 17, 2022

Handle race between persisting an event and un-partial stating a room

Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.

We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special PartialStateConflictError exception,
which makes its way up to the method in which we computed the event
context.

To make things difficult, the exception needs to cross a replication
request: /fed_send_events for events coming over federation and
/send_event for events from clients. We transport the
PartialStateConflictError as a 409 Conflict over replication and
turn 409s back into PartialStateConflictErrors on the worker making
the request.
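
In outline, the exception is a `SynapseError` carrying a 409 Conflict status so that it survives the replication round trip. A minimal sketch, based on the diff hunks quoted in the review below (the message and error code are illustrative):

```python
from http import HTTPStatus

from synapse.api.errors import Codes, SynapseError


class PartialStateConflictError(SynapseError):
    """Raised when we try to persist a partial state event into a room that
    has been un-partial stated in the meantime.

    Carries a 409 Conflict status purely so that it can be transported over
    a replication request; it should never be exposed to clients.
    """

    def __init__(self) -> None:
        super().__init__(
            HTTPStatus.CONFLICT,
            msg="Cannot persist partial state event in un-partial stated room",
            errcode=Codes.UNKNOWN,
        )
```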

All client events go through
EventCreationHandler.handle_new_client_event, which is called in
a lot of places. Instead of trying to update all the code which
creates client events, we turn the PartialStateConflictError into a
429 Too Many Requests in
EventCreationHandler.handle_new_client_event and hope that clients
take it as a hint to retry their request.
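
A minimal sketch of that conversion in `EventCreationHandler.handle_new_client_event` (the persist call is abbreviated, and the import path for `PartialStateConflictError` is an assumption):

```python
from synapse.api.errors import LimitExceededError, PartialStateConflictError

try:
    # Hypothetical stand-in for the real persist-and-notify call.
    event = await self._persist_event(requester, event, context)
except PartialStateConflictError as e:
    # The room was un-partial stated after we computed the event context.
    # Surface a 429 with retry_after_ms=0 as a hint to retry immediately.
    raise LimitExceededError(msg=e.msg, errcode=e.errcode, retry_after_ms=0)
```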

On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
FederationEventHandler._auth_and_persist_outliers_inner,
FederationHandler.do_knock, FederationHandler.on_invite_request and
FederationHandler.do_remotely_reject_invite. These events won't have
the partial state flag, so we do not need to do anything for them.

The remaining 3 paths which create events are
FederationEventHandler.process_remote_join,
FederationEventHandler.on_send_membership_event and
FederationEventHandler._process_received_pdu.

We can't hit the race in process_remote_join unless we're handling an
additional join into a partial state room, which currently blocks, so we
make no attempt to handle it correctly there.

on_send_membership_event is only called by
FederationServer._on_send_membership_event, so we catch the
PartialStateConflictError there and retry just once.

_process_received_pdu is called by on_receive_pdu for incoming
events and _process_pulled_event for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
PartialStateConflictError in on_receive_pdu and retry just once.
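
Both federation-side retries have the same shape; a sketch of the
`FederationServer._on_send_membership_event` case, pieced together from the
diff hunks quoted in the review below:

```python
try:
    return await self._federation_event_handler.on_send_membership_event(
        origin, event
    )
except PartialStateConflictError:
    # The room was un-partial stated between computing the event context
    # and persisting the event. A room can only be un-partial stated once,
    # so a single retry suffices.
    logger.info(
        "Room %s was un-partial stated during `on_send_membership_event`, trying again.",
        room_id,
    )
    return await self._federation_event_handler.on_send_membership_event(
        origin, event
    )
```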

Referring to the graph of code paths in
#12988 (comment)
may make the above make more sense.


[graphviz diagram of the code paths involved in event persistence]

Can be reviewed commit by commit, though it's still easy to get lost.
Refer to the handy picture above to figure out where things fit in.

Sean Quah added 12 commits June 17, 2022 13:56
Catch `IntegrityError`s instead of `DatabaseError`s and downgrade the
log message to INFO.
Define a `PartialStateConflictError` exception, to be raised when
persisting a partial state event into an un-partial stated room.

Signed-off-by: Sean Quah <[email protected]>
…vent in an un-partial stated room

Raise a `PartialStateConflictError` in
`PersistEventsStore.store_event_state_mappings_txn` when we try to
persist a partial state event in an un-partial stated room.

Also document the exception in the docstrings for
`PersistEventsStore._persist_events_and_state_updates`,
`_persist_events_txn` and `_update_outliers_txn`.

Signed-off-by: Sean Quah <[email protected]>
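
A sketch of the shape of that detection in the data store layer (the insert
details are abbreviations; the foreign key involved is assumed to be the one
from `partial_state_events` to `partial_state_rooms`, whose target row is
deleted when a room is un-partial stated):

```python
try:
    # Record which of the events being persisted have partial state.
    self.db_pool.simple_insert_many_txn(
        txn,
        table="partial_state_events",
        keys=("room_id", "event_id"),
        values=partial_state_rows,
    )
except self.db_pool.engine.module.IntegrityError as e:
    # The foreign key constraint fails if the room's `partial_state_rooms`
    # row vanished, i.e. the room was un-partial stated while we were
    # processing the event. Let the caller recompute the context and retry.
    logger.info(
        "Tried to persist an event with partial state in a room that is "
        "no longer partial stated: %s",
        e,
    )
    raise PartialStateConflictError() from e
```
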
…roller`

Update the docstrings for `persist_event`, `persist_events` and
`_persist_event_batch`.

Signed-off-by: Sean Quah <[email protected]>
…_and_notify_client_event`

The `PartialStateConflictError` comes from the call to
`EventsPersistenceStorageController.persist_event` in the middle of the
method.

Signed-off-by: Sean Quah <[email protected]>
Instead of sprinkling retries for client events all over the place,
raise a 503 Service Unavailable in
`EventCreationHandler.handle_new_client_event`, which all client events
go through. A 503 is usually temporary and it is hoped that clients will
retry whatever they are doing.

Signed-off-by: Sean Quah <[email protected]>
…ess_remote_join`

Convert `PartialStateConflictError`s to 503s when processing remote
joins. We make no attempt to handle the error correctly, since it can
only occur on additional joins into partial state rooms, which isn't
supported yet.

Signed-off-by: Sean Quah <[email protected]>
…push_actions_and_persist_event`

The `PartialStateConflictError` comes from the call to
`persist_events_and_notify` near the end.

Signed-off-by: Sean Quah <[email protected]>
Retry `_process_received_pdu` on `PartialStateConflictError` in
`FederationEventHandler.on_receive_pdu`.

Document `PartialStateConflictError` in the docstring for
`FederationEventHandler._process_received_pdu`. The exception can come
from the call to
`FederationEventHandler._run_push_actions_and_persist_event`.

We ignore `FederationEventHandler._process_pulled_event`, because those
events should not be persisted with partial state.

Signed-off-by: Sean Quah <[email protected]>
@squahtx requested a review from a team as a code owner on June 17, 2022 13:23
@squahtx force-pushed the squah/faster_room_joins_fix_departial_stating_race branch from ed23f1d to 8af0caa on June 17, 2022 13:26
@reivilibre self-assigned this on Jun 29, 2022
@reivilibre (Contributor) left a comment:

This basically looks good to me.

That said:

  • I'm not sure what your question is on about — sorry :/
  • I'm not a fan of giving 503s to clients because they hit a race (but I could be convinced that this can be fixed in a later PR; in that case, maybe it's just time to add a TODO and open an issue?)

Comment on lines 80 to 85
This error should not be exposed to clients.
"""

def __init__(self) -> None:
super().__init__(
HTTPStatus.CONFLICT,
@reivilibre (Contributor):

This being a SynapseError (one with an HTTP status code) if we shouldn't expose it to clients confused me.

I think you're doing this for replication reasons; maybe it would be worth noting that in the docstring to explain why it has a special HTTP response code.

@squahtx (Contributor, author):

I'll update the docstring.

"Room %s was un-partial stated while processing remote join.",
room_id,
)
raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode)
@reivilibre (Contributor):

IIRC this is a bad response code to return because it makes e.g. Cloudflare start to treat us as being down.
Personally, I'm also not the biggest fan of requiring the client to retry it. I think I'd like to see it automatically retried, but I appreciate that will be a pain. :/

Maybe the right thing to do is accept this for now, but add a TODO or open an issue to get around to retrying this in a separate PR, for readability?

@squahtx (Contributor, author):

Is there a more appropriate status code that we can use?

@squahtx (Contributor, author):

I'm still undecided about what to do about the client paths. There are just so many of them :/

@squahtx (Contributor, author):

I've gone for 429 Too Many Requests with a retry_after_ms of 0, which has the closest semantics to what we want and ought to avoid any reverse proxy weirdness.

I think it's okay to expect clients to retry requests. They have to have retry logic anyway, for robustness in the face of bad connectivity. I wouldn't expect this extra case to add more complexity to clients.
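
(For reference, `LimitExceededError` serialises to an HTTP 429 whose body follows the standard Matrix rate-limit shape. A sketch of what a client would see; the `errcode` and `error` values are forwarded from the `PartialStateConflictError` and are assumptions here:)

```python
# Approximate JSON body of the 429 response, as a Python literal.
response_body = {
    "errcode": "M_UNKNOWN",  # forwarded from the PartialStateConflictError
    "error": "Cannot persist partial state event in un-partial stated room",
    "retry_after_ms": 0,  # i.e. "you may retry immediately"
}
```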

"Room %s was un-partial stated during `on_send_membership_event`, trying again.",
room_id,
)
return await self._federation_event_handler.on_send_membership_event(
@reivilibre (Contributor):

so to clarify, there should only ever be one retry needed because once it's un-partial-stated, it can't conflict anymore?

@squahtx (Contributor, author):

That's right, a room can only be un-partial-stated once.
Unless we leave it or purge it, but I don't know what happens in that case, even in the absence of faster room joins.

Comment on lines +1128 to +1129
except self.db_pool.engine.module.IntegrityError as e:
# Assume that any `IntegrityError`s are due to partial state events.
@reivilibre (Contributor):

I wish we could get some way of narrowing this down so we don't have to assume it, but I can't see a way short of matching the error string, which sounds very dodgy.
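
One conceivable way to narrow it on PostgreSQL, sketched purely as an illustration (this isn't something the PR does, and SQLite would still need separate handling), is to check the SQLSTATE code rather than the error string:

```python
import psycopg2
from psycopg2 import errorcodes


def is_foreign_key_violation(err: Exception) -> bool:
    """Best-effort check that an IntegrityError is specifically a foreign
    key violation (SQLSTATE 23503), rather than assuming that every
    IntegrityError is the partial state race."""
    return (
        isinstance(err, psycopg2.IntegrityError)
        and err.pgcode == errorcodes.FOREIGN_KEY_VIOLATION
    )
```

Even then it would only tell us that *a* foreign key failed, not which one, so some assumption would remain.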

)
).addErrback(unwrapFirstError)
raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode)
@reivilibre (Contributor):

(another instance of the 503 that I'm not a fan of; see other thread)

synapse/handlers/federation.py (outdated; resolved)
@reivilibre removed their assignment on Jun 29, 2022
@squahtx (Contributor, author) commented Jun 29, 2022:

> I'm not sure what your question is on about — sorry :/

Which question?

@squahtx (Contributor, author) commented Jun 29, 2022:

Apparently I wrote some comments a week ago but forgot to publish them...

Comment on lines +1376 to +1379
except SynapseError as e:
if e.code == HTTPStatus.CONFLICT:
raise PartialStateConflictError()
raise
@squahtx (Contributor, author):

The way we transport the PartialStateConflictError across replication is pretty ugly. I'm open to alternative suggestions.

@reivilibre (Contributor):

It feels ugly, but it's also straightforward, so it has that going for it. I think it's fine and we can always change it later.


def __init__(self) -> None:
super().__init__(
HTTPStatus.CONFLICT,
@squahtx (Contributor, author):

The 409 Conflict status code here is from the perspective of replication: the replication /send_event or /fed_send_events request includes the event context with the partial state flag, which is in conflict with the current state of the homeserver (and it makes no sense to retry the request).

synapse/handlers/message.py (outdated; resolved)
synapse/handlers/federation.py (outdated; resolved)
"Room %s was un-partial stated during `on_send_membership_event`, trying again.",
room_id,
)
return await self._federation_event_handler.on_send_membership_event(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, a room can only be un-partial-stated once.
Unless we leave it or purge it, but I don't know what happens in that case, even in the absence of faster room joins.

@squahtx requested a review from reivilibre on July 1, 2022 20:13
# TODO(faster_joins): `_should_perform_remote_join` suggests that we may
# do a remote join for restricted rooms even if we have full state.
logger.error(
"Room %s was un-partial stated while processing remote join.",
room_id,
)
-            raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode)
+            raise LimitExceededError(msg=e.msg, errcode=e.errcode, retry_after_ms=0)
@reivilibre (Contributor):

Should we consider opening an issue to talk about whether we want to make this better?
I see what you're saying, but it still feels to me that this is worth thinking about (though I don't want to block this PR on it).
(For sending messages, some clients seem to prompt you to retry if the send fails. I'm not sure about the exact circumstances, but leaving it like this means we'll want to check that, so perhaps defer it to an issue regardless.)

@squahtx (Contributor, author):

That's fair. I've filed it as #13173.


@squahtx (Contributor, author) commented Jul 4, 2022:

CI's failing, pending the merge of #402.

@squahtx (Contributor, author) commented Jul 5, 2022:

TestRestrictedRoomsLocalJoin and TestSendJoinPartialStateResponse are known worker mode flakes: #13161
