Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Host dead after caches seem to break possibly due to multiple room delete commands in short period of time #10792

Closed
jaywink opened this issue Sep 10, 2021 · 3 comments
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. z-p2 (Deprecated Label)

Comments

@jaywink
Copy link
Member

jaywink commented Sep 10, 2021

Description

From the logs it looks like a host admin deleted a room multiple times in a row within a few minutes. Initially this caused 403's for message sending for a bot user, which makes sense, but this soon turned to 500's until restarted, after which a 403 was again returned. Also some time after the room delete activity, a sync started to throw long stacktraces with 500 error. This is unconfirmed to have been fixed after restart, but it seems likely given the stack mentions caches.

From logs:

  • Multiple events of the following events within a few minutes:
    • synapse.handlers.room - 1339 - INFO - DELETE-6639- Shutting down room '!roomid:domain.tld'
    • synapse.handlers.room - 1348 - INFO - DELETE-6639- Kicking '@botuser:domain.tld' from '!roomid:domain.tld'...
  • @botuser:domain.tld starts to see 403
  • Purge events seems to run multiple times for the rooms within a short period, with various events like this
    • synapse.storage.databases.main.purge_events - 412 - INFO - DELETE-6639- [purge] removing !roomid:domain.tld from event_push_summary
  • Room send 403 turns into 500:
> 2021-09-09 18:02:40,129 - synapse.http.server - 93 - ERROR - PUT-6693- Failed handle request via 'RoomSendEventRestServlet': <XForwardedForRequest at 0x7f978dff48b0 method='PUT' uri='/_matrix/client/r0/rooms/!roomid:domain.tld/send/m.room.message/m1627686538.10576?access_token=<redacted>' clientproto='HTTP/1.1' site='8008'>
> Sep 9, 2021 @ 18:02:40.132synapse
Traceback (most recent call last):
Sep 9, 2021 @ 18:02:40.132synapse
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
Sep 9, 2021 @ 18:02:40.132synapse
    result = current_context.run(gen.send, result)
Sep 9, 2021 @ 18:02:40.132synapse
  File "/usr/local/lib/python3.8/site-packages/synapse/rest/client/room.py", line 242, in on_POST
Sep 9, 2021 @ 18:02:40.132synapse
    ) = await self.event_creation_handler.create_and_send_nonmember_event(
Sep 9, 2021 @ 18:02:40.132synapse
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/message.py", line 863, in create_and_send_nonmember_event
Sep 9, 2021 @ 18:02:40.132synapse
    event, context = await self.create_event(
Sep 9, 2021 @ 18:02:40.132synapse
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/message.py", line 623, in create_event
Sep 9, 2021 @ 18:02:40.132synapse
    event, context = await self.create_new_client_event(
Sep 9, 2021 @ 18:02:40.132synapse
  File "/usr/local/lib/python3.8/site-packages/synapse/util/metrics.py", line 91, in measured_func
Sep 9, 2021 @ 18:02:40.132synapse
    r = await func(self, *args, **kwargs)
Sep 9, 2021 @ 18:02:40.132synapse
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/message.py", line 943, in create_new_client_event
Sep 9, 2021 @ 18:02:40.132synapse
    assert (
Sep 9, 2021 @ 18:02:40.132synapse
AssertionError: Attempting to create an event with no prev_events

The admin continues to create and delete rooms (scripted based on number of API calls and them happening over a short period of time for the same rooms). Later sync starts failing for the admin operating on behalf of another user:

2021-09-10 00:20:08,523 - synapse.access.http.8009 - 389 - INFO - GET-27763- 76.104.170.72 - 8009 - {@admin:domain.tld.@someuser:domain.tld} Processed request: 0.088sec/-0.000sec (0.064sec, 0.008sec) (0.045sec/0.162sec/14) 55B 500 "GET /_matrix/client/r0/sync?filter=1&timeout=0&_cacheBuster=1631233209635&access_token=<redacted> HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36" [0 dbevts]

2021-09-10 00:20:15,977 - synapse.http.server - 93 - ERROR - GET-27765- Failed handle request via 'SyncRestServlet': <XForwardedForRequest at 0x7fcd04c930c0 method='GET' uri='/_matrix/client/r0/sync?filter=1&timeout=0&_cacheBuster=1631233217123&access_token=<redacted>' clientproto='HTTP/1.1' site=8009>
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
Traceback (most recent call last):
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1657, in _inlineCallbacks
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    result = current_context.run(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/twisted/python/failure.py", line 500, in throwExceptionIntoGenerator
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    return g.throw(self.type, self.value, self.tb)
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/sync.py", line 362, in _wait_for_sync_for_user
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    result: SyncResult = await self.current_sync_for_user(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/sync.py", line 406, in current_sync_for_user
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    sync_result = await self.generate_sync_result(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/sync.py", line 1072, in generate_sync_result
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    res = await self._generate_sync_entry_for_rooms(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/sync.py", line 1524, in _generate_sync_entry_for_rooms
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    await concurrently_execute(handle_room_entries, room_entries, 10)
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    result = current_context.run(gen.send, result)
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/util/async_helpers.py", line 191, in _concurrently_execute_inner
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    await maybe_awaitable(func(value))
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/sync.py", line 1512, in handle_room_entries
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    res = await self._generate_room_entry(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/sync.py", line 1936, in _generate_room_entry
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    batch = await self._load_filtered_recents(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/handlers/sync.py", line 583, in _load_filtered_recents
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    loaded_recents = await filter_events_for_client(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/visibility.py", line 84, in filter_events_for_client
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    event_id_to_state = await storage.state.get_state_for_events(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/state.py", line 466, in get_state_for_events
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    group_to_state = await self.stores.state._get_state_for_groups(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/state/store.py", line 237, in _get_state_for_groups
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    (member_state, incomplete_groups_m,) = self._get_state_for_groups_using_cache(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/state/store.py", line 301, in _get_state_for_groups_using_cache
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    state_dict_ids, got_all = self._get_state_for_group_using_cache(
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/storage/databases/state/store.py", line 185, in _get_state_for_group_using_cache
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
    cache_entry = cache.get(group)
Sep 10, 2021 @ 00:20:15.979synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/dictionary_cache.py", line 96, in get
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
    return f(*args, **kwargs)
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/lrucache.py", line 483, in cache_get
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
    move_node_to_front(node)
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/lrucache.py", line 439, in move_node_to_front
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
    node.move_to_front(real_clock, list_root)
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/util/caches/lrucache.py", line 294, in move_to_front
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
    self._list_node.move_after(cache_list_root)
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
  File "/usr/local/lib/python3.8/site-packages/synapse/util/linked_list.py", line 93, in move_after
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
    assert self.prev_node
Sep 10, 2021 @ 00:20:15.980synapse-synchrotron
AssertionError

Until Synapse is restarted, the event send graph looks completely dead.

image

Restart seems to have made everything happy again (no more 500, a good old 403 for event sending as we should).

Not sure if the admin would still see a sync error, but looking at stacktrace it's probably fixed by nuking the caches on restart?

Version information

  • Homeserver: EMS customer, please ping for host name and other details
  • Version: v1.42.0
  • Install method: EMS docker images
  • Workers: Synchrotron + initial synchrotron
@anoadragon453
Copy link
Member

anoadragon453 commented Sep 14, 2021

Looks like the suggestion in the linked issue (#10797) is a good way to go: aka deduplicating multiple requests to shutdown a room, and to make the API friendlier to long-running calls.

Flagging as to-discuss as this has come up in multiple occasions recently (though I have no reason to believe this is a new issue, just a potentially pressing one. Especially if it can be triggered easily by EMS customers).

@jaywink in the meantime, EMS may consider deduplicating the requests client-side until Synapse is more robust in this area.

@anoadragon453 anoadragon453 added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Discussion labels Sep 14, 2021
@callahad callahad added the P2 label Sep 16, 2021
@reivilibre
Copy link
Contributor

I haven't looked too deeply into this, but one suggestion might be to put a ResponseCache on it — we do this for various other REST endpoints to deduplicate requests.

@richvdh
Copy link
Member

richvdh commented May 19, 2022

I think this is either fixed by #11223, or a symptom of the more general #11521.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. z-p2 (Deprecated Label)
Projects
None yet
Development

No branches or pull requests

6 participants