This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Doing a rolling restart caused the typing worker to spin at 100% CPU until it was restarted #11750

Closed
reivilibre opened this issue Jan 14, 2022 · 6 comments · Fixed by #15332
Labels: S-Minor (Blocks non-critical functionality, workarounds exist), T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

reivilibre (Contributor) commented Jan 14, 2022

When performing a rolling restart of matrix.org (updating 5cc41f1 .. 8e8a008), the typing1 worker suddenly started using a lot of CPU.
Judging by eye on Grafana (not from the zoomed-out shots below, which are only meant to show the usual expected trend), it looks like this started just after the main process was restarted (the typing1 worker had not been restarted at that point). However, some synchrotrons and client readers were also restarted around the same time, and the graph granularity is too coarse to pin it down exactly.

It seems to coincide with a high rate of GET ReplicationGetStreamUpdates.


RDATA commands (particularly for caches and events) seem to be queueing up, and that sounds bad.


The oscillatory 'Replication connections' graph seems strange; I don't know exactly what it means right now, but it might be worth looking into.

I restarted the worker myself (at approximately 12:44 UTC) to get it to return to normalcy.

Here is an excerpt of the schedule for the worker restarts, in case that's useful:

2022-01-14 12:03:00,725 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-main.service
2022-01-14 12:03:40,797 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:04:15,073 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:04:49,849 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@client_reader1.service
2022-01-14 12:05:23,509 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:05:57,697 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:06:32,614 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:07:07,650 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@client_reader2.service
2022-01-14 12:07:43,080 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
reivilibre self-assigned this on Jan 20, 2022
richvdh (Member) commented Jan 27, 2022

I think this is related to #8524

H-Shay (Contributor) commented Jan 28, 2022

I took the tiniest stab at this but was overwhelmed by the size of the log and the fact that I wasn't really sure what I was looking for; it was a little like looking for a needle in a haystack when you don't know what a needle looks like. But I did find some patterns after the rolling restart (at 12:03) that looked like this:

2022-01-14 12:06:53,067 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:53,385 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:53,792 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:54,652 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:55,100 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:55,728 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607635 and 409608239
2022-01-14 12:06:56,137 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409608091 and 409608239
2022-01-14 12:06:56,556 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 408653351 and 409608209
2022-01-14 12:06:57,088 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:57,467 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:57,807 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239

Specifically, you can see that the TCP handler is attempting to fetch the same rows for the 'caches' stream over and over (in this case the rows between 409607869 and 409608239, but there are many other examples of ranges it seemed to get stuck on). I have no idea why this happened, but it does explain the graphs a little: CPU gets chewed up trying to fetch the same thing while the other queues pile up. This might be a total red herring, but I thought I'd mention it because the logs will be rotated out very shortly. For the record, this view is the log after filtering out everything but "fetching replication rows", in case that matters.
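
(As an aside, a quick way to quantify that "same rows over and over" pattern is to count repeated token ranges in the log. This is just a hedged sketch: the log file name and the exact message format are assumptions based on the excerpt above, not anything from the original report.)

import re
from collections import Counter

# Matches the handler's "Fetching replication rows for '<stream>' between <a> and <b>" lines.
pattern = re.compile(r"Fetching replication rows for '(\w+)' between (\d+) and (\d+)")

counts = Counter()
with open("homeserver.log") as f:  # hypothetical log file name
    for line in f:
        m = pattern.search(line)
        if m:
            counts[m.groups()] += 1

# Ranges fetched many times point at the stream that is stuck.
for (stream, start, end), n in counts.most_common(10):
    print(f"{stream}: {start} -> {end} fetched {n} times")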

clokep added the S-Minor and T-Defect labels on Feb 1, 2022
reivilibre (Contributor, Author) commented:
Nothing conclusive, just suspicious and making a note.

The typing stream starting at 0 doesn't seem super related on its own, but the code below makes it look like 'stream catchup' behaviour gets triggered if the stream goes backwards.

        missing_updates = cmd.prev_token != current_token
        while missing_updates:
            logger.info(
                "Fetching replication rows for '%s' between %i and %i",
                stream_name,
                current_token,
                cmd.new_token,
            )
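
To spell out the 'goes backwards' concern: catch-up is driven purely by a token mismatch, so a stream position that moves backwards (for example, a stream restarting from 0) looks exactly like missing updates. The following is a minimal, self-contained sketch of that condition only; the names are stand-ins for current_token and cmd.prev_token above, not Synapse's actual code.

def needs_catchup(local_token: int, advertised_prev_token: int) -> bool:
    # Catch-up triggers whenever the tokens disagree, which includes the case
    # where the writer restarted and now advertises a *lower* position.
    return advertised_prev_token != local_token

# Normal case: positions agree, nothing to fetch.
assert not needs_catchup(local_token=500, advertised_prev_token=500)

# Stream went backwards after a restart: every follower sees a mismatch and
# starts fetching "missing" rows, producing repeated log lines like the ones
# quoted earlier in this thread.
assert needs_catchup(local_token=500, advertised_prev_token=0)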

reivilibre (Contributor, Author) commented:
I was a little wiser than usual and took a py-spy --gil of the process today whilst it was spinning away.

About 20% of the time was going to each of these lines: 426, 427 and 434 in get_all_typing_updates:

        changed_rooms = self._typing_stream_change_cache.get_all_entities_changed(last_id)

        if changed_rooms is None:
            changed_rooms = self._room_serials

        rows = []
        for room_id in changed_rooms:
            serial = self._room_serials[room_id]
            if last_id < serial <= current_id:
                typing = self._room_typing[room_id]
                rows.append((serial, [room_id, list(typing)]))
        rows.sort()

        limited = False
        # We, unusually, use a strict limit here as we have all the rows in
        # memory rather than pulling them out of the database with a `LIMIT ?`
        # clause.
        if len(rows) > limit:
            rows = rows[:limit]
            current_id = rows[-1][0]
            limited = True

        return rows, current_id, limited

    def process_replication_rows(
        self, token: int, rows: List[TypingStream.TypingStreamRow]
    ) -> None:
        # The writing process should never get updates from replication.
        raise Exception("Typing writer instance got typing info over replication")

Of course that's just a symptom — probably of the fact that the position resets back to zero, but at least mildly interesting to see what's costly here.
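
To make that concrete, here is a rough, self-contained model of the same loop (the data and numbers are invented for the sketch; this is not the real handler). The point is that a follower asking for updates from last_id=0 forces a scan of every room ever recorded in the serial map, on every poll, even though only `limit` rows come back.

from typing import Dict, List, Set, Tuple

def get_typing_rows(
    room_serials: Dict[str, int],
    room_typing: Dict[str, Set[str]],
    last_id: int,
    current_id: int,
    limit: int,
) -> Tuple[List[Tuple[int, list]], int, bool]:
    rows = []
    # With last_id reset to 0, *every* room the process has ever seen typing
    # in matches this filter, so the full scan is paid on every request.
    for room_id, serial in room_serials.items():
        if last_id < serial <= current_id:
            rows.append((serial, [room_id, list(room_typing[room_id])]))
    rows.sort()
    limited = False
    if len(rows) > limit:
        rows = rows[:limit]
        current_id = rows[-1][0]
        limited = True
    return rows, current_id, limited

# Example: 50,000 rooms with old serials and a follower asking from token 0.
serials = {f"!room{i}:example.org": i + 1 for i in range(50_000)}
typing = {room: {"@alice:example.org"} for room in serials}
rows, upto, limited = get_typing_rows(serials, typing, last_id=0, current_id=50_000, limit=100)
print(len(rows), upto, limited)  # 100 100 True -- only 100 rows returned, but 50,000 scanned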

reivilibre removed their assignment on Apr 7, 2022
DMRobertson (Contributor) commented:
Have we seen this recently? If not, I'd be inclined to close unless it resurfaces.

DMRobertson (Contributor) commented:
Looks like it:

[Screenshot from 2022-05-09 15-44-29]

erikjohnston added a commit that referenced this issue Mar 27, 2023
Rather than keeping them around forever in memory, slowing things down.

Fixes #11750.
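
The commit message above suggests the eventual fix is to stop keeping per-room typing state in memory forever. Purely as an illustration of that idea (this is not the actual change in #15332; the function name and the retention rule are assumptions made for the sketch), pruning might look something like this:

from typing import Dict, Set

def prune_stale_typing_rooms(
    room_serials: Dict[str, int],
    room_typing: Dict[str, Set[str]],
    current_id: int,
    keep_last: int = 10_000,
) -> None:
    # Drop rooms whose last typing update is more than `keep_last` serials
    # behind the current stream position, so the maps stop growing forever.
    cutoff = current_id - keep_last
    for room_id in [r for r, serial in room_serials.items() if serial < cutoff]:
        room_serials.pop(room_id, None)
        room_typing.pop(room_id, None)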