This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Doing a rolling restart caused the typing worker to spin at 100% CPU until it was restarted #11750

Closed
reivilibre opened this issue Jan 14, 2022 · 6 comments · Fixed by #15332
Labels: S-Minor (Blocks non-critical functionality, workarounds exist), T-Defect (Bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

reivilibre (Contributor) commented Jan 14, 2022

When performing a rolling restart of matrix.org (updating 5cc41f1 .. 8e8a008), the typing1 worker suddenly started using a lot of CPU.
Judging by eye on Grafana (not from the zoomed-out shots below, which are only meant to show the usual expected trend), it looks like this started just after the main process was restarted (the typing1 worker had not been restarted at that point). However, some synchrotrons and client readers were also restarted around the same time, and the graph granularity is too coarse to pin it down exactly.

It seems to coincide with a high rate of GET ReplicationGetStreamUpdates.


RDATA commands (particularly for caches and events) seem to be queueing up, and that sounds bad.


The oscillatory 'Replication connections' graph seems strange; I don't know exactly what it means right now, but it might be worth looking into.

I restarted the worker myself (at approximately 12:44 UTC) to get it to return to normalcy.

Here is an excerpt of the schedule for the worker restarts, in case that's useful:

2022-01-14 12:03:00,725 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-main.service
2022-01-14 12:03:40,797 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:04:15,073 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:04:49,849 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@client_reader1.service
2022-01-14 12:05:23,509 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:05:57,697 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:06:32,614 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
2022-01-14 12:07:07,650 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@client_reader2.service
2022-01-14 12:07:43,080 - ssh.hippogriff - 42 - INFO - $ systemctl --user restart synapse-worker@<obfuscated>.service
reivilibre self-assigned this on Jan 20, 2022
richvdh (Member) commented Jan 27, 2022

I think this is related to #8524

H-Shay (Contributor) commented Jan 28, 2022

I took the tiniest stab at this but was overwhelmed by the size of the log and the fact that I wasn't really sure what I was looking for; it was a little like looking for a needle in a haystack when you don't know what a needle looks like. But I did find some patterns after the rolling restart (at 12:03) that looked like this:

2022-01-14 12:06:53,067 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:53,385 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:53,792 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:54,652 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:55,100 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:55,728 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607635 and 409608239
2022-01-14 12:06:56,137 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409608091 and 409608239
2022-01-14 12:06:56,556 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 408653351 and 409608209
2022-01-14 12:06:57,088 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:57,467 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239
2022-01-14 12:06:57,807 - synapse.replication.tcp.handler - 554 - INFO - process-replication-data-23562530 - Fetching replication rows for 'caches' between 409607869 and 409608239

Specifically, you can see that the TCP handler is attempting to fetch the same rows for the 'caches' stream over and over (in this case the rows between 409607869 and 409608239, but there are many other examples of ranges it seemed to get stuck on). I have no idea why this happened, but it does explain the graphs a little: CPU gets chewed up trying to fetch the same thing while the other queues pile up. This might be a total red herring, but I thought I'd mention it because the logs will be rotated out very shortly. For the record, this view is the log after filtering out everything but "fetching replication rows", in case that matters.
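
(As an aside, a quick way to quantify that "same rows over and over" pattern is to count repeated token ranges in the log. This is just a hedged sketch: the log file name and the exact message format are assumptions based on the excerpt above, not anything from the original report.)

import re
from collections import Counter

# Matches the handler's "Fetching replication rows for '<stream>' between <a> and <b>" lines.
pattern = re.compile(r"Fetching replication rows for '(\w+)' between (\d+) and (\d+)")

counts = Counter()
with open("homeserver.log") as f:  # hypothetical log file name
    for line in f:
        m = pattern.search(line)
        if m:
            counts[m.groups()] += 1

# Ranges fetched many times point at the stream that is stuck.
for (stream, start, end), n in counts.most_common(10):
    print(f"{stream}: {start} -> {end} fetched {n} times")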

clokep added the S-Minor and T-Defect labels on Feb 1, 2022
reivilibre (Contributor, Author) commented:
Nothing conclusive, just suspicious and making a note.

The typing stream starting at 0 doesn't seem super related on its own, but the code below makes it look like 'stream catchup' behaviour gets triggered if the stream goes backwards.

        missing_updates = cmd.prev_token != current_token
        while missing_updates:
            logger.info(
                "Fetching replication rows for '%s' between %i and %i",
                stream_name,
                current_token,
                cmd.new_token,
            )
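
To spell out the 'goes backwards' concern: catch-up is driven purely by a token mismatch, so a stream position that moves backwards (for example, a stream restarting from 0) looks exactly like missing updates. The following is a minimal, self-contained sketch of that condition only; the names are stand-ins for current_token and cmd.prev_token above, not Synapse's actual code.

def needs_catchup(local_token: int, advertised_prev_token: int) -> bool:
    # Catch-up triggers whenever the tokens disagree, which includes the case
    # where the writer restarted and now advertises a *lower* position.
    return advertised_prev_token != local_token

# Normal case: positions agree, nothing to fetch.
assert not needs_catchup(local_token=500, advertised_prev_token=500)

# Stream went backwards after a restart: every follower sees a mismatch and
# starts fetching "missing" rows, producing repeated log lines like the ones
# quoted earlier in this thread.
assert needs_catchup(local_token=500, advertised_prev_token=0)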

reivilibre (Contributor, Author) commented:
I was a little wiser than usual and took a py-spy --gil of the process today whilst it was spinning away.

About 20% of the time was going to each of these lines: 426, 427 and 434 in get_all_typing_updates:

        changed_rooms = self._typing_stream_change_cache.get_all_entities_changed(last_id)

        if changed_rooms is None:
            changed_rooms = self._room_serials

        rows = []
        for room_id in changed_rooms:
            serial = self._room_serials[room_id]
            if last_id < serial <= current_id:
                typing = self._room_typing[room_id]
                rows.append((serial, [room_id, list(typing)]))
        rows.sort()

        limited = False
        # We, unusually, use a strict limit here as we have all the rows in
        # memory rather than pulling them out of the database with a `LIMIT ?`
        # clause.
        if len(rows) > limit:
            rows = rows[:limit]
            current_id = rows[-1][0]
            limited = True

        return rows, current_id, limited

    def process_replication_rows(
        self, token: int, rows: List[TypingStream.TypingStreamRow]
    ) -> None:
        # The writing process should never get updates from replication.
        raise Exception("Typing writer instance got typing info over replication")

Of course that's just a symptom — probably of the fact that the position resets back to zero, but at least mildly interesting to see what's costly here.
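
To make that concrete, here is a rough, self-contained model of the same loop (the data and numbers are invented for the sketch; this is not the real handler). The point is that a follower asking for updates from last_id=0 forces a scan of every room ever recorded in the serial map, on every poll, even though only `limit` rows come back.

from typing import Dict, List, Set, Tuple

def get_typing_rows(
    room_serials: Dict[str, int],
    room_typing: Dict[str, Set[str]],
    last_id: int,
    current_id: int,
    limit: int,
) -> Tuple[List[Tuple[int, list]], int, bool]:
    rows = []
    # With last_id reset to 0, *every* room the process has ever seen typing
    # in matches this filter, so the full scan is paid on every request.
    for room_id, serial in room_serials.items():
        if last_id < serial <= current_id:
            rows.append((serial, [room_id, list(room_typing[room_id])]))
    rows.sort()
    limited = False
    if len(rows) > limit:
        rows = rows[:limit]
        current_id = rows[-1][0]
        limited = True
    return rows, current_id, limited

# Example: 50,000 rooms with old serials and a follower asking from token 0.
serials = {f"!room{i}:example.org": i + 1 for i in range(50_000)}
typing = {room: {"@alice:example.org"} for room in serials}
rows, upto, limited = get_typing_rows(serials, typing, last_id=0, current_id=50_000, limit=100)
print(len(rows), upto, limited)  # 100 100 True -- only 100 rows returned, but 50,000 scanned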

reivilibre removed their assignment on Apr 7, 2022
DMRobertson (Contributor) commented:
Have we seen this recently? If not, I'd be inclined to close unless it resurfaces.

DMRobertson (Contributor) commented:
Looks like it:

[Screenshot from 2022-05-09 15-44-29]

erikjohnston added a commit that referenced this issue Mar 27, 2023
Rather than keeping them around forever in memory, slowing things down.

Fixes #11750.
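
The commit message above suggests the eventual fix is to stop keeping per-room typing state in memory forever. Purely as an illustration of that idea (this is not the actual change in #15332; the function name and the retention rule are assumptions made for the sketch), pruning might look something like this:

from typing import Dict, Set

def prune_stale_typing_rooms(
    room_serials: Dict[str, int],
    room_typing: Dict[str, Set[str]],
    current_id: int,
    keep_last: int = 10_000,
) -> None:
    # Drop rooms whose last typing update is more than `keep_last` serials
    # behind the current stream position, so the maps stop growing forever.
    cutoff = current_id - keep_last
    for room_id in [r for r, serial in room_serials.items() if serial < cutoff]:
        room_serials.pop(room_id, None)
        room_typing.pop(room_id, None)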