
synapse 1.47 big jump in load due to remove_hidden_devices_from_device_inbox #11401

Closed
2 tasks done
skepticalwaves opened this issue Nov 19, 2021 · 13 comments
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Release-Blocker Must be resolved before making a release

Comments

@skepticalwaves

skepticalwaves commented Nov 19, 2021

Edit: Tracking expected mitigations in:

See discussion below for context


Description

After upgrading to synapse-1.47, via https://github.com/spantaleev/matrix-docker-ansible-deploy, my server experienced a huge increase in both CPU and IO load.

Upon examining the synapse prometheus/grafana stats, I found the following entries:

A large increase in master_0_background_updates in DB transactions
[screenshot]

And in particular, master-0_remove_hidden_device_from_inbox
[screenshot]

It may be related that the HTTP pusher distribution also changed oddly:
[screenshot]

Steps to reproduce

  • upgraded from 1.46 to 1.47

Version information

@dklimpel
Contributor

There is a background job to clean hidden devices from device_inbox.
Depending on the size of the table, this may take a while.
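
For anyone wanting to check whether that job is still pending, here is a rough sketch that reads Synapse's background_updates table directly. The update name is taken from this issue's title and the connection string is a placeholder, so adjust both for your own deployment:

# Sketch: check whether the hidden-devices cleanup is still queued or running
# by querying the background_updates table. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse_user host=localhost")
with conn.cursor() as cur:
    cur.execute(
        "SELECT update_name, progress_json FROM background_updates"
        " WHERE update_name = %s",
        ("remove_hidden_devices_from_device_inbox",),
    )
    row = cur.fetchone()
conn.close()

if row is None:
    print("Background update is no longer listed, i.e. it has completed.")
else:
    print(f"Still pending: {row[0]}, progress: {row[1]}")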

@skepticalwaves
Author

Would that be:

Fix a long-standing bug where messages in the device_inbox table for deleted devices would persist indefinitely. Contributed by @dklimpel and @JohannesKleine. (#10969, #11212)?

@dklimpel
Contributor

The job for deleted devices runs after this one.

The job for hidden devices is

Delete to_device messages for hidden devices that will never be read, reducing database size. (#11199)

@skepticalwaves
Author

The last time a change was published which was expected to produce unexpected load, it was documented:
https://matrix-org.github.io/synapse/develop/upgrade#re-indexing-of-events-table-on-postgres-databases

@skepticalwaves
Author

Two days later, I still have elevated load from this specific process.

@richvdh richvdh changed the title synapse 1.47 big jump in load due to remove_hidden_device_from_inbox synapse 1.47 big jump in load due to remove_hidden_devices_from_device_inbox Nov 22, 2021
@richvdh
Member

richvdh commented Nov 22, 2021

we've observed a similar problem on one of the hosts on EMS.

expected to produce unexpected load

unfortunately this unexpected load is unexpected.

@richvdh
Member

richvdh commented Nov 22, 2021

I had some success by reducing MINIMUM_BACKGROUND_BATCH_SIZE to 1. It's possible to do that either by editing the source and restarting, or via the manhole with hs.get_datastores().databases[0].updates.MINIMUM_BACKGROUND_BATCH_SIZE = 1 (which won't persist over a restart).

@matrix-org/synapse-core: any reason not to reduce MINIMUM_BACKGROUND_BATCH_SIZE ?
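
For context on why lowering the floor can help: the updater scales its batch size to hit a target duration, but never drops below MINIMUM_BACKGROUND_BATCH_SIZE, so a slow per-row query still gets run over a full minimum-sized batch. A simplified illustration of that behaviour (the constants and function here are made up for the example, not Synapse's actual code):

# Simplified illustration of a duration-targeted batch sizer with a minimum
# floor; not Synapse's actual implementation.

TARGET_DURATION_MS = 100.0            # desired wall-clock time per batch (assumed)
MINIMUM_BACKGROUND_BATCH_SIZE = 100   # the floor being discussed here

def next_batch_size(last_batch_size: int, last_duration_ms: float) -> int:
    """Scale the next batch towards the target duration, but never go
    below the configured minimum."""
    if last_duration_ms <= 0:
        return MINIMUM_BACKGROUND_BATCH_SIZE
    items_per_ms = last_batch_size / last_duration_ms
    return max(int(TARGET_DURATION_MS * items_per_ms), MINIMUM_BACKGROUND_BATCH_SIZE)

# If the last batch of 100 rows took 30 seconds (300 ms per row), the sizer
# wants ~0 rows next time, but the floor pushes it back up to 100. With the
# floor set to 1, it would back off to a single row instead.
print(next_batch_size(100, 30_000))   # -> 100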

@richvdh richvdh added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Nov 22, 2021
@erikjohnston
Member

I had some success by reducing MINIMUM_BACKGROUND_BATCH_SIZE to 1. It's possible to do that either by editing the source and restarting, or via the manhole with hs.get_datastores().databases[0].updates.MINIMUM_BACKGROUND_BATCH_SIZE = 1 (which won't persist over a restart).

@matrix-org/synapse-core: any reason not to reduce MINIMUM_BACKGROUND_BATCH_SIZE ?

The reason for having a limit is to make sure a decent amount of progress is made at each step, with a conscious trade-off that it's better to consume more resources and have the background update finish in a sensible time frame than to effectively never have it finish at all. I don't have a particular objection to removing the minimum, but I do worry it's not going to help that much if the underlying queries are sloooooooooow.

In this particular case it looks like the query is going slow:

matrix=> explain SELECT device_id, user_id, stream_id
                FROM device_inbox
                WHERE
                    stream_id >= 10000
                    AND (device_id, user_id) IN (
                        SELECT device_id, user_id FROM devices WHERE hidden = true
                    )
                ORDER BY stream_id
                LIMIT 100;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1001.29..240940.97 rows=100 width=47)
   ->  Gather Merge  (cost=1001.29..7016749017.28 rows=2924380 width=47)
         Workers Planned: 2
         ->  Nested Loop  (cost=1.27..7016410471.24 rows=1218492 width=47)
               ->  Parallel Index Scan using device_inbox_stream_id_user_id on device_inbox  (cost=0.70..2394883832.61 rows=1236491797 width=47)
                     Index Cond: (stream_id >= 10000)
               ->  Index Scan using device_uniqueness on devices  (cost=0.56..3.74 rows=1 width=38)
                     Index Cond: ((user_id = device_inbox.user_id) AND (device_id = device_inbox.device_id))
                     Filter: hidden
(9 rows)

I think this is due to the DB having very few hidden devices, so it's walking many rows of the device_inbox table before it finds an entry for a hidden device. We probably want to change it to find hidden devices first and then delete the rows associated with each device.
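
For illustration, a rough sketch of that "hidden devices first" approach, written against a generic DB-API connection. This is just the idea in code, not the change that eventually landed; the function name, the batching, and the assumption of a usable (user_id, device_id) index on device_inbox are mine:

# Sketch of the inverted approach: enumerate the (small) set of hidden
# devices first, then delete their device_inbox rows directly, instead of
# scanning device_inbox looking for the rare hidden device.

def remove_hidden_devices_from_device_inbox(conn, device_batch_size: int = 100) -> int:
    """Delete device_inbox rows for one batch of hidden devices and return
    the number of rows removed. A real background update would also track
    which devices it has already processed so it can resume and terminate."""
    with conn.cursor() as cur:
        # The devices table is comparatively small, so finding hidden
        # devices is cheap.
        cur.execute(
            "SELECT user_id, device_id FROM devices WHERE hidden = true LIMIT %s",
            (device_batch_size,),
        )
        hidden_devices = cur.fetchall()

        deleted = 0
        for user_id, device_id in hidden_devices:
            # Each delete is keyed on (user_id, device_id), which avoids the
            # wide stream_id index scan seen in the query plan above
            # (assuming device_inbox has an index led by those columns).
            cur.execute(
                "DELETE FROM device_inbox WHERE user_id = %s AND device_id = %s",
                (user_id, device_id),
            )
            deleted += cur.rowcount
    conn.commit()
    return deleted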

@richvdh
Member

richvdh commented Nov 23, 2021

a conscious trade off that its better to consume more resources and have the background update finish in a sensible time frame than effectively never have it finish at all.

If we have a slow query which needs optimising, it feels like it's better that it goes slowly (and we can optimise it in the next release) than that it takes out the entire homeserver by forging on and doing 100 rows anyway.

I think this is due to the DB having very few hidden devices, so it's walking many rows of the device_inbox table before it finds an entry for a hidden device. We probably want to change it to find hidden devices first and then delete the rows associated with each device.

That's a good idea, though I think we might have a similar problem with remove_deleted_devices_from_device_inbox which isn't going to be as amenable to the same technique.

@richvdh richvdh added the X-Release-Blocker Must be resolved before making a release label Nov 23, 2021
@richvdh
Member

richvdh commented Nov 24, 2021

Tasks to do to resolve this: (edit, see PR description)

This is with @babolivier; he has the details of ideas for how to achieve them :)

@richvdh
Member

richvdh commented Nov 26, 2021

I think we can close this now that #11421 and #11422 have landed. (For those afflicted: try 1.48.0rc1, it should be much better)

@richvdh richvdh closed this as completed Nov 26, 2021
@babolivier
Contributor

Ah yeah I meant to close it but forgot, thanks.

@skepticalwaves
Author

The load finally stopped and I see the queries have been optimized.
