This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

many duplicate _get_state_groups_from_groups queries leading to OOMs #10301

Open

richvdh opened this issue Jul 2, 2021 · 24 comments
Labels
A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@richvdh
Member

richvdh commented Jul 2, 2021

We had a federation sender instance which died. On restart, it rapidly consumed all available ram and OOMed again.

Inspection from postgres side shows it is doing many duplicate queries of the form

```sql
WITH RECURSIVE state(state_group) AS (
    VALUES(3405820::bigint)
    UNION ALL
    SELECT prev_state_group
    FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT ON (type, state_key) type, state_key, event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
ORDER BY type, state_key, state_group DESC
```

This query is _get_state_groups_from_groups, which is called from _get_state_for_groups. Although the latter has a cache, if many threads hit it at the same time, all will find the cache empty and go on to hit the database. I think we need a zero-timeout ResponseCache on _get_state_groups_from_groups.
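As a rough, hypothetical sketch of what a zero-timeout deduplication layer would buy us (the class and method names below are made up for illustration and are not Synapse's actual ResponseCache API): callers that overlap in time share a single in-flight database query, and nothing is cached once the result has been delivered.

```python
# Hypothetical sketch (not Synapse's actual ResponseCache): deduplicate
# concurrent lookups so that only one query per key is in flight at a time.
# With a zero timeout, the entry is dropped as soon as the result arrives,
# so nothing is cached beyond the lifetime of the in-flight call.
import asyncio
from typing import Awaitable, Callable, Dict, Hashable, TypeVar

T = TypeVar("T")


class ZeroTimeoutDeduplicator:
    def __init__(self) -> None:
        # key -> the single in-flight task computing the result for that key
        self._in_flight: Dict[Hashable, asyncio.Task] = {}

    async def wrap(self, key: Hashable, func: Callable[[], Awaitable[T]]) -> T:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real work.
            task = asyncio.ensure_future(func())
            self._in_flight[key] = task
            # Forget the entry as soon as the result is available ("zero timeout").
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # Later callers that arrive while the task is pending share its result
        # instead of issuing a duplicate query; shield() stops one caller's
        # cancellation from cancelling the shared task.
        return await asyncio.shield(task)
```

Keyed on the requested state groups, something like this would mean a restarting federation sender that fires many identical lookups at once hits the database only once per distinct key.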

@richvdh
Member Author

richvdh commented Jul 8, 2021

This is currently complicated by the fact that the code does some batching of lookups. It's not obvious that the batching achieves much (at least on Postgres), so we could maybe strip it out.
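For illustration only (hypothetical names again, building on the ZeroTimeoutDeduplicator sketch above; fetch_one_group is an assumed helper, not an existing Synapse function): if the batched lookup were split into per-group fetches, each state group could be deduplicated independently, so two overlapping callers asking for overlapping-but-different batches would still share the common work. The trade-off is losing whatever the batching currently buys on the database side.

```python
# Hypothetical sketch: per-group deduplication instead of per-batch queries.
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List, Tuple

StateRow = Tuple[str, str, str]  # (type, state_key, event_id)


async def get_state_for_groups(
    dedup: "ZeroTimeoutDeduplicator",
    fetch_one_group: Callable[[int], Awaitable[List[StateRow]]],  # assumed helper
    groups: Iterable[int],
) -> Dict[int, List[StateRow]]:
    group_list = list(groups)
    # Each group is deduplicated on its own key, so concurrent callers whose
    # batches merely overlap still issue only one query per distinct group.
    rows = await asyncio.gather(
        *(dedup.wrap(g, lambda g=g: fetch_one_group(g)) for g in group_list)
    )
    return dict(zip(group_list, rows))
```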

@richvdh richvdh added S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. labels Jul 22, 2021
@jaywink
Member

jaywink commented Jul 30, 2021

Not sure if this could be related, but in EMS land we've noticed with the latest few Synapse releases that joining HQ sometimes puts a small host into an OOM loop from which it never recovers. Previously, hosts took a much longer time joining HQ and OOMed for a while trying to do it, but eventually stabilized. Lately it feels like the host sometimes never recovers until we give it more headroom in RAM.

The limit we have for these kinds of small hosts is 1GB before the cluster kills the host to protect server stability. The issue we've seen lately is resolved by raising that limit to 1.5GB, which seems to be enough to process whatever it is trying (and immediately failing) to do.

@gergelypolonkai
Contributor

During the last two days this has been affecting my single-user server, too. I'm not in big rooms like HQ (the biggest one has around 200 users). I can't connect this to any event, like joining a room; it just happened out of the blue. Since then I can barely send messages to any room, and my sync requests take a really long time, if they succeed at all.

It also seems to "come to its senses" every now and then; during those times everything works as if nothing had happened, but it doesn't take long, maybe a few minutes, before it goes back to PostgreSQL hell.

I tried reindexing and vacuuming my DB, hoping that would speed up these queries, but to no avail.

Until this gets fixed, is it safe (and useful) to downgrade to 1.46 (the version I used before 1.48.0)?

Also, if I can help with any debug data, let me know.

@daenney
Contributor

daenney commented Dec 14, 2021

The issue predates 1.46, so I wouldn't assume downgrading is going to help.

@gergelypolonkai
Contributor

That's strange, because our company HS works just fine with 1.46.

Also, does this mean there's no workaround available? Is there anything one can do to keep using Synapse until it gets fixed?

@gergelypolonkai
Contributor

FTR, this also seems to affect the WhatsApp bridge somehow, as not all my messages get forwarded to WhatsApp.

@reivilibre
Contributor

@gergelypolonkai

Have you verified that the queries causing you trouble are the same as the ones in the description of this issue?
(I just want to make sure we're talking about the same issue and not perhaps about something similar.)

Although this issue existed before 1.46, it's always possible that something new has aggravated it further for you, so you're welcome to try a downgrade; we always try to keep at least one version of rollback possible. Synapse won't start up if you roll back too far, so it's harmless to attempt returning to 1.46 (or 1.47).

It seems like you can roll back to 1.46 (and further, if you wanted to), as the database is compatible. If you'd like to try and report back, that could be useful (and if you're lucky you might get your server back, which would give us some time to investigate what's going on).

@gergelypolonkai
Contributor

The only thing that doesn't match in my query is the constant in the ::bigint part; mine has 9690528 instead of 3405820.

@callahad
Contributor

@gergelypolonkai Do you think you could try rolling back to 1.46 and seeing how that works for you?

@gergelypolonkai
Contributor

I just did that. After starting it, things feel better (at least message sending doesn't time out), but let me use it for a few hours before I jump to conclusions.

@gergelypolonkai
Contributor

Nope, after like 15 minutes it's the same 😔

Let me know if I can help with anything else; I'm happy to help when I'm behind my keyboard.

@reivilibre
Contributor

Thanks for trying! How are you installing Synapse? (wondering in case you'd be willing to try a branch to see if something improves the situation for you)

@gergelypolonkai
Contributor

I'm using virtualenv/pip install on bare metal. So sure, send me the branch name and I can easily give it a try.

@reivilibre
Contributor

@gergelypolonkai The branch rei/p/stcache contains a way of deduplicating these queries, which might help. (Though I am surprised to see that you're having this issue on a single-user homeserver, to be honest!) It's been running on librepush.net since yesterday (with additional code that runs the old implementation and verifies both give the same result), plus I've tried to be reasonably paranoid with the testing. You're welcome to give it a try.

@gergelypolonkai
Contributor

@reivilibre I'm also surprised that I'm affected, not just because my HS is single-user but because all the rooms I participate in are small (<250 users) and most of them don't have a long state history.

FTR, here's what I used, in case someone with less Python-fu wants to give it a try:

pip install git+https://github.com/matrix-org/synapse.git@rei/p/stcache

It installed smoothly and started up. I’ll check back within a few hours to let you know if it looks good from my server.

@gergelypolonkai
Contributor

Sorry for not coming back earlier; yesterday I had a terrible migraine.

I can still see the query occasionally firing (almost every time I switch rooms in Nheko). However, it feels smoother; at least sending messages isn't slowed down, which is a great win in my book.

@foxcris

This comment has been minimized.

@daenney

This comment has been minimized.

@foxcris

This comment has been minimized.

@richvdh
Member Author

richvdh commented Jan 7, 2022

> This sounds like the recursive part of the query never gets to a point where it no longer returns a tuple, causing it to run forever. That would almost suggest some kind of cyclical relationship we're unable to break?

Let's be clear: this issue is about the fact that we run multiple identical queries in parallel, which is inefficient even when the queries themselves perform correctly.

If the queries aren't terminating (or if you're not seeing identical queries with identical constants in the VALUES(3405820::bigint) part), then it's a separate bug; please open a different issue. (Though see also #9826 and #7772, both of which may be related.)

@foxcris

foxcris commented Jan 8, 2022

@richvdh: Thanks for the clarification. I also have multiple queries running, but the constants are different. Since they are not terminating, I will open a new issue for this.

@gergelypolonkai
Contributor

@reivilibre does your branch get updated from master occasionally? I just upgraded to 1.54 and this issue persists; can I use your branch without essentially downgrading (not that I mind if it does mean a downgrade)?

@gergelypolonkai
Contributor

Upgrading to 1.54, and thus reverting this change, caused a significant performance regression, so I went back to this branch.

@reivilibre
Contributor

reivilibre commented Apr 20, 2022

This was biting again, so I've updated the branch to 1.57.0 (as a new branch: rei/p/stcache-1.57; merely reverting #12126) to get us out of a pickle. Assuming this helps us out with the current situation, we may want to prioritise doing it properly and getting something akin to this solution mainlined.

Edit: it seemed to help a bit, but we still had OOMing afterwards.

@reivilibre reivilibre removed their assignment Apr 27, 2022
@MadLittleMods MadLittleMods added the A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db label Dec 23, 2022