Improve performance of remove_hidden_devices_from_device_inbox
#11420
ORDER BY stream_id
user_id >= ?
AND hidden = ?
ORDER BY user_id
LIMIT ?
This limit will mean that we may not find all the device_inbox rows for a given user: we may only end up deleting half of them before deciding to move on to the next user.
I think a better strategy is to remove the join against device_inbox, and just look for hidden devices, without worrying about whether they have device_inbox rows. Then, do a DELETE FROM device_inbox for each such device.
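A minimal sketch of that strategy, for illustration only. It uses Python's sqlite3 rather than the project's own storage layer, and the function name, the `device_id` column, and the `(user_id, device_id)` paging position are assumptions that do not come from the PR:

```python
import sqlite3


def delete_inbox_for_hidden_devices(conn, last_user_id="", last_device_id="", batch_size=100):
    """Handle one batch of hidden devices (hypothetical helper).

    Returns the (user_id, device_id) of the last device handled, or None
    once no hidden devices remain after the given position.
    """
    # Look only at the devices table: no join against device_inbox.
    rows = conn.execute(
        """
        SELECT user_id, device_id FROM devices
        WHERE hidden = 1
          AND (user_id > ? OR (user_id = ? AND device_id > ?))
        ORDER BY user_id, device_id
        LIMIT ?
        """,
        (last_user_id, last_user_id, last_device_id, batch_size),
    ).fetchall()

    # Delete every pending message for each hidden device in the batch.
    for user_id, device_id in rows:
        conn.execute(
            "DELETE FROM device_inbox WHERE user_id = ? AND device_id = ?",
            (user_id, device_id),
        )
    conn.commit()

    return rows[-1] if rows else None
```

The caller would feed the returned position back in on the next run. Note that each DELETE here removes all of a device's rows in one statement, however many there are.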
I was going to do this initially, but the original docstring mentions that it might be a bit heavy if the device has tons of pending messages in its inbox. For example, on abolivier.bzh's database, I've got devices with over 60k rows in device_inbox (which aren't hidden devices but that's probably because I've already run the previous incarnation of this update), which sounds like a lot to delete in one go.
On top of that, I don't see how we might not delete entries for all devices for a given user, given the condition in the query is user_id >= ?, so if we don't do every device from a user we should just do the rest in the next run?
> On top of that, I don't see how we might not delete entries for all devices for a given user, given the condition in the query is user_id >= ?, so if we don't do every device from a user we should just do the rest in the next run?

Oh sorry, you're right.
However, from your analysis in the PR comment:
-> Index Only Scan using device_inbox_user_stream_id on device_inbox (cost=0.54..100217.89 rows=109879 width=42) (actual time=0.141..92.477 rows=101229 loops=1)
Heap Fetches: 6968
This is no good. It's a scan of the entirety of the device_inbox_user_stream_id index: there is no Index Cond here. It's acceptable for your local db with 100k rows, but won't be for a larger db.
Honestly, looking at this again, I think we're better off rewriting it again (sorry!) to do the same as #11421 (i.e., walk through the device_inbox table for a sequence of stream_ids). Hell, why not combine it with #11421?
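As a rough illustration of walking device_inbox by stream_id, here is a sketch using Python's sqlite3. It is not the code from #11421: the fixed-width window, the function name, and the `device_id` column are assumptions made for the example.

```python
import sqlite3


def delete_hidden_rows_in_window(conn, last_stream_id, window=1000):
    """Process one fixed-width window of stream_ids (hypothetical helper).

    Returns (new_last_stream_id, finished).
    """
    # Highest stream_id present before this pass; used to detect completion.
    max_stream_id = conn.execute(
        "SELECT COALESCE(MAX(stream_id), 0) FROM device_inbox"
    ).fetchone()[0]

    upper = last_stream_id + window

    # Only rows inside (last_stream_id, upper] are considered, so the work
    # per pass is bounded by the window rather than by the table size.
    conn.execute(
        """
        DELETE FROM device_inbox
        WHERE stream_id > ? AND stream_id <= ?
          AND EXISTS (
            SELECT 1 FROM devices
            WHERE devices.user_id = device_inbox.user_id
              AND devices.device_id = device_inbox.device_id
              AND devices.hidden = 1
          )
        """,
        (last_stream_id, upper),
    )
    conn.commit()

    return upper, upper >= max_stream_id
```

The only state carried between runs is the last stream_id reached, which is what makes this shape easy to resume.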
other ideas...
- we can probably get away with deleting all the messages for a device at once, even if it's in the 10s of thousands, though it gets really nasty given we'll be considering 100 devices on the first pass.
- maybe a two-stage loop: first find a device, then work through a range of stream_ids for that device on each run of the bg update. That gets annoyingly fiddly in terms of tracking state, though.
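For what it's worth, the two-stage idea could be sketched roughly as below (Python's sqlite3, with hypothetical names throughout, and `device_id` assumed as the device key). This variant sidesteps some of the state tracking: deleted rows disappear, so the only progress stored is the last device that has been fully drained, rather than a stream_id position. The `rowid` subquery is SQLite-specific; another database would need a different way to cap the batch.

```python
import sqlite3


def drain_next_hidden_device(conn, last_done=("", ""), batch_size=1000):
    """One run of a hypothetical two-stage background update.

    last_done is the (user_id, device_id) of the last fully drained device.
    Returns the updated progress, or None when no hidden devices remain.
    """
    last_user, last_device = last_done

    # Stage 1: find the first hidden device strictly after the last drained one.
    device = conn.execute(
        """
        SELECT user_id, device_id FROM devices
        WHERE hidden = 1
          AND (user_id > ? OR (user_id = ? AND device_id > ?))
        ORDER BY user_id, device_id
        LIMIT 1
        """,
        (last_user, last_user, last_device),
    ).fetchone()
    if device is None:
        return None

    # Stage 2: delete at most batch_size of its rows, lowest stream_ids first.
    cur = conn.execute(
        """
        DELETE FROM device_inbox
        WHERE rowid IN (
            SELECT rowid FROM device_inbox
            WHERE user_id = ? AND device_id = ?
            ORDER BY stream_id
            LIMIT ?
        )
        """,
        (device[0], device[1], batch_size),
    )
    conn.commit()

    # A short batch means the device is drained, so progress can advance;
    # otherwise the same device is picked again on the next run.
    return device if cur.rowcount < batch_size else last_done
```

A device with exactly batch_size rows costs one extra run that deletes nothing before progress moves on.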
Yeah I think you're right here, let's combine it with #11421
See #11401 for context.
By walking the devices table instead of the device_inbox one.
Here are the query details on abolivier.bzh's database:
So not ideal but looks like an improvement over its previous incarnation.