Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LL members can get out of sync with server on limited sync response #7211

Closed
ara4n opened this issue Aug 22, 2018 · 14 comments
Closed

LL members can get out of sync with server on limited sync response #7211

ara4n opened this issue Aug 22, 2018 · 14 comments

Comments

@ara4n
Copy link
Member

ara4n commented Aug 22, 2018

I invited uhoreg into a private dev room; he clearly joined, but in practice LL clients show him to still be invited. We see RRs from him though.

@bwindels
Copy link
Contributor

Can't repro in a room on localhost server with 4 joined + 1 invited members. I did notice another bug, where one of the joined members in the room forgot about the invited user. The invite event was in the room timeline, but not present in the room state nor the accumulated sync data. Seems like a Riot bug. Looking into it now, but these two issues couldn't be related you think?

@bwindels
Copy link
Contributor

bwindels commented Aug 29, 2018

UPDATE: see comment below first

The problem described above is also hard to reproduce, but I noticed one thing in the sync response after accepting an invite (so when the room appear under "join" for the first time): the state events are not complete. I don't receive the state at the point where I join. Instead I only get what seem to be the 3 first state events (m.room.create, m.room.member for room creator, m.room.power_levels). I would have expected the invite member event for another member to be in there that was invite just before me.
More concretely, the timeline looks like this, and the syncing user is Bruno3:

  • Bruno1 joined the room.
  • Bruno1 made future room history visible to all room members.
  • Bruno1 changed the room name to the happy few.
  • Bruno1 invited Bruno2.
  • Bruno1 invited Bruno5.
  • Bruno1 invited Bruno3.
  • Bruno2 joined the room.
  • Bruno3 joined the room.

Is this how it is supposed to work @ara4n ?
I tried to reproduce on develop with LL turned off but there state.events was just empty?
I thought you'd get the full room state (apart from joined members with LL turned on) after accepting a room, much like when doing an initial sync for a joined room? Maybe I'm missing something.

@bwindels
Copy link
Contributor

Creating a ticket for the issue described above, as it's probably not related to what happend with uhoreg where join event seems to have been fabricated. #7243

@ara4n
Copy link
Member Author

ara4n commented Sep 8, 2018

Right, looks like you might have found another bug with this "sync on joining is wrong".

Separately, I've reproduced the original bug - it looks to be a problem with limited syncs. To reproduce:

  • Alice turns on LL
  • Alice creates room
  • Bob joins room
  • Alice pauses her Riot
  • Bob invites Charlie
  • Charlie joins, and can even say something
  • Bob says 30 messages
  • Alice unpauses her Riot, and so gets a catchup (limited) /sync
  • Alice observes that the /sync has an empty sync block entirely.

This is very sytest-able and just a screwup in the impl, i guess; I didn't explicitly test limited syncs.

@ara4n
Copy link
Member Author

ara4n commented Sep 8, 2018

Actually, my issue isn't really synapse's fault. It's correct that the state is missing for the members who joined during the gap, given by definition if they're not in the timeline, and the current contract for LL is that you only get state for members who are in the timeline. Instead, it's an edge case that after a limited (aka gappy) sync, we would have to do a full bg resync of the members list to go and fill in the details of any changes which happened during the gap.

So, to fix this, either we:

  • Decide that gappy incremental syncs shouldn't lazy-load members (which is questionable; why are we loading members for things which aren't visible in the timeline?)
  • Re-trigger a bg members update after a gappy incremental sync. This has potential performance problems, as gappy syncs happen whenever a client is getting behind (either because of slow performance, slow network, or a hiatus in connectivity - e.g. laptop being slept). Background membership syncs are obviously quite heavy, so it seems like a bit of a pyrrhic victory to swap out an incremental state sync across the gap with a full resync because of this API decision.

I'm not sure what the right soln is, tbh. On one hand, just making synapse include the state during a gappy sync is kinda appealing as it will fix this without changes on the clients, and typically there won't be too much of a delta. On the other hand, most sophisticated clients (e.g. Riot on all platforms) only ever do initial sync at initial launch, and thereafter always do an incremental sync - so if they're offline for weeks and then relaunch, the relaunch could be delayed by needlessly pulling in tonnes of membership events.

Having written it out, i think we should optimise for the smallest possible syncs, given we might have better ways in future for pulling in the membership data selectively (rather than the very blunt hack of currently resyncing the whole membership list in the bg). In other words, responsivity is more important than bandwidth utilisation.

@ara4n
Copy link
Member Author

ara4n commented Sep 8, 2018

i've split out the other issue into #7303

@bwindels
Copy link
Contributor

bwindels commented Sep 10, 2018

@bmarty pointed me to the exact same issue on friday, suggesting we should refetch the members when receiving a gap in a timeline. I had it on my todo-list to check how riot-web handles this (e.g. if it always back-fills the gap, but that seems unlikely.)

@bwindels bwindels changed the title invited users may play badly with LL LL members can get out of sync with server on limited sync response Sep 10, 2018
@ara4n ara4n self-assigned this Sep 10, 2018
@ara4n
Copy link
Member Author

ara4n commented Sep 10, 2018

Conclusion from IRLing on this this morning is that we don't really have enough metrics on how fast/slow incremental sync acts in the wild in order to make an educated decision on how to address this.
So:

  • Matthew to add metrics on synapse to track how many sync responses are limited, and how many membership events they contain. I expect this to look like "incr sync for 40 rooms, 30 gaps, sent 243054 member events" in the synapse logs

meanwhile, we want to see just how slow incremental syncs are (in terms of the server processing & network bandwidth) time on the clients in general, for which we don't have metrics
i suggest we track on clients:

  • time taken for initial /sync and incr /sync to return
  • how many membership events there are, and how many rooms were limited?
  • how long it takes to process the /sync result
  • how big the /sync result is in bytes (compressed & uncompressed ideally)

in the short term, I'll disable LL on incremental syncs entirely, and so fix the bug like that.

and then once we've gathered metrics from both server & clients, we can decide what the correct solution actually based on the data, where are options are:

  • sync /members every time there's a gappy sync
  • give up on LL for incr sync (i.e. keep the status quo)
  • pass a since= param to /members (or use /sync) to keep the members list updated.

@ara4n
Copy link
Member Author

ara4n commented Sep 10, 2018

(i guess another option could be to ditch doing the bg /members and just pull in the profile data when needed. however, this doesn't play well with nick autocomplete, offline support, rapidly filtering the membership list and other places where you might need the full membership list)

ara4n added a commit to matrix-org/sytest that referenced this issue Sep 10, 2018
ara4n added a commit to matrix-org/synapse that referenced this issue Sep 10, 2018
ara4n added a commit to matrix-org/synapse that referenced this issue Sep 11, 2018
@lampholder lampholder added this to the RW013 milestone Sep 11, 2018
@bwindels
Copy link
Contributor

Remaining action point for riot-web is to add some tracking as described below:

so:
Matthew to add metrics on synapse to track how many sync responses are limited, and how many membership events they contain. I expect this to look like "incr sync for 40 rooms, 30 gaps, sent 243054 member events" in the synapse logs

meanwhile, we want to see just how slow incremental syncs are (in terms of the server processing & network bandwidth) time on the clients in general, for which we don't have metrics
i suggest we track on clients:

  • time taken for initial /sync and incr /sync to return
  • how many membership events there are, and how many rooms were limited?
  • how long it takes to process the /sync result

in the short term, I'll disable LL on incremental syncs entirely, and so fix the bug like that
and then once we've gathered metrics from both server & clients, we can decide what the correct solution actually based on the data, where are options are:

  • sync /members every time there's a gappy sync
  • give up on LL for incre sync
  • pass a since= param to /members (or use /sync) to keep the members list updated.

does that match everyone's understanding from the call just then?

@bwindels
Copy link
Contributor

Split of the above comment into #7327 as this is tracking the synapse work which is done.

@ara4n
Copy link
Member Author

ara4n commented Sep 24, 2018

This happened

@richvdh
Copy link
Member

richvdh commented Feb 23, 2024

For the record, this bug in Element was worked around by disabling LL for gappy syncs in Synapse (matrix-org/synapse#3840).

In other words: it's still a bug in Element, but it gets away with it because Synapse doesn't implement this particular bit of the spec. Other spec-compliant implementations could cause the bug to resurface. Indeed, there are reports of buggy behavior with Conduit which could be caused by such bugs.

@richvdh
Copy link
Member

richvdh commented Apr 4, 2024

For the record, this bug in Element was worked around by disabling LL for gappy syncs in Synapse (matrix-org/synapse#3840).

In other words: it's still a bug in Element, but it gets away with it because Synapse doesn't implement this particular bit of the spec. Other spec-compliant implementations could cause the bug to resurface. Indeed, there are reports of buggy behavior with Conduit which could be caused by such bugs.

I have filed #27285 about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants