Initial sync may never complete if you have insufficient bandwidth to download the response before JS SDK knifes it #2737
As mentioned elsewhere, I believe the vast majority of an initial sync response consists of state events, and almost all of those are membership events. In my mind, a not-too-complex solution could be:
The last point, for example with riot-web, means you would already have the room list and unread counts after the initial sync, which should take care of the room list. As a room is focused, one needs to backfill it with enough messages and state events to fill the screen. This could be done by sending requests to the per-room messages and state endpoints to obtain only the data that would actually be visible; /messages is paginated, /state is not. The member list would also need to be populated: there is a room members endpoint, but it is not paginated. Then one needs to fetch the avatars for the members visible in the member list and for the members whose messages are visible in the room. I've probably missed some bits and pieces, but I think that's about the minimum you could do. Incremental sync should be used where possible. It may be simpler to implement and support this via pagination, but currently, as mentioned, state events cannot be filtered with the limit filter. These are just my initial thoughts on minimizing data; I'm not suggesting it's necessarily a sensible solution.
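To make the idea concrete, here is a rough, hypothetical sketch of what on-demand per-room fetching could look like against the client-server API. The endpoint paths (/rooms/{roomId}/messages, /state, /members) are the spec's; the homeserver URL, token, page size and helper names are purely illustrative, not anything that exists in the SDKs today.

```js
// Hypothetical helper: fetch just enough data to render one focused room.
// HOMESERVER and ACCESS_TOKEN are placeholders.
const HOMESERVER = "https://matrix.example.org";
const ACCESS_TOKEN = "...";

async function cs(path, params = {}) {
  const url = new URL(`/_matrix/client/r0${path}`, HOMESERVER);
  Object.entries(params).forEach(([k, v]) => url.searchParams.set(k, v));
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${ACCESS_TOKEN}` },
  });
  if (!res.ok) throw new Error(`${res.status} for ${path}`);
  return res.json();
}

async function loadFocusedRoom(roomId, prevBatchToken) {
  // One screenful of timeline: /messages is paginated, so ask for a small page.
  // prevBatchToken is the room's prev_batch from the (minimal) sync response.
  const messages = await cs(`/rooms/${encodeURIComponent(roomId)}/messages`, {
    from: prevBatchToken,
    dir: "b",
    limit: 20,
  });

  // Current state and the member list: neither endpoint is paginated,
  // which is exactly the limitation discussed above.
  const state = await cs(`/rooms/${encodeURIComponent(roomId)}/state`);
  const members = await cs(`/rooms/${encodeURIComponent(roomId)}/members`);

  return { messages, state, members };
}
```

Avatar URLs could then be resolved lazily, from only those m.room.member events that actually get rendered.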
olm and the riot web bundle are about 2MB on /develop as well.
If I understand correctly, from the client side your approach would look something like this:
I guess one concern I would have with this approach is that I think sync uses a single token to mark stream position(?), so you cannot exactly tell, per room, where you are in the stream. On further thought, events seem to be gathered in batches (by room_id): https://github.com/matrix-org/synapse/blob/master/synapse/storage/stream.py#L236 So it wouldn't require a lot of changes to enable specifying a limit per room.
Yes, except that
You at least know which rooms you are a member of but you don't know about other members.
Just a regular sync with a timeout and with no filter, I think. Please note that all of this is my non-expert opinion; @erikjohnston @ara4n @richvdh may have opinions or corrections on all of this.
The complex part of this kind of fragmented, asynchronous approach is going to be managing the backfill requests (for timeline and state events) cleanly on the client side. Not knowing the Matrix JS SDK, the Riot React SDK and maybe even the Riot web app in detail, I wouldn't be surprised if it required quite a lot of changes at all levels. If I have understood the separation of concerns correctly, changes could be needed in the Matrix JS SDK, or maybe the 'Structures' in the Riot React SDK, to implement the logic of dynamic backfilling, and in the Riot React SDK (in the 'Views'?) to trigger the backfilling from components when data is not present and to display something sensible (spinners and such) until the data is available. I'm not sure how much reusable code exists for this, nor how much is already done or usable with minor modification. A natural counter-argument is that it would need to be implemented in every client SDK to benefit everyone, but once it has been figured out in one SDK it should hopefully be easier to port the logic to the others. It should be possible to implement it as a backward-compatible UX improvement: you can still do a full initial sync and then go into long-polling if you want. Maybe maintaining the more complex implementation is tricky... but perhaps not? It depends on how cleanly it can be implemented.
That will filter after querying, though, which is wasteful; furthermore, the check_valid_filter function should probably be fixed to raise an error when passed invalid filter params as well. I'm not sure whether
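For reference, a sketch of what the standard /sync filter mechanism already offers, assuming nothing beyond the stock filter schema: a per-room timeline limit exists, but, as noted above, there is no honoured equivalent for state events, so all state still comes down in one go.

```js
// A /sync filter (standard filter JSON, passed via the `filter` query
// parameter either as a stored filter ID or as inline JSON).
const filter = {
  room: {
    timeline: { limit: 10 }, // caps timeline events per room
    // There is no state-event equivalent that servers honour, which is
    // the gap being discussed in this thread.
  },
};

// Illustrative initial sync using the inline form:
const syncUrl =
  "https://matrix.example.org/_matrix/client/r0/sync?filter=" +
  encodeURIComponent(JSON.stringify(filter));
```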
It should be simple if the components are written modularly; if they aren't, isn't that a design issue that should probably be fixed anyway? The ability to render a responsive, useful GUI should be a concern independent of how the data is acquired.
That is a synapse implementation detail and could potentially be improved at least for some cases.
Part of my proposed solution is not obtaining all state, but only what you need (and possibly only when you need it). If every request to an endpoint that returns state is an 'all or nothing' decision, then you don't improve on the situation with Matrix HQ, where there are a couple of MB of state. That needs to be broken up somehow.
Absolutely. But then pragmatism and how much time people have come into play; maybe things aren't in the shape the developers would like, but in as good a shape as they have had time for. Maybe everything is already awesome, I dunno. 😄 And I don't know who has time, or how much, to look at something like this. If you have time, that's great!
I literally cannot run the web-client with the current state of things, so it's kind of a show-stopper for me.
I thought I'd already commented on this, but this bug has gone seriously off track. Completely agreed that initial sync needs to be faster, either by persisting state locally (bug #121), or filtering out irrelevant stuff (matrix-org/synapse#1782), or paginating the sync response per room (#1846), or paginating state somehow (#2914). This bug is intended to be a specific one describing @aviraldg's actual issue, which is:
The process only completes if you are lucky enough for the Nth request to land just before the server starts to stream the response, assuming the response is small enough to be transferred in 129s. The solution to this is probably just to track down where the js-sdk timeout is coming from and increase it, or at least increase it for handling the data response. N.B. that loadbalancers like nginx may also time out in-flight requests with 504s, exacerbating the problem. Please let's not conflate all the other bugs together in here. Totally agreed that the sync shouldn't be so big anyway, but fixing that is for the other bugs.
The client-side timeout is controlled here: https://github.com/matrix-org/matrix-js-sdk/blob/develop/lib/sync.js#L39 (it is added to the 30-second /sync long-poll timeout).
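For anyone wanting to experiment, this is roughly the shape of the code at the linked line and how the "20 minute timeout" experiments later in this thread were done. The constant name and the exact arithmetic are assumptions about that revision, not a confirmed reading of it.

```js
// lib/sync.js (matrix-js-sdk) -- assumed shape of the linked line: this buffer
// is added to the ~30s /sync long-poll timeout to give the local HTTP timeout,
// after which the SDK aborts the in-flight request even if it is still downloading.
var BUFFER_PERIOD_MS = 80 * 1000;

// Quick local workaround tried later in this thread: make the buffer huge
// (e.g. 20 minutes) so a slow download is not knifed mid-transfer.
// var BUFFER_PERIOD_MS = 20 * 60 * 1000;
```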
@pik: are you saying that your problem is the same as @aviraldg's, i.e. the sync takes ages on bad connections (e.g. 20 minutes) and keeps looping round as per my description in #2737 (comment), or just that several minutes is too long to wait for a /sync on a busy account? If the former, can you try increasing the timeout that @richvdh linked to and see if it helps? @aviraldg likewise?
@ara4n @richvdh is correct that removing the timeout will (at least as long as the request stays open) allow very bad connections to eventually sync and load riot-web. That said, I would still consider it an open issue that it takes 20 minutes to display a useful UI on a slow connection, since the riot-web initial /sync is so large.
@pik yes, I totally agree that Riot should lazy-load data. That is not this bug; it is #121, #1846, matrix-org/synapse#1782, and #2914. If the js-sdk timeout is sidestepped, /sync should take worst-case a few minutes on a very busy account. For instance, my account has 1000 rooms or so; it takes about 90 seconds for matrix.org to generate the sync response, and then another ~10s to download a few megabytes of gzipped JSON. Where is your figure of 20 minutes coming from? I am trying to work out whether you are seeing a different performance regression too, and whether my theory of fixing this bug by changing that timeout is correct or not. Please do not use this bug to discuss lazy-loading /sync data.
I think that's purely from the terrible connection and not any other kind of regression.
I'm entirely failing to initial sync right now - the requests are failing with 504 and seemingly never succeeding.
example logs of this failure mode (on relatively fast ski chalet wifi):
I just managed to log in after a mere 10 minutes of retrying after upping the timeout in js-sdk to 20 minutes. Unsure if this is coincidence or not, or if I just got lucky with the 504 timing:
It just took me 11 minutes to log in, using the 20 min JS timeout.
...actually, fixing the timeout is easy and probably still p1 |
Instead of just using a timeout to reject ongoing requests, reset the timeout when progress is observed (at least when requests are done from browsers). This is to fix element-hq/element-web#2737
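A minimal sketch of the idea in that commit message, assuming a plain browser XMLHttpRequest rather than the js-sdk's actual request plumbing: instead of one fixed deadline for the whole request, an inactivity timer is reset every time a progress event arrives, so a slow but steadily downloading /sync response is never aborted mid-transfer.

```js
// Hypothetical illustration of "reset the timeout when progress is observed".
// The URL and the 130s figure are placeholders; this is not the js-sdk code.
function requestWithIdleTimeout(url, idleMs, onDone, onError) {
  var xhr = new XMLHttpRequest();
  var timer = null;

  function resetTimer() {
    if (timer) clearTimeout(timer);
    timer = setTimeout(function () {
      xhr.abort(); // only give up if *nothing* has arrived for idleMs
      onError(new Error("request idle for " + idleMs + "ms"));
    }, idleMs);
  }

  xhr.open("GET", url);
  xhr.onprogress = resetTimer; // bytes still trickling in: keep waiting
  xhr.onload = function () {
    clearTimeout(timer);
    onDone(xhr.responseText);
  };
  xhr.onerror = function () {
    clearTimeout(timer);
    onError(new Error("network error"));
  };
  resetTimer(); // covers the wait before the first byte arrives
  xhr.send();
}

// e.g. requestWithIdleTimeout("/_matrix/client/r0/sync?...", 130 * 1000, onDone, onError);
```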
@aviraldg has complained about this many times, but we don't seem to have a dedicated bug. I was just bitten badly by it - it took me about 10 mins to eventually download my initial /sync on 3G, not because the request was 504ing but because the JS SDK was getting bored and knifing the request mid-download.
The logs look like this:
The timeout looks to be 129s or so.