This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Sync is timing out #12971

Closed
chagai95 opened this issue Jun 6, 2022 · 37 comments
Labels
A-Sync defects related to /sync

Comments

@chagai95
Contributor

chagai95 commented Jun 6, 2022

Description

I cannot use new devices: after the login succeeds, the sync times out.

Steps to reproduce

  • Try to log in with my user, which is in 1500 rooms.

Version information

  • Homeserver:

If not matrix.org:
mx.chagai.website

  • Version:
    1.59.1
  • Install method:

Docker Ansible playbook

  • Platform:

CentOS

@chagai95
Contributor Author

chagai95 commented Jun 6, 2022

Jun 06 18:09:56 localhost.localdomain matrix-synapse[5270]: 2022-06-06 16:09:56,567 - synapse.access.http.8008 - 427 - INFO - GET-1248 - 194.230.147.38 - 8008 - {@me:chagai.website} Processed request: 2.172sec/0.004sec (0.021sec, 0.002sec) (0.352sec/0.002sec/2) 309B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s2073217_17880630_0_1202172_224570_925_43437_12689616_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36" [0 dbevts]

Is the timeout variable set from Element? Maybe it needs to be higher?
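For context, the timeout=30000 in the request above is the client-supplied long-poll timeout on /sync, in milliseconds: it tells the server how long to hold an incremental sync open while waiting for new events, and it does not bound how long an initial sync takes to compute. Below is a minimal sketch of that long-poll loop in Python, with a hypothetical homeserver URL and access token (not taken from this issue):

import requests

HOMESERVER = "https://example.org"    # hypothetical
ACCESS_TOKEN = "syt_example_token"    # hypothetical

def sync_once(since=None, timeout_ms=30000):
    # `timeout` is the server-side long-poll timeout, in milliseconds.
    params = {"timeout": timeout_ms}
    if since is not None:
        params["since"] = since
    resp = requests.get(
        f"{HOMESERVER}/_matrix/client/r0/sync",
        params=params,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        # The client-side HTTP timeout has to outlive the long poll,
        # otherwise the client gives up before the server answers.
        timeout=timeout_ms / 1000 + 60,
    )
    resp.raise_for_status()
    return resp.json()

# The first call (no `since` token) is the initial sync, which on a
# 1500-room account can take far longer than any of these timeouts.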

@H-Shay
Contributor

H-Shay commented Jun 6, 2022

I am curious as to what you are seeing that indicates the sync is timing out?

@chagai95
Contributor Author

chagai95 commented Jun 6, 2022

I now understand that it's normal for the request to time out while Synapse keeps preparing the response in the background, waiting for the next sync request to arrive once it's ready. The problem is that this doesn't seem to be happening for me. I think I've just hit my hardware limits, even though I have 8GB RAM and 4 cores; my hardware is old, so I guess I need to get better hardware....

@H-Shay
Contributor

H-Shay commented Jun 6, 2022

I'm asking because in the logs I see /sync requests that look like they are being successfully processed, so I am wondering what you are actually observing re: timeout. Does the client throw an error message saying that the sync times out? Does the spinner just spin endlessly? etc

@H-Shay H-Shay added A-Sync defects related to /sync X-Needs-Info This issue is blocked awaiting information from the reporter labels Jun 6, 2022
@chagai95
Contributor Author

chagai95 commented Jun 6, 2022

The spinner spins endlessly, and I can see 504s in the network tab - I'll send them on Element.

@reivilibre
Contributor

Please could you provide some logs, at INFO or DEBUG (preferably DEBUG), for multiple syncs over a long enough period of time (let's say >10 minutes for the sake of it)?
If you could include the associated reverse proxy logs for the same timespan, that may possibly be useful.

Initial sync is notoriously slow. That won't realistically be improved dramatically until sliding sync is supported — maybe some cache tuning or something could help a bit, but at 1500 rooms I wouldn't say it's unusual for it to time out a few times.

It's expected (but unfortunate) behaviour that the initial sync will time out multiple times, but the progress isn't lost and so eventually the client will retry and succeed.

@chagai95
Contributor Author

chagai95 commented Jun 9, 2022

Hey, thanks! Could I send you the logs on Matrix? I think I might have to reproduce the issue to get the reverse proxy logs, but I do have a few hours of Synapse debug logs. Also, I've been trying for a few hours; is that normal? Would moving to better hardware perhaps solve this issue? I tried tuning the caches as well; I tried 2.0 and 3.0 for the two room-specific caches (I don't remember the exact names).

@reivilibre
Contributor

It would be nice to share some version of the logs (with personal info redacted if needed) in the issue so that others are able to benefit from the information (both other people investigating as well as other people who may have the problem in the future!).
I know it's a faff to redact the logs, but it really does help with keeping the context of the issue in the issue. If this is really impossible for some reason then you can get in touch on Matrix, but ideally it would be possible to make a few replacements.
Maybe it would be good to at least include a relevant excerpt in the issue, but you could send the rest to me if that's easier: https://matrix.to/#/@reivilibre.element:librepush.net.

If you've been initial syncing for hours then I don't think that's expected and I don't think I'd suspect a hardware problem straight away. Let's see what the logs say. If you don't have the reverse proxy logs then let's just look at Synapse's logs to start with.

@reivilibre
Contributor

Oops. I didn't realise this wasn't configured by default, but there's an option for /sync's response cache, and it defaults to 0(!).
I think you want to set sync_response_cache_duration, probably to at least 2m. That way, your initial sync will eventually make some forward progress :-).
See https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#caching.

Let me know if this helps?
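For reference, a minimal sketch of what that could look like in homeserver.yaml, assuming a reasonably recent Synapse (the linked config documentation is authoritative):

caches:
  # How long /sync responses (including still-in-progress ones) are kept in
  # the response cache, so a retried initial sync can reuse earlier work
  # instead of starting over. Per the comment above, this defaults to 0.
  sync_response_cache_duration: 2m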

@chagai95
Contributor Author

Thx! I just tried reproducing with 2m and I got the logs (proxy logs and console logs as well); I'll remove the personal stuff and upload them shortly. I also tried with 20m and it's still failing, so I'll try an hour or so and see what happens. Thanks for pointing me to that config!

@chagai95
Contributor Author

I can't figure out what I should hide, and the file is 10,000 lines long, so I'm just going to upload it as-is and hope for the best:
console.log
nginx.log
synapse.log
network.zip

@chagai95
Contributor Author

I managed to log in, thx!! I set the variable to 200m, and after about 30-40 minutes I was in. I also turned off presence, so I'll be turning that back on and trying again.

@reivilibre reivilibre reopened this Jun 14, 2022
@reivilibre
Contributor

That sounds excessively long, so I'll reopen for now just so that we can have a look to see if anything is awry.

@reivilibre reivilibre removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Jun 14, 2022
@chagai95
Contributor Author

chagai95 commented Jun 14, 2022

The slowness might be down to my very slow hardware...

@chagai95
Contributor Author

I tried again today from the web and didn't manage to log in. Should there be any difference between web and Android? I also turned presence back on, so maybe it's related to that? I don't think I changed anything else, and I didn't get debug logs this time, but I can try again. Or should I try again without presence?
network-from-chrome-2hours.zip

@erikjohnston
Member

Looking at the nginx logs, it looks like some /sync requests just never complete, e.g. GET-292 in the Synapse logs.

@squahtx is it possible that the /sync (or its response cache) is getting cancelled when the client connection drops? /sync doesn't look like it's annotated as cancellable?

@squahtx
Contributor

squahtx commented Jun 15, 2022

@squahtx is it possible that the /sync (or its response cache) is getting cancelled when the client connection drops? /sync doesn't look like it's annotated as cancellable?

Don't think so, we decided to not cancel /sync because of the caching.
Cancelled requests should show up with a status of 499! in the Synapse logs, not sure what nginx logs them as.

@chagai95
Contributor Author

I somehow lost the web console logs, but it tried more than 80 times, and I have the network info from Chrome zipped:
network-from-chrome-2hours.zip

@erikjohnston
Member

@squahtx is it possible that the /sync (or its response cache) is getting cancelled when the client connection drops? /sync doesn't look like it's annotated as cancellable?

Don't think so, we decided to not cancel /sync because of the caching. Cancelled requests should show up with a status of 499! in the Synapse logs, not sure what nginx logs them as.

Actually, GET-292 says using incomplete cached result for ..., except I don't see any previous request for that key?!

@chagai95
Contributor Author

I tried without presence, on Android this time, and it still doesn't seem to work (experiment-spaceSwitching).

@yuessir

yuessir commented Jun 18, 2022

same problem duplicating is 0, but still stuck in sync

@chagai95
Contributor Author

chagai95 commented Jun 18, 2022

I tried again from Android and sent logs from Android mentioning this ticket. I set debug logging for Synapse and nginx; here are the logs and a screen capture of me doing it:

https://aarenet-my.sharepoint.com/:v:/p/chagai_friedlander/EcQ2T9MI8ydCir9iexg9a_8BCWx4H98-htyh2FVAPyIiKA

nginx-6-18.log
synapse-6-18.log

@anoadragon453
Member

synapse-6-18.log

Hm, there's an instance in these logs where we're making use of a cached sync response:

Jun 18 12:27:36 localhost.localdomain matrix-synapse[25451]: 2022-06-18 10:27:36,080 - synapse.util.caches.response_cache - 256 - INFO - GET-18295 - [sync]: using incomplete cached result for [(@me:chagai.website, 0, 's2107773_17943238_70_1221115_232558_925_44077_12773301_1', '0', False, 'AGRYGTQDBH')]

However, grepping for GET-18295, there are no further log lines for that request. Is this expected?

@chagai95 I assume this Synapse instance does not have any workers other than the main process?
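For anyone retracing this check, the request ID can be followed through the attached log with a plain grep (filename as attached above):

grep "GET-18295" synapse-6-18.log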

@chagai95
Contributor Author

Yes I'm not using any workers.

@squahtx
Contributor

squahtx commented Jul 12, 2022

nginx-6-18.log
synapse-6-18.log

I am having trouble making sense of these logs.
nginx-6-18.log has lots of "epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream" log lines for /sync, while synapse-6-18.log has no "Connection from client lost before response was sent" log lines at all, which suggests that the /sync requests that time out aren't even hitting Synapse.

Yes I'm not using any workers.

Can you explain your set up some more? Synapse requests are being routed to two different upstreams in nginx-6-18.log: http://172.18.0.9:8008 and http://172.18.0.11:12080

@squahtx squahtx added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 12, 2022
@chagai95
Contributor Author

I'm using the docker Ansible setup, are you familiar with that?

@squahtx
Contributor

squahtx commented Jul 13, 2022

I'm using the docker Ansible setup, are you familiar with that?

Is it this one? https://github.com/spantaleev/matrix-docker-ansible-deploy

@squahtx squahtx removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 13, 2022
@chagai95
Contributor Author

Yes

@squahtx
Contributor

squahtx commented Jul 18, 2022

Can you explain your set up some more? Synapse requests are being routed to two different upstreams in nginx-6-18.log: http://172.18.0.9:8008 and http://172.18.0.11:12080

http://172.18.0.11:12080 turns out to also be the nginx reverse proxy. nginx proxies requests to itself, which then proxies requests on to Synapse. So the log lines for timed out requests come in pairs:


Jun 18 12:18:01 localhost.localdomain matrix-nginx-proxy[30618]: 2022/06/18 10:18:01 [info] 23#23: *39851 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 192.168.0.1, server: mx.chagai.website, request: "GET /_matrix/client/r0/sync?filter=%7B%22room%22%3A%7B%22state%22%3A%7B%22lazy_load_members%22%3Atrue%7D%7D%7D&set_presence=offline&timeout=0 HTTP/1.1", upstream: "http://172.18.0.11:12080/_matrix/client/r0/sync?filter=%7B%22room%22%3A%7B%22state%22%3A%7B%22lazy_load_members%22%3Atrue%7D%7D%7D&set_presence=offline&timeout=0", host: "mx.chagai.website"
Jun 18 12:18:01 localhost.localdomain matrix-nginx-proxy[30618]: 2022/06/18 10:18:01 [info] 22#22: *39856 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 172.18.0.11, server: matrix-nginx-proxy, request: "GET /_matrix/client/r0/sync?filter=%7B%22room%22%3A%7B%22state%22%3A%7B%22lazy_load_members%22%3Atrue%7D%7D%7D&set_presence=offline&timeout=0 HTTP/1.0", upstream: "http://172.18.0.9:8008/_matrix/client/r0/sync?filter=%7B%22room%22%3A%7B%22state%22%3A%7B%22lazy_load_members%22%3Atrue%7D%7D%7D&set_presence=offline&timeout=0", host: "mx.chagai.website"

synapse-6-18.log has no Connection from client lost before response was sent log lines at all, which suggests that the /sync requests that time out aren't even hitting Synapse.

Actually nginx might just be leaving connections open even though clients have disconnected.
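To make that chain concrete, here is a purely illustrative sketch of the two-hop proxying implied by those upstreams; the real configuration is generated by the playbook and will differ:

# Illustrative only: not the playbook's generated config.
server {
    listen 443 ssl;                    # certificate directives omitted in this sketch
    server_name mx.chagai.website;

    location /_matrix/ {
        proxy_pass http://172.18.0.11:12080;   # first hop: matrix-nginx-proxy itself
    }
}

server {
    listen 12080;
    server_name matrix-nginx-proxy;

    location /_matrix/ {
        proxy_pass http://172.18.0.9:8008;     # second hop: Synapse
        proxy_read_timeout 120s;               # long-running /sync requests must outlive this on both hops
    }
}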

@squahtx
Contributor

squahtx commented Jul 18, 2022

nginx-6-18.log
synapse-6-18.log

There are a couple of long-running sync requests in there that don't finish by the end of the Synapse logs.
e.g. GET-17130 is first seen at 2022-06-18 10:15:04,776 and last seen at 2022-06-18 10:41:46,954.

The timestamps in the logs also show frequent unexplained 20-30 second gaps where Synapse appears to be doing nothing. On average, there are only log lines for 10 seconds out of every minute. The pauses are probably the cause of the long sync times.

What does the CPU usage of Synapse look like? Are there any CPU quotas or memory limits applied to Synapse? Is the host out of memory?

@squahtx squahtx added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 18, 2022
@chagai95
Contributor Author

I am not sure how to check that, but yeah I think the CPU might be a bit overloaded

@squahtx
Contributor

squahtx commented Jul 18, 2022

You could run top on the host and check what the first 5 lines say, e.g.:

top - 18:11:35 up 88 days, 12 min, 10 users,  load average: 0.78, 1.07, 1.00
Tasks: 491 total,   1 running, 489 sleeping,   0 stopped,   1 zombie
%Cpu(s):  3.8 us,  2.4 sy,  0.2 ni, 93.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  15747.0 total,    951.6 free,   7958.3 used,   6837.1 buff/cache
MiB Swap:    976.0 total,      0.0 free,    976.0 used.   7094.9 avail Mem

If the id (idle) value on the %Cpu(s) line is close to 0, the CPU is probably overloaded; likewise, if avail Mem is close to 0, the host is running out of memory.
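Since Synapse runs in a container in this deployment, a per-container view can also help; the container name below is an assumption based on the playbook's usual naming:

docker stats --no-stream matrix-synapse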

@chagai95
Contributor Author

I'd have to test this again soon and do that, thx!

@chagai95
Contributor Author

Hey, so I switched to an SSD, and I only tested after that, so I can't compare. The idle value seems to jump straight to zero when I try logging in, but then it goes up to numbers between 1 and 10, and sometimes very briefly back near 0 (0.1-0.9).

Thx for helping me check that!

avail Mem does go down a bit but I think there is more than enough left...

@chagai95
Contributor Author

So the memory is down to 300,000 out of 2,400,000
That was pretty quick...

@chagai95
Contributor Author

I'm logged in! So I guess it was the slow HDD after all. Sorry for all the mess, and thx for helping me with this! Should I close the ticket, or would you like to use it for other stuff?

P.S. I'm down to barely 100,000 avail Mem; hope that's not too bad...

@H-Shay H-Shay removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Aug 9, 2022
@H-Shay H-Shay closed this as completed Aug 9, 2022