Riot sometimes slowly leaks heap (was 'reliably hangs after 24h') #2951
I switched back from Electron for my main account in order to work around #2737, and was vaguely interested that the behaviour there is to 'Aw snap' when trying to catch up overnight. I've never seen an 'Aw snap' in Electron, so I suspect this is another way this problem manifests.
Saw this again, now on normal vector-web. On unsleeping, Chrome had become spectacularly unresponsive, as if swapping - taking ~5s to respond on the JS console. Trying to do a heap dump caused it to 'Aw snap' at 17%. The stacktrace, fwiw, was:
before it crashed i got some more datapoints: it was trying to insert a timeline event for linux masterrace; the room had 11 timelines with 20 to 700 events each. the room list looked normal. my guess is that the heap was huge (proc was 1.7GB rsz) and chrome had hit some nasty v8 failure mode or deliberate throttling. @kegsay: is there any chance of pruning old events from ram whilst you are doing the indexeddb stuff?
@ara4n : I can look at it afterwards if you would like, but this isn't a "kill 2 birds with one stone" scenario: the indexeddb stuff is far removed from pruning events from the timeline.
this is really looking like the heap is exploding - i finally caught it almost-dead-but-not-quite and got a heapdump out of it (took about 20 mins to generate):
and also 93% of time in GC: Prior to this, the stack traces were showing 80% of time going into 'program', with the remainder going into browser-request callbacks, whilst doing E2E device syncing voodoo - possibly related to: #3127 and #3158. It seems fairly clear the problem here is the size of the heap though, and the fact that v8 becomes unstable somehow and spends its life GCing.
next step is for me to dig into that heap dump.
I think this is correct, but not obviously so. I regularly have large heaps (>4GB) on the IRC bridge which is using V8 under the hood, so the raw size of the heap is not the problem. I have seen many occasions where the V8 GC goes berserk and basically consumes all CPU preventing any JS from running, but this only happens when it approaches the limit of its heap size (which by default is ~1.5GB). I suspect you're hitting this failure mode (which is the exact same failure mode we hit on Freenode when we reach our limit). I theorise that this is the GC desperately trying not to OOM and is aggressively trying to clear objects, so much so that it makes the program unusable. Either way, the solution is the same: don't have such a big heap.
rather splendidly the heapdump is too big to load into chrome to analyse :|
so the heap dump has 500K events in it. from random sampling, they seem to be both presence events associated with users, as well as timeline events. it's very hard to tell if there's actually a leak here (beyond not pruning history) or if this is just how busy my account really is. i can't find a programmatic way to analyse the heap like you can with objgraph in python, other than the useless https://www.npmjs.com/package/heapsnapshot-parser which barfs on a 1.5G heapdump. i'm kinda inclined to suggest pruning old history and see if it magically fixes itself.
Is there a reason to keep presence events rather than just to compute current state from it? |
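Folding presence into per-user current state, rather than retaining every presence event, would look roughly like the following sketch (hypothetical shapes - `PresenceStore` is not an actual matrix-js-sdk class):

```javascript
// Keep only the latest presence per user: memory stays O(users)
// instead of O(presence events received).
class PresenceStore {
  constructor() {
    this.current = new Map(); // userId -> latest presence content
  }

  onPresenceEvent(event) {
    // Overwrite rather than append; older presence for this user is
    // dropped and becomes garbage-collectable.
    this.current.set(event.sender, event.content);
  }

  getPresence(userId) {
    return this.current.get(userId);
  }
}
```

On a busy account, where presence churn dominates, this bounds that part of the heap by the size of the user list rather than the length of the session.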
Something seems to have happened to have improved things a bunch. I just measured my Riot.app (0.9.8) after precisely 24h of use, and it "only" had 517MB of heap in use: ...and so I then restarted the app and did another capture, which yielded 591MB of heap(!): So I think that the problem has somehow gone away during kegan's indexeddb work, but I'm not entirely sure how or where. @kegsay: did you do something deliberate to fix the leak? @richvdh, did you spot anything when investigating earlier? I'm deprioritising in the meanwhile...
given our event count on restart dropped from 278,672 to 274,553 (so 6MB retained size) it does look like there may still be a slow leak, but at 6MB per day that's not the end of the world. I'll monitor it. |
Yes, matrix-org/matrix-js-sdk#395, which drops old timelines when we get a limited sync (typically after suspending). |
I did stuff a while back like matrix-org/matrix-js-sdk#395 which will help for any long sleep periods which I guess you were hitting if:
is anything to go by. |
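The "drop old timelines on a limited sync" behaviour from matrix-org/matrix-js-sdk#395 can be sketched as follows (hypothetical shapes, not the sdk's real internals - `onSyncTimeline`, `room.events`, and `timelineChunk` are illustrative names):

```javascript
// When the server flags a sync response as `limited`, it skipped events,
// so the locally held timeline no longer joins up with the new chunk.
// Discarding the old events instead of retaining them is what caps heap
// growth across long sleep periods.
function onSyncTimeline(room, timelineChunk) {
  if (timelineChunk.limited) {
    room.events = []; // old events become garbage-collectable
  }
  room.events.push(...timelineChunk.events);
}
```

After a long suspend, almost every busy room comes back `limited`, which is why this helps precisely the overnight-catchup case described above.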
My Riot is starting to hang again, on my dev machine, after a day or two. I'll try and keep a closer eye on it... I restarted it at 9.45 this morning.
matrix-org/matrix-js-sdk#395 fixed the worst of this, though we still have other problems as per #3307. |
Pretty much every time i unsleep my laptop in the morning, riot-web on electron has wedged, chewing 100% CPU and being entirely unresponsive - not redrawing the window, and not letting me switch to developer tools (even if i left them open). Menus work and there's no beachball of doom however.
This time I'd left dev tools open: the last logs from the app were at 16:05 (laptop spontaneously waking up, probably) and then:
...and then wedge solid. Inspector itself is responsive, however, and let me pause the VM in the debugger (but not query the DOM). The debugger gave a stacktrace of:
...i.e. tightlooping drawing the UI whilst adding events to timeline; this feels very similar to the deadlocks seen in the past in #2020.
The event it's trying to add to the room is:
(can't dump it from the debugger console itself for some reason).
Trying to profile then crashed the debugger entirely.