
Riot/Web locks Chrome 80 solid for ~40s with ~12 cores of kernel #12417

Closed · ara4n opened this issue Feb 19, 2020 · 27 comments · Fixed by matrix-org/matrix-react-sdk#4111


ara4n commented Feb 19, 2020

Every hour or so I'm getting spectacular deadlocks on my account, which beachball Chrome 80 on Catalina for 30-40s at a time. The JS heap is ~300MB, but it looks like something may be leaking sprites? This began ~1 week ago.

Hitting the box with System Trace in Instruments (given Catalina has broken dtrace...) shows that Chrome is thrashing away doing dispatches which turn into kevents for managing a CoreImage cache?

[Screenshot: 2020-02-13 at 16:54:04]

[Screenshot: 2020-02-18 at 19:48:30]


ara4n commented Feb 20, 2020

Looks like a DOM leak - I have 127,134 detached div elements hanging around.

They look like:
[image]

I took a (1.5GB) heap dump.

The unreleased facebook/react#17666 looks like it might be related, so I'm trying react@next.


ara4n commented Feb 22, 2020

The multi-minute pauses were driving me nuts, so I went on a leak mission with help from @t3chguy and @jryans. Conclusions are:

We have leaks in:

  • selectionchange in BasicMessageComposer (which leaks the whole DOM tree whenever you change rooms, due to a race between focus/blur) - see the sketch after this list
  • pillifyLinks (which leaks the whole DOM tree whenever you render a pill in the timeline, via BaseAvatar's onClientSync for the avatar in the pill)
  • Velociraptor (for RRs) (perhaps, although it might just be for RRs in flight)
  • ...and possibly everywhere we use a rate-limited function or a timed promise (at least until the timeout fires, although this might have been me failing to GC). I specifically noticed _updateE2eStatus in RoomTile and the lazy-loading _membersPromise in js-sdk.
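
A minimal sketch of the selectionchange failure mode, assuming a React class component; the component name and internals here are illustrative rather than the actual matrix-react-sdk code (the real fix landed in matrix-org/matrix-react-sdk#4111):

    import React from "react";

    // Sketch only: `selectionchange` fires on `document`, so the listener
    // outlives the component unless it's removed with the *same* function
    // reference on unmount. A leaked listener retains `this`, which in turn
    // retains the whole (now detached) DOM tree.
    class ComposerSketch extends React.Component {
        componentDidMount() {
            document.addEventListener("selectionchange", this.onSelectionChange);
        }

        componentWillUnmount() {
            // Without this (or if a focus/blur race re-adds the listener
            // after unmount), `document` keeps the closure alive forever.
            document.removeEventListener("selectionchange", this.onSelectionChange);
        }

        // A class property keeps a stable reference, so add/remove match.
        private onSelectionChange = () => {
            const sel = document.getSelection();
            if (!sel) return;
            // ... update caret/selection state from `sel` ...
        };

        render() {
            return <div contentEditable={true} />;
        }
    }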

This looks to be a dup of #10890.

This probably deserves a writeup similar to https://blog.discordapp.com/discord-react-memory-leak-565c89763e8 to explain the methodology, so future readers can avoid the pain of working out how to drive the dev tools to find a leak. My notes are:

  • take heap snapshots using the memory profiler.
  • pick something which really shouldn't ever leak (e.g. MessageComposer or RightPanel) - not detached HTMLDivElements
  • pick something which isn't leaking much (leaking 250K detached divs is hard to track, but 8 messagecomposers is much easier to reason about)
  • spot the granularity of what's leaking by looking at other objects which leak a similar number (e.g. 8 right panels, 8 messagecomposers, etc)
  • perhaps check when the leaks happen, using the allocation timeline mode of the memory profiler.
  • use roughly consistent repro steps (view the same series of rooms each time, in this instance)
  • hit the GC button before you capture a heap (is this really needed?)
  • look at the various top level retain chains for each leaked object, not just the first one (which by default will be the shortest path to the GC root, but may not be the most informative)
  • look at the reactInternalFiber objects in the retain graph ("reveal in summary window") to figure out precisely which React nodes are retaining
    • for instance, turns out that the BaseAvatar retaining the composer is consistently the ones for @thelink:mozilla.org and @haxen:matrix.org and me.
  • look at the __reactInternalInstance of the detached DIV in the retain graph
    • its outerText is: "14:36 TheLink: Layout problems in german i18n #12475 Layout problems in german i18n · Issue Layout problems in german i18n #12475 · vector-im/riot-web - GitHubTruncated buttons and overflowing labels."
    • this tells us which div has leaked; it's the BaseAvatar in the pill. Check where this pill comes from, and discover pillifyLinks manually messing with the DOM => leak (see the sketch after this list).
  • comment out any dodgy-looking retainers in the code (e.g. promises which claim to be holding onto the object). These could be real leaks, or side-effects of the actual leak (i.e. legit stuff which would get cleaned up if the real leak wasn't leaking).
    • all our RateLimitFuncs might be leaking however, if they're not cleaned up properly on unmount?
    • likewise promises for things like LL loading?
  • when reading the retain chains, keep a close eye out for recurring bits of non-React code.
  • by elimination, find the real leak (in this instance, BasicMessageComposer's _onSelectionChange being leaked, holding onto a this which in turn retains the whole ReactDOM, due to sloppy handling of the selectionchange event listener).
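
For the pillifyLinks case, a hedged sketch of the failure mode (pillifyLinks and BaseAvatar are real names from the codebase; everything else here is illustrative): when you manually ReactDOM.render() a pill into timeline DOM, React never finds out when that DOM leaves the document, so something has to unmount it explicitly or the pill's components keep retaining the detached tree.

    import React from "react";
    import ReactDOM from "react-dom";

    // Illustrative only - not the actual matrix-react-sdk implementation.
    const pillContainers: Element[] = [];

    function pillifyLinksSketch(nodes: Element[], Pill: React.ComponentType<{ url: string | null }>) {
        for (const node of nodes) {
            const container = document.createElement("span");
            // Hand-rolled DOM surgery: React doesn't know when `node` leaves
            // the document, so it can never clean this subtree up itself.
            ReactDOM.render(<Pill url={node.getAttribute("href")} />, container);
            node.replaceWith(container);
            pillContainers.push(container); // remember it for cleanup
        }
    }

    // Must run when the owning tile unmounts; otherwise the pill's components
    // (e.g. an avatar with a client "sync" listener) retain the detached tree.
    function unmountPillsSketch() {
        for (const c of pillContainers) ReactDOM.unmountComponentAtNode(c);
        pillContainers.length = 0;
    }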

stuff which didn't work:


ara4n commented Feb 22, 2020

@turt2live why is riot/web locking Chrome for a minute every few hours not a fire?

turt2live commented

As far as I know it's only affecting one person and isn't widespread. It doesn't sound like a drop-everything-and-fix-it issue, but it's definitely one that should be P1 for when the maintainer role is out of fires to put out.

It was also missing from the board, and presumed not to be a fire given it hadn't been tracked fully.


ara4n commented Feb 23, 2020

This should be affecting anyone with a relatively big account. I suspect it's got worse recently, either due to a Chrome upgrade (such that it now takes out the whole browser rather than just killing the tab), or possibly a recent version of React 16 causing a retained DOM tree to have a larger footprint. I suspect it's also the root cause of some of the symptoms described at #10890.

So yes, i'm afraid that crashing my browser every few hours counts as a drop everything and fix it, which is why I dropped everything and fixed it.

The answer to it being missing from the board is to go put it on the board rather than unhelpfully untag it.

turt2live commented

As someone with an updated Chrome and a large account, I can't say I've seen this as frequently as you've described.

turt2live commented

(let the record show that I failed to read key parts of information which do in fact make this a fire rather than arbitrary performance issues)


ara4n commented Feb 23, 2020

thanks.

matrix-org/matrix-react-sdk#4111 fixes the two main leaks, but leaves Velociraptor alone. This seems to result in 3 extra DOM trees being leaked, although it's not immediately clear from the snapshot that it's Velociraptor which is retaining them. Empirically, commenting out Velociraptor in RR's render method makes them go away.

Leaving this open until it's confirmed that 4111 fixes the main problem.


ara4n commented Feb 23, 2020

The Velociraptor leaks tend to look like this (when they show up), ftr:

[Screenshot: 2020-02-23 at 00:51:55]

aaronraimist commented

Worth trying to update velocity? Looks like you guys forked it in 2015 and haven't updated it since.


ara4n commented Feb 23, 2020

Hm, I thought we went back to mainline in matrix-org/matrix-react-sdk#2808.

I'm only seeing a few trees leaked from velocity atm (whereas previously they were leaking linearly over time), plus the plan as per #10890 (comment) is to remove it entirely. So I'm not sure it's worth trying to patch up velocity further right now.


ara4n commented Feb 23, 2020

Well, good news: with Velociraptor commented out, I'm fairly reliably sitting with one or two MessageComposers (with the 2nd typically being linked to some promise/debounce/ratelimit which is hanging around, or possibly Fiber's back buffer).

Bad news: with it enabled, it hovers between 4 and 6 leaked trees. Not the end of the world, but annoying.
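
Relatedly, a hedged sketch of how a pending debounce/rate-limit timer retains a component (illustrative implementation, not the real matrix-react-sdk one):

    // Illustrative only. A pending setTimeout retains its callback (and any
    // `this` the callback closes over) until it fires, so components must
    // cancel outstanding timers on unmount.
    class RateLimitedFuncSketch {
        private timerId: ReturnType<typeof setTimeout> | null = null;

        constructor(private fn: () => void, private minIntervalMs: number) {}

        call(): void {
            if (this.timerId !== null) return; // already scheduled
            this.timerId = setTimeout(() => {
                this.timerId = null;
                this.fn();
            }, this.minIntervalMs);
        }

        // Call this from componentWillUnmount; otherwise the scheduled
        // closure keeps the component (and its DOM tree) alive until the
        // timeout fires.
        cancelPendingCall(): void {
            if (this.timerId !== null) {
                clearTimeout(this.timerId);
                this.timerId = null;
            }
        }
    }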


ara4n commented Feb 24, 2020

So, I've had one instance of Chrome deadlocking since then while running riot.im/develop. Snapshotting afterwards, I unfortunately found 10 leaked DOM trees, all with a tree like this:

[Screenshot: 2020-02-24 at 22:41:40]

The line of MessageComposer which retains is the ref of:

                <SendMessageComposer
                    ref={(c) => this.messageComposerInput = c}

...but I don't think this is necessarily a leak (although I do recall something about refs leaking).
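
For context, this is general React behaviour rather than anything verified against this heap dump: a callback ref is invoked with the instance on mount and with null on unmount, so the field it assigns is cleared automatically when the child goes away. A minimal sketch (names illustrative):

    import React from "react";

    class ParentSketch extends React.Component {
        // React calls the ref callback with the element on mount and with
        // null on unmount, so this field shouldn't retain anything once the
        // child is gone - unless the parent itself is being retained.
        private composer: HTMLDivElement | null = null;

        render() {
            return <div ref={(c) => { this.composer = c; }} />;
        }
    }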

I wasn't seeing this at all on a local branch with Velociraptor commented out, so I think it may be related - but tracking it here.

I suggest we keep this open (but de-fired) to gather more datapoints on tree leaks.


ara4n commented Feb 25, 2020

Just had another; it lasted 2 minutes:

[Screenshot: 2020-02-25 at 09:39:06]

This time with 9 DOM trees leaked (a mix of velocity and the retain failure mode from the previous comment), but only 1,200 detached divs (as opposed to the 250K-odd I've seen in the past).

The Chrome task manager claims the tab is using 950MB of RAM total, and 280MB of GPU RAM - which seems relatively high but not crazy.

So, I think whatever the problem is, it's still happening (although I am very tempted to suggest we take the time to remove Velociraptor, otherwise debugging stuff like this is a nightmare).


ara4n commented Feb 25, 2020

...and again. Feels like it's happening about as often as before the memory leak fixes \o/


ara4n commented Feb 25, 2020

...and again. dmesg has:

X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
[IGFB][INFO   ] [Flip_Transaction] Disabling Render Compression
[IGFB][INFO   ] [Flip_Transaction] CCS Param update: AuxOffset:0x0 AuxStride:0x0
X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin fSMC_PLimitNotification
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateHardLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback
X86PlatformPlugin::evaluateSoftLimit NOT calling cgDispatchCallback

...which stopped once it recovered. This sounds like the Intel GPU cocking up.


ara4n commented Feb 25, 2020

Going to try resetting the SMC :|


ara4n commented Feb 28, 2020

Didn't help. 3 or 4 times a day I spend my life staring at this. I'm not even sure it's Riot/Web at this point :/

[Screenshot: 2020-02-28 at 11:40:50]


jryans commented Feb 28, 2020

What happens if you spend your day in not-Chrome as a test?


ara4n commented Feb 28, 2020

yup, already switched :)


ara4n commented Feb 28, 2020

...and it happened again, without any Riot/Web open.

I've switched back to Firefox, although will leave this open a bit longer just in case FF starts melting too.

turt2live commented

also not fixed as far as I'm aware?

turt2live reopened this Mar 6, 2020

ara4n commented Mar 23, 2020

Yes, it's not fixed - I just had it again after absentmindedly opening Riot in Chrome for the first time in a while.

(please don't close unfixed bugs...)


ara4n commented Apr 26, 2020

I've since had it a few times on other webapps without Riot being open in Chrome at all - in other words, this is some bad interaction between Chrome, Catalina, and possibly my GPU. Therefore this whole wild goose chase really looks to have nothing to do with Riot. (Meanwhile I've switched to Firefox and/or Riot Nightly when using Riot itself).

ara4n closed this as completed Apr 26, 2020

imofean commented May 20, 2020

Hello, my Mac is experiencing the same problem. Have you found a fix for it? I'm worried that it may be caused by GPU malfunction or overheating.


jryans commented May 20, 2020

> Hello, my Mac is experiencing the same problem. Have you found a fix for it? I'm worried that it may be caused by GPU malfunction or overheating.

I think it would be best for you to file a new issue with more details on what you're seeing, as it could be something entirely different, system-specific, etc.


jakehemmerle commented May 27, 2020

Almost certain it's a V8/macOS issue; running any Chromium/Electron app causes heavy power consumption and lagging, including Chrome, Chromium, Brave (all Chromium-based), and VSCode (Electron). Running macOS 10.15.4 (19E287) on a 2015 MBP.
