A possible memory leak in callbacks/root #746
Comments
It might be the flat loop in JS causing the issue, never giving values a chance to be dropped from scope. I'm not sure if async loops are optimized for that. Instead of valgrind, can you try what Neon's own unit tests do to verify this? (Line 51 in f3a96aa.)
I'd have thought that at first too, but we're currently evaluating napi-rs and seeing similar problems on their side. Recently, the same loop, after fixing all of their leaks, gives a flat response from valgrind. I also took another try with Prometheus/Docker/Grafana for monitoring.
@dherman I'm on vacation; if you could take a look that would be great, otherwise I'll look into this next week.
@pimeys Thanks for the detective work! I'm curious, were the fixes in napi-rs ones you made locally, or something they seem to have fixed in recent versions? @kjvalencik I'll see what I can figure out.
There are a few, and their architecture is quite different from Neon's, so it's hard to compare directly. But yes, the changes were in their promise handling: napi-rs/napi-rs#579. It was mostly incorrect use of pointers that led to certain things not getting dropped. I quickly scanned the Neon codebase too, but couldn't find anything similar...
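For illustration only, this is the general shape of the raw-pointer mistake being described (not the actual napi-rs code): data turned into a raw pointer is only freed if something later turns it back into an owned value.

```rust
// Generic sketch of a raw-pointer leak; names are illustrative.
fn hand_to_c(data: Vec<u8>) -> *mut Vec<u8> {
    // Ownership is given up here; if nothing later calls `Box::from_raw`
    // on this pointer, the allocation lives until the process exits.
    Box::into_raw(Box::new(data))
}

// The cleanup side must reconstruct the Box so Rust can drop it normally.
unsafe fn reclaim(ptr: *mut Vec<u8>) {
    drop(Box::from_raw(ptr));
}
```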
You can clone my test repo above and run the loop; htop already shows how node's RSS creeps up. It shouldn't do that: the correct behavior would be to rise to about 30-40 megabytes and stay there until the process is killed. Yes, we send lots of data, but all of it is something the scavenge collector should be able to clean up. What it looks like to me is that pointers still exist somewhere with data on the heap: Rust can't drop it, and the GC doesn't know about it, which drives the massive heap usage.
Confirmed there's no leak in the V8 heap with the Chrome inspector, but…
I'm very interested in reading the fix! Do you have an idea of how to test for and detect leak regressions like this? Maybe it would make sense to…
Oh, my approach so far has been cruder: I just added a console.log every 1000 iterations and ran it. KJ's hypothesis was just from inspecting the code, but it wasn't the culprit (it was only relevant for N-API 6). There aren't too many allocations involved in the event queue API, so I've been instrumenting the data structures (EventQueue, Root, ThreadsafeFunction).
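A rough sketch of what that kind of instrumentation can look like (purely illustrative, not the actual Neon patch): bump an atomic counter when a handle is created, decrement it in Drop, and print the live count alongside the iteration log.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative stand-in for one of the instrumented types (e.g. Root).
static LIVE_HANDLES: AtomicUsize = AtomicUsize::new(0);

struct Handle;

impl Handle {
    fn new() -> Self {
        LIVE_HANDLES.fetch_add(1, Ordering::Relaxed);
        Handle
    }
}

impl Drop for Handle {
    fn drop(&mut self) {
        LIVE_HANDLES.fetch_sub(1, Ordering::Relaxed);
    }
}

fn report() {
    // A count that keeps growing while the JS loop runs points at the leak.
    println!("live handles: {}", LIVE_HANDLES.load(Ordering::Relaxed));
}
```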
And those pesky…
There's at least one minor leak that was introduced by the change that fixed leaks on Root. :) But that leak is napi-6 only: there's a ManuallyDrop that used to wrap only a single field, but an Arc was added. That's not the leak here, though. There might be another leak where from_raw is being called on only an inner Box instead of the outer one, which could be this. Unfortunately, I won't have my laptop until Tuesday and there's only so much I can do reading source on my phone.

As far as automating it, I would appreciate recommendations. We have some tests to ensure JS values aren't leaked, but native code is harder. V8/Node itself is leaky and chooses to let some memory be freed by the process ending instead of freeing it explicitly; that makes it difficult to automate valgrind. It's also difficult to simply look at process memory, because V8 is greedy with allocations and doesn't often release memory back to the OS. Leaks are usually pretty obvious to spot manually; maybe adding some manual checks before releases would be useful?

tl;dr: The major leak appears to be related to EventQueue, because memoizing it in a OnceCell appears to mostly resolve the issue. It also explains why I haven't observed it before, since this usage is recommended for best performance (there's a proposal to inline this optimization for all users).
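A minimal sketch of that memoization (the queue type and constructor here are stand-ins, not Neon's exact API): the queue is created once and reused, instead of being allocated on every call.

```rust
use std::sync::OnceLock;

// Illustrative stand-in for the event queue type; the constructor is hypothetical.
struct EventQueue;

impl EventQueue {
    fn new() -> Self {
        EventQueue
    }

    fn send(&self, _work: impl FnOnce() + Send + 'static) {
        // ... schedule the closure to run on the JS thread ...
    }
}

// Created once, reused for every subsequent call.
static QUEUE: OnceLock<EventQueue> = OnceLock::new();

fn queue() -> &'static EventQueue {
    QUEUE.get_or_init(EventQueue::new)
}
```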
This example should be easier to catch automatically: you call a function 100,000 times, and at the end you should not be using hundreds of megabytes of RAM. Running a Docker image controlled by the tests, you could monitor the resident set size of that image from the test; maybe that could work? It does make the test a bit harder to implement, though. I'm asking because we at Prisma are planning to have some integration tests checking our memory usage, so I'm gathering good ideas all over the place.
Thanks @pimeys. That's a good suggestion; it's easy to catch the really bad leaks, even if it won't catch small ones. It's good not to let perfection get in the way of good. I would be interested in seeing how you automate it. A Docker container with a strict memory limit seems fairly simple.
Yeah, this is quite easy, because what N-API is, in the end, is a way to communicate between JS and Rust, and when you communicate, data moves around. Make that data bigger and you see leaks faster, or just call it often enough. There probably aren't that many cases where you'd have slow, small leaks with this library.
I found the issue and I'll get a PR up soon with a fix. The issue boils down to a misunderstanding of how N-API references work. I had interpreted that once the reference count hit zero the reference itself would be cleaned up; in fact it has to be deleted explicitly: https://nodejs.org/api/n-api.html#n_api_napi_delete_reference

tl;dr -- It's not a bug in N-API; the references simply weren't being deleted on Neon's side.

After this change, the process sits at 26MB indefinitely.
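A sketch of the corrected teardown using the documented N-API calls (the FFI declarations below are written out for illustration and are not Neon's internal bindings):

```rust
use std::os::raw::c_void;

// Opaque N-API handle types, simplified for this sketch.
type NapiEnv = *mut c_void;
type NapiRef = *mut c_void;
type NapiStatus = i32;

extern "C" {
    fn napi_reference_unref(env: NapiEnv, reference: NapiRef, result: *mut u32) -> NapiStatus;
    fn napi_delete_reference(env: NapiEnv, reference: NapiRef) -> NapiStatus;
}

// Drop one strong reference; when the count reaches zero, the napi_ref
// itself must also be deleted, otherwise the reference object leaks.
unsafe fn unref_and_maybe_delete(env: NapiEnv, reference: NapiRef) {
    let mut count: u32 = 0;
    napi_reference_unref(env, reference, &mut count);
    if count == 0 {
        // The missing step: a zero count only releases the protected value;
        // the reference must be deleted explicitly.
        napi_delete_reference(env, reference);
    }
}
```

In real code the returned status values would be checked as well; they're ignored here to keep the sketch short.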
It looks like this issue could also be in play. If Neon tracked its own reference count, it could call delete instead of unref when the count drops to zero.
Hi,
I was evaluating the new N-API codebase, and quite quickly found a badly leaking piece of code. Here's a test repo that shows the leak:
https://github.com/pimeys/hello-neon
Now, I did some digging with Valgrind:
As you can see, the culprit is in `napi_create_reference`, called from `Root::new`. You can dig into the results from the included `massif` dump! I was not able to get this data collected, even by forcing `global.gc()`, meaning we probably do something wrong on the native side. I'd expect this loop to have a steady, flat memory profile; the data should already be collected in the young space...