Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A possible memory leak in callbacks/root #746

Closed
pimeys opened this issue May 27, 2021 · 17 comments
Closed

A possible memory leak in callbacks/root #746

pimeys opened this issue May 27, 2021 · 17 comments

Comments

@pimeys
Copy link

pimeys commented May 27, 2021

Hi,

I was evaluating the new N-API codebase, and found quite fast greatly leaking piece of code. Here's the test repo that will show the leak:

https://github.com/pimeys/hello-neon

Now, I did some digging with Valgrind:
leak

As you can see the culprit is in napi_create_reference, called from Root::new. You can dig into the results from the included massif dump! I was not able to collect this data, even by forcing global.gc(), meaning we probably do something wrong in the native side.

I'd expect this loop to have a steady flat memory profile, the data should be collected in the young-space already...

@pimeys
Copy link
Author

pimeys commented May 27, 2021

n7: 6550942 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
      
      
         n1: 4627840 0x91FF79: napi_create_reference (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
          n1: 4627840 0x9E84B63: neon_runtime::napi::bindings::functions::napi1::create_reference (mod.rs:163)
      
      
           n1: 4627840 0x9E7C52D: neon_runtime::napi::reference::new (reference.rs:11)
      
      
            n1: 4627840 0x9DD8727: neon::handle::root::Root<T>::new (root.rs:65)
      
      
             n1: 4627840 0x9DDD60F: neon::object::traits::Object::root (mod.rs:238)
      
      
              n1: 4627840 0x9DCF08D: hello_neon::QueryEngine::js_query (lib.rs:60)
      
      
               n1: 4627840 0x9DCB95A: <neon::types::internal::FunctionCallback<T> as neon::types::internal::Callback<*mut neon_runtime::napi::bindings::types::Value__>>::invoke::{{closure}}::{{closure}} (internal.rs:62)
      
      
                n1: 4627840 0x9DC5E82: neon::types::error::convert_panics::{{closure}} (error.rs:86)
      
      
                 n1: 4627840 0x9DE3404: std::panicking::try::do_call (panicking.rs:379)
      
      
                  n1: 4627840 0x9DE385C: __rust_try (in /home/pimeys/code/hello-neon/index.node)
      
      
                   n1: 4627840 0x9DE2B9D: std::panicking::try (panicking.rs:343)
      
      
                    n1: 4627840 0x9DD0ADD: std::panic::catch_unwind (panic.rs:431)
      
      
                     n1: 4627840 0x9DC530C: neon::types::error::convert_panics (error.rs:86)
      
      
                      n1: 4627840 0x9DCB5F8: <neon::types::internal::FunctionCallback<T> as neon::types::internal::Callback<*mut neon_runtime::napi::bindings::types::Value__>>::invoke::{{closure}} (internal.rs:62)
      
      
                       n1: 4627840 0x9DE129E: neon::context::CallContext<T>::with::{{closure}} (mod.rs:710)
      
      
                        n1: 4627840 0x9DC954A: neon::context::internal::Scope<R>::with (internal.rs:99)
      
      
                         n1: 4627840 0x9DE0F7E: neon::context::CallContext<T>::with (mod.rs:709)
      
      
                          n1: 4627840 0x9DE042B: neon::context::CallbackInfo::with_cx (mod.rs:204)
      
      
                           n1: 4627840 0x9DCB1F8: <neon::types::internal::FunctionCallback<T> as neon::types::internal::Callback<*mut neon_runtime::napi::bindings::types::Value__>>::invoke (internal.rs:58)
      
      
                            n1: 4627840 0x917D63: v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                             n1: 4627840 0xB79761: v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                              n1: 4627840 0xB79B1C: v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                               n1: 4627840 0xB7A379: v8::internal::Builtin_Impl_HandleApiCall(v8::internal::BuiltinArguments, v8::internal::Isolate*) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                n1: 4627840 0xB7AD48: v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                 n4: 4627840 0x13EDC78: Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 3918080 0x3B1F7A804DC1: ???
      
      
                                   n1: 3918080 0x13B360F: Builtins_AsyncFunctionAwaitResolveClosure (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 3918080 0x142F90D: Builtins_PromiseFulfillReactionJob (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 3918080 0x13A6739: Builtins_RunMicrotasks (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 3918080 0x13846B7: Builtins_JSRunMicrotasksEntry (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 289840 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                   n1: 289840 0x142DEDF: Builtins_PromiseConstructor (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 289840 0x1382BE9: Builtins_JSBuiltinsConstructStub (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 289840 0x146BF42: Builtins_ConstructHandler (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 289840 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 214640 0x3B1F7A8045E6: ???
      
      
                                   n1: 214640 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 214640 0x13B360F: Builtins_AsyncFunctionAwaitResolveClosure (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 214640 0x142F90D: Builtins_PromiseFulfillReactionJob (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 214640 0x13A6739: Builtins_RunMicrotasks (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 205280 0x3B1F7A803EC6: ???
      
      
                                   n1: 205280 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 205280 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 205280 0x13B360F: Builtins_AsyncFunctionAwaitResolveClosure (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 205280 
0x142F90D: Builtins_PromiseFulfillReactionJob (in 
/nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)

@kjvalencik
Copy link
Member

It might be the flat loop in Js causing the issue, never giving a chance for values to be dropped from scope. I'm not sure if async loops are oprmizied for that.

Instead of valgrind, can you try node --inspect and the Chrome debugger to take a heap dump? That will give a clearer picture of the objects being held and what is holding the references.

Neon has a couple of unit tests to verify Root do not leak:

it('should run callback on drop', function (cb) {

@pimeys
Copy link
Author

pimeys commented May 27, 2021

I'd think that first too, but we're currently evaluating napi-rs and finding similar problems from their side too. Recently the same loop, after fixing all leaks, results a flat response from valgrind.

I also took another try with prometheus/docker/grafana and monitoring docker stats. When I run this loop with neon, the memory graph goes gigabytes and gigabytes up until I kill the process. With napi-rs, using the latest fixes, I'm staying in healthy 30 megabytes of RAM all the time.

@kjvalencik
Copy link
Member

@dherman I'm on vacation, if you could take a look that would be great, otherwise I'll look into this next week.

@dherman
Copy link
Collaborator

dherman commented May 27, 2021

@pimeys Thanks for the detective work! I'm curious, were the fixes in napi-rs ones you made locally or something they seem to have fixed in recent versions?

@kjvalencik I'll see what I can figure out.

@pimeys
Copy link
Author

pimeys commented May 27, 2021

There are a few, and their architecture are quite different compared to neon, so it's hard to compare. But yes, the changes were in their promise handling: napi-rs/napi-rs#579

This was mostly incorrect use of pointers, that lead to certain things not getting dropped. I quickly scanned the neon codebase too, but couldn't find anything similar...

@pimeys
Copy link
Author

pimeys commented May 27, 2021

You can clone my test repo above, and run the loop. Already htop shows you how the RSS of node creeps up. It should not do that, the correct behavior should be going up about 30-40 megabytes and staying there until killed.

Yes, we send lots of data. But all of this is something that the scavenge collector should be able to clean. What it looks like to me is some pointers somewhere still exist, with data in the heap. Rust cannot drop it, and the GC doesn't know about it, which triggers the massive usage of the heap.

@dherman
Copy link
Collaborator

dherman commented May 27, 2021

Confirmed there's no leak in the V8 heap with the Chrome inspector but top shows the steady process memory growth. @kjvalencik found a possible source of the leak but I have to run errands for a bit, we'll test out fixes later this afternoon. I'll keep the thread updated with our progress…

@pimeys
Copy link
Author

pimeys commented May 28, 2021

I'm very interested on reading the fix! Do you have an idea how to test and detect leak regressions such as this? Would make sense maybe to ps the resident set first, then call a function for N times and ps again, diffing and throwing if too high. You can't really trust the javascript metrics here, if v8 doesn't know about the leaked memory.

@dherman
Copy link
Collaborator

dherman commented May 28, 2021

Oh my approach so far has been cruder: I just added a console.log every 1000 iterations, ran it with node --inspect index.JS and opened the Chrome devtools and manually took heap snapshots periodically while watching the console output scroll by. The devtools showed the heap size staying around 3MB consistently in every snapshot.

KJ’s hypothesis was just from inspecting the code but it wasn’t the culprit (it was only relevant for N-API 6).

There aren’t too many allocations involved in the event queue API so I’ve been instrumenting the data structures (EventQueue, Root, ThreadsafeFunction) with Drop implementations that print when they get called, to check that they’re getting dropped the right number of times, but it hasn’t turned up any misbehavior so far. Looking at the Box allocations next...

@pimeys
Copy link
Author

pimeys commented May 28, 2021

And those pesky ManuallyDrop uses. Easy to use them wrong...

@kjvalencik
Copy link
Member

kjvalencik commented May 28, 2021

There's at least one minor leak that was introduced to fix leaks on Root. :) But, that leak is napi-6 only. There's a manuallydrop that used to be only a single field, but an Arc was added. But, that's not the leak here.

There might be another leak where only an inner box is being called from raw instead of the outer, which could be this.

Unfortunately, I won't have my laptop until Tuesday and there's only so much I can do reading source on my phone.

As far as automating it, I would appreciate recommendations. We have some tests to ensure Js values aren't leaked, but native code is harder. V8/Node itself is leaky and chooses to let some memory be freed by the process ending instead of freeing; that makes it difficult to automate valgrind.

It's also difficult to simply look at process memory because V8 is greedy on allocations and doesn't free memory back to the OS often.

It's usually pretty obvious to spot leaks manually, maybe adding some manual checks before releases would be useful?

tl;dr The major leak appears to be related to EventQueue because memorizing it in a OnceCell appears to mostly resolve the issue. It also explains why I haven't observed it before since this usage is recommended for best performance (there's a proposal to inline this optimization for all users).

@pimeys
Copy link
Author

pimeys commented May 28, 2021

This example could be easier to catch automatically. You call a function 100000 times, and in the end you should not be using hundreds of megabytes of RAM.

Running a docker image controlled by the tests, you can monitor the resident set size in that image from the test, maybe that could work? But makes the test a bit harder to implement. I'm asking these because we in Prisma are planning to have some integration tests checking our memory usage, so I'm gathering good ideas all over the place.

@kjvalencik
Copy link
Member

kjvalencik commented May 28, 2021

Thanks @pimeys. That's a good suggestion; it's easy to catch the really bad leaks, even if it won't catch small leaks. Its good not to let perfection get in the way of good.

I would be interested in seeing how you automate it. Docker container with a strict memory limit seems fairly simple.

@pimeys
Copy link
Author

pimeys commented May 28, 2021

Yeah, so this is quite easy because what napi in the end is, is a way to communicate between JS and Rust. And when you communicate, there is data that moves around. Make that data bigger, and you see leaks faster. Or call it often enough.

Probably not that many cases where you have slow small leaks with this library.

@kjvalencik
Copy link
Member

kjvalencik commented Jun 1, 2021

I found the issue and I'll get a PR up soon with a fix. The issue boils down to a misunderstanding of how N-API references work.

I had interpreted that once the reference count hit 0, they would be automatically cleaned up. However, I must have missed the critical section of that docs that they also need to be deleted. The leak wasn't showing up in the V8 heap or the Rust structs because it was a N-API struct leaking.

https://nodejs.org/api/n-api.html#n_api_napi_delete_reference

tl;dr -- It's not a bug in unsafe code. It's a missing call to napi_delete_reference to fully drop the data structure.

After this change, the process sits at 26MB indefinitely.

@kjvalencik
Copy link
Member

It looks like this issue could also be in play. If Neon tracked its own reference count, it could delete instead of the last unref.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants