A possible memory leak in callbacks/root #746

pimeys · 2021-05-27T15:37:49Z

Hi,

I was evaluating the new N-API codebase, and found quite fast greatly leaking piece of code. Here's the test repo that will show the leak:

https://github.com/pimeys/hello-neon

Now, I did some digging with Valgrind:

As you can see the culprit is in napi_create_reference, called from Root::new. You can dig into the results from the included massif dump! I was not able to collect this data, even by forcing global.gc(), meaning we probably do something wrong in the native side.

I'd expect this loop to have a steady flat memory profile, the data should be collected in the young-space already...

The text was updated successfully, but these errors were encountered:

pimeys · 2021-05-27T18:54:11Z

n7: 6550942 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
      
      
         n1: 4627840 0x91FF79: napi_create_reference (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
          n1: 4627840 0x9E84B63: neon_runtime::napi::bindings::functions::napi1::create_reference (mod.rs:163)
      
      
           n1: 4627840 0x9E7C52D: neon_runtime::napi::reference::new (reference.rs:11)
      
      
            n1: 4627840 0x9DD8727: neon::handle::root::Root&lt;T&gt;::new (root.rs:65)
      
      
             n1: 4627840 0x9DDD60F: neon::object::traits::Object::root (mod.rs:238)
      
      
              n1: 4627840 0x9DCF08D: hello_neon::QueryEngine::js_query (lib.rs:60)
      
      
               n1: 4627840 0x9DCB95A: &lt;neon::types::internal::FunctionCallback&lt;T&gt; as neon::types::internal::Callback&lt;*mut neon_runtime::napi::bindings::types::Value__&gt;&gt;::invoke::{{closure}}::{{closure}} (internal.rs:62)
      
      
                n1: 4627840 0x9DC5E82: neon::types::error::convert_panics::{{closure}} (error.rs:86)
      
      
                 n1: 4627840 0x9DE3404: std::panicking::try::do_call (panicking.rs:379)
      
      
                  n1: 4627840 0x9DE385C: __rust_try (in /home/pimeys/code/hello-neon/index.node)
      
      
                   n1: 4627840 0x9DE2B9D: std::panicking::try (panicking.rs:343)
      
      
                    n1: 4627840 0x9DD0ADD: std::panic::catch_unwind (panic.rs:431)
      
      
                     n1: 4627840 0x9DC530C: neon::types::error::convert_panics (error.rs:86)
      
      
                      n1: 4627840 0x9DCB5F8: &lt;neon::types::internal::FunctionCallback&lt;T&gt; as neon::types::internal::Callback&lt;*mut neon_runtime::napi::bindings::types::Value__&gt;&gt;::invoke::{{closure}} (internal.rs:62)
      
      
                       n1: 4627840 0x9DE129E: neon::context::CallContext&lt;T&gt;::with::{{closure}} (mod.rs:710)
      
      
                        n1: 4627840 0x9DC954A: neon::context::internal::Scope&lt;R&gt;::with (internal.rs:99)
      
      
                         n1: 4627840 0x9DE0F7E: neon::context::CallContext&lt;T&gt;::with (mod.rs:709)
      
      
                          n1: 4627840 0x9DE042B: neon::context::CallbackInfo::with_cx (mod.rs:204)
      
      
                           n1: 4627840 0x9DCB1F8: &lt;neon::types::internal::FunctionCallback&lt;T&gt; as neon::types::internal::Callback&lt;*mut neon_runtime::napi::bindings::types::Value__&gt;&gt;::invoke (internal.rs:58)
      
      
                            n1: 4627840 0x917D63: v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo&lt;v8::Value&gt; const&amp;) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                             n1: 4627840 0xB79761: v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                              n1: 4627840 0xB79B1C: v8::internal::MaybeHandle&lt;v8::internal::Object&gt; v8::internal::(anonymous namespace)::HandleApiCallHelper&lt;false&gt;(v8::internal::Isolate*, v8::internal::Handle&lt;v8::internal::HeapObject&gt;, v8::internal::Handle&lt;v8::internal::HeapObject&gt;, v8::internal::Handle&lt;v8::internal::FunctionTemplateInfo&gt;, v8::internal::Handle&lt;v8::internal::Object&gt;, v8::internal::BuiltinArguments) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                               n1: 4627840 0xB7A379: v8::internal::Builtin_Impl_HandleApiCall(v8::internal::BuiltinArguments, v8::internal::Isolate*) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                n1: 4627840 0xB7AD48: v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                 n4: 4627840 0x13EDC78: Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 3918080 0x3B1F7A804DC1: ???
      
      
                                   n1: 3918080 0x13B360F: Builtins_AsyncFunctionAwaitResolveClosure (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 3918080 0x142F90D: Builtins_PromiseFulfillReactionJob (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 3918080 0x13A6739: Builtins_RunMicrotasks (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 3918080 0x13846B7: Builtins_JSRunMicrotasksEntry (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 289840 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                   n1: 289840 0x142DEDF: Builtins_PromiseConstructor (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 289840 0x1382BE9: Builtins_JSBuiltinsConstructStub (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 289840 0x146BF42: Builtins_ConstructHandler (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 289840 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 214640 0x3B1F7A8045E6: ???
      
      
                                   n1: 214640 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 214640 0x13B360F: Builtins_AsyncFunctionAwaitResolveClosure (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 214640 0x142F90D: Builtins_PromiseFulfillReactionJob (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 214640 0x13A6739: Builtins_RunMicrotasks (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                  n1: 205280 0x3B1F7A803EC6: ???
      
      
                                   n1: 205280 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                    n1: 205280 0x1386A41: Builtins_InterpreterEntryTrampoline (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                     n1: 205280 0x13B360F: Builtins_AsyncFunctionAwaitResolveClosure (in /nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)
      
      
                                      n0: 205280 
0x142F90D: Builtins_PromiseFulfillReactionJob (in 
/nix/store/ncn3gm18lzqv4w3jrfpk5yqbw4swdz03-nodejs-slim-14.17.0/bin/node)

kjvalencik · 2021-05-27T19:57:47Z

It might be the flat loop in Js causing the issue, never giving a chance for values to be dropped from scope. I'm not sure if async loops are oprmizied for that.

Instead of valgrind, can you try node --inspect and the Chrome debugger to take a heap dump? That will give a clearer picture of the objects being held and what is holding the references.

Neon has a couple of unit tests to verify Root do not leak:

neon/test/napi/lib/threads.js

Line 51 in f3a96aa

it('should run callback on drop', function (cb) {

pimeys · 2021-05-27T20:01:48Z

I'd think that first too, but we're currently evaluating napi-rs and finding similar problems from their side too. Recently the same loop, after fixing all leaks, results a flat response from valgrind.

I also took another try with prometheus/docker/grafana and monitoring docker stats. When I run this loop with neon, the memory graph goes gigabytes and gigabytes up until I kill the process. With napi-rs, using the latest fixes, I'm staying in healthy 30 megabytes of RAM all the time.

kjvalencik · 2021-05-27T20:12:52Z

@dherman I'm on vacation, if you could take a look that would be great, otherwise I'll look into this next week.

dherman · 2021-05-27T20:14:00Z

@pimeys Thanks for the detective work! I'm curious, were the fixes in napi-rs ones you made locally or something they seem to have fixed in recent versions?

@kjvalencik I'll see what I can figure out.

pimeys · 2021-05-27T20:16:22Z

There are a few, and their architecture are quite different compared to neon, so it's hard to compare. But yes, the changes were in their promise handling: napi-rs/napi-rs#579

This was mostly incorrect use of pointers, that lead to certain things not getting dropped. I quickly scanned the neon codebase too, but couldn't find anything similar...

pimeys · 2021-05-27T20:20:08Z

You can clone my test repo above, and run the loop. Already htop shows you how the RSS of node creeps up. It should not do that, the correct behavior should be going up about 30-40 megabytes and staying there until killed.

Yes, we send lots of data. But all of this is something that the scavenge collector should be able to clean. What it looks like to me is some pointers somewhere still exist, with data in the heap. Rust cannot drop it, and the GC doesn't know about it, which triggers the massive usage of the heap.

dherman · 2021-05-27T20:51:48Z

Confirmed there's no leak in the V8 heap with the Chrome inspector but top shows the steady process memory growth. @kjvalencik found a possible source of the leak but I have to run errands for a bit, we'll test out fixes later this afternoon. I'll keep the thread updated with our progress…

pimeys · 2021-05-28T07:21:30Z

I'm very interested on reading the fix! Do you have an idea how to test and detect leak regressions such as this? Would make sense maybe to ps the resident set first, then call a function for N times and ps again, diffing and throwing if too high. You can't really trust the javascript metrics here, if v8 doesn't know about the leaked memory.

dherman · 2021-05-28T13:14:53Z

Oh my approach so far has been cruder: I just added a console.log every 1000 iterations, ran it with node --inspect index.JS and opened the Chrome devtools and manually took heap snapshots periodically while watching the console output scroll by. The devtools showed the heap size staying around 3MB consistently in every snapshot.

KJ’s hypothesis was just from inspecting the code but it wasn’t the culprit (it was only relevant for N-API 6).

There aren’t too many allocations involved in the event queue API so I’ve been instrumenting the data structures (EventQueue, Root, ThreadsafeFunction) with Drop implementations that print when they get called, to check that they’re getting dropped the right number of times, but it hasn’t turned up any misbehavior so far. Looking at the Box allocations next...

pimeys · 2021-05-28T13:16:43Z

And those pesky ManuallyDrop uses. Easy to use them wrong...

kjvalencik · 2021-05-28T14:50:07Z

There's at least one minor leak that was introduced to fix leaks on Root. :) But, that leak is napi-6 only. There's a manuallydrop that used to be only a single field, but an Arc was added. But, that's not the leak here.

There might be another leak where only an inner box is being called from raw instead of the outer, which could be this.

Unfortunately, I won't have my laptop until Tuesday and there's only so much I can do reading source on my phone.

As far as automating it, I would appreciate recommendations. We have some tests to ensure Js values aren't leaked, but native code is harder. V8/Node itself is leaky and chooses to let some memory be freed by the process ending instead of freeing; that makes it difficult to automate valgrind.

It's also difficult to simply look at process memory because V8 is greedy on allocations and doesn't free memory back to the OS often.

It's usually pretty obvious to spot leaks manually, maybe adding some manual checks before releases would be useful?

tl;dr The major leak appears to be related to EventQueue because memorizing it in a OnceCell appears to mostly resolve the issue. It also explains why I haven't observed it before since this usage is recommended for best performance (there's a proposal to inline this optimization for all users).

pimeys · 2021-05-28T15:28:40Z

This example could be easier to catch automatically. You call a function 100000 times, and in the end you should not be using hundreds of megabytes of RAM.

Running a docker image controlled by the tests, you can monitor the resident set size in that image from the test, maybe that could work? But makes the test a bit harder to implement. I'm asking these because we in Prisma are planning to have some integration tests checking our memory usage, so I'm gathering good ideas all over the place.

kjvalencik · 2021-05-28T17:04:27Z

Thanks @pimeys. That's a good suggestion; it's easy to catch the really bad leaks, even if it won't catch small leaks. Its good not to let perfection get in the way of good.

I would be interested in seeing how you automate it. Docker container with a strict memory limit seems fairly simple.

pimeys · 2021-05-28T17:06:31Z

Yeah, so this is quite easy because what napi in the end is, is a way to communicate between JS and Rust. And when you communicate, there is data that moves around. Make that data bigger, and you see leaks faster. Or call it often enough.

Probably not that many cases where you have slow small leaks with this library.

kjvalencik · 2021-06-01T16:50:09Z

I found the issue and I'll get a PR up soon with a fix. The issue boils down to a misunderstanding of how N-API references work.

I had interpreted that once the reference count hit 0, they would be automatically cleaned up. However, I must have missed the critical section of that docs that they also need to be deleted. The leak wasn't showing up in the V8 heap or the Rust structs because it was a N-API struct leaking.

https://nodejs.org/api/n-api.html#n_api_napi_delete_reference

tl;dr -- It's not a bug in unsafe code. It's a missing call to napi_delete_reference to fully drop the data structure.

After this change, the process sits at 26MB indefinitely.

Fixes #746

kjvalencik · 2021-09-10T18:44:55Z

It looks like this issue could also be in play. If Neon tracked its own reference count, it could delete instead of the last unref.

kjvalencik added a commit that referenced this issue Jun 1, 2021

fix(neon-runtime): Fix a leak from references not being deleted

fd16937

Fixes #746

kjvalencik mentioned this issue Jun 1, 2021

Memory leak in Root #750

Merged

kjvalencik closed this as completed in 8b26072 Jun 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A possible memory leak in callbacks/root #746

A possible memory leak in callbacks/root #746

pimeys commented May 27, 2021

pimeys commented May 27, 2021

kjvalencik commented May 27, 2021

pimeys commented May 27, 2021 •

edited

Loading

kjvalencik commented May 27, 2021

dherman commented May 27, 2021

pimeys commented May 27, 2021 •

edited

Loading

pimeys commented May 27, 2021

dherman commented May 27, 2021

pimeys commented May 28, 2021

dherman commented May 28, 2021

pimeys commented May 28, 2021

kjvalencik commented May 28, 2021 •

edited

Loading

pimeys commented May 28, 2021

kjvalencik commented May 28, 2021 •

edited

Loading

pimeys commented May 28, 2021

kjvalencik commented Jun 1, 2021 •

edited

Loading

kjvalencik commented Sep 10, 2021

A possible memory leak in callbacks/root #746

A possible memory leak in callbacks/root #746

Comments

pimeys commented May 27, 2021

pimeys commented May 27, 2021

kjvalencik commented May 27, 2021

pimeys commented May 27, 2021 • edited Loading

kjvalencik commented May 27, 2021

dherman commented May 27, 2021

pimeys commented May 27, 2021 • edited Loading

pimeys commented May 27, 2021

dherman commented May 27, 2021

pimeys commented May 28, 2021

dherman commented May 28, 2021

pimeys commented May 28, 2021

kjvalencik commented May 28, 2021 • edited Loading

pimeys commented May 28, 2021

kjvalencik commented May 28, 2021 • edited Loading

pimeys commented May 28, 2021

kjvalencik commented Jun 1, 2021 • edited Loading

kjvalencik commented Sep 10, 2021

pimeys commented May 27, 2021 •

edited

Loading

pimeys commented May 27, 2021 •

edited

Loading

kjvalencik commented May 28, 2021 •

edited

Loading

kjvalencik commented May 28, 2021 •

edited

Loading

kjvalencik commented Jun 1, 2021 •

edited

Loading