Is nan v2.0.0 leaking memory? #389

Closed
fivdi opened this issue Aug 1, 2015 · 36 comments

fivdi commented Aug 1, 2015

I'm in the process of migrating the epoll module from nan v1.8.4 to nan v2.0.0. Everything looks ok except for memory usage. There's an epoll test called brute-force-leak-check.js that creates millions of objects while memory usage is observed to see if everything looks ok. With Node.js v0.10.40 and nan v1.8.4, memory usage is extremely stable and everything looks good. With Node.js v0.10.40 and nan v2.0.0, memory usage keeps increasing and eventually the test terminates with:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

I can't figure out where the problem is, so I wrote the following test to see whether the problem is in the epoll code or somewhere else:

const testRoot = require('path').resolve(__dirname, '..')
  , bindings = require('bindings')({ module_root: testRoot, bindings: 'objectwraphandle' });

var i, obj;

for (i = 1; i <= 1e9; i += 1) {
  obj = new bindings.MyObject(i);
  if (i % 1e6 === 0) {
    console.log(i);
  }
}

This test is similar to the epoll brute-force-leak-check but creates millions of nan MyObject instances instead (objectwraphandle.cpp).

If this test is placed in the nan test/js folder and run from there, the same issue can be seen. With Node.js v0.10.40 and nan v1.8.4, memory usage is extremely stable and everything looks good. With Node.js v0.10.40 and nan v2.0.0, memory usage keeps increasing and the test eventually terminates after throwing an instance of std::bad_alloc.

Unfortunately, it may take a long time for std::bad_alloc to be thrown, even on a machine with only 2GB of RAM, but the constant increase in memory usage is immediately visible. This is then followed by swapping to the hard disk after a relatively short period of time.

Additional information:

$ node -v
v0.10.40
$ uname -a
Linux linuxbox 3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:50 UTC 2015 i686 i686 i686 GNU/Linux
$ 

Is the test a little too stupid/synthetic or is there an issue somewhere?

fivdi commented Aug 1, 2015

Further testing shows that the destructor ~MyObject is called when nan v1.8.4 is used, but not when nan v2.0.0 is used. Maybe something is being Ref'ed but not Unref'ed when required.
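
One way to make that visible from inside the addon is to count constructions and destructions. A minimal sketch, assuming a nan 2.x ObjectWrap class shaped roughly like the MyObject used by the test above; the counters, the periodic printf, and the module name are added for illustration, this is not the actual objectwraphandle.cpp:

// memory-leak-check.cc -- illustrative sketch, not the real test addon.
// Counts how many wrappers are constructed vs. destructed; with the leak
// described above, `destructed` stays at 0 because the weak callback never runs.
#include <nan.h>
#include <cstdio>

static long constructed = 0;
static long destructed = 0;

class MyObject : public Nan::ObjectWrap {
 public:
  static NAN_MODULE_INIT(Init) {
    v8::Local<v8::FunctionTemplate> tpl = Nan::New<v8::FunctionTemplate>(New);
    tpl->SetClassName(Nan::New("MyObject").ToLocalChecked());
    tpl->InstanceTemplate()->SetInternalFieldCount(1);
    Nan::Set(target, Nan::New("MyObject").ToLocalChecked(),
             Nan::GetFunction(tpl).ToLocalChecked());
  }

 private:
  explicit MyObject(double value) : value_(value) { ++constructed; }
  ~MyObject() { ++destructed; }

  static NAN_METHOD(New) {
    MyObject *obj = new MyObject(Nan::To<double>(info[0]).FromJust());
    obj->Wrap(info.This());
    // Log the counters occasionally so the trend is visible while the JS loop runs.
    if (constructed % 1000000 == 0)
      printf("constructed=%ld destructed=%ld\n", constructed, destructed);
    info.GetReturnValue().Set(info.This());
  }

  double value_;
};

NODE_MODULE(memoryleakcheck, MyObject::Init)

Driving this with the JS loop from the first post shows directly whether the destructors (and hence the weak callbacks) ever run.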

kkoopa commented Aug 1, 2015

What if you try Node 0.12? Could be the Persistent here preventing the weak callback from firing.

fivdi commented Aug 1, 2015

Switching to Node.js v0.12.7 and nan v2.0.0 fully solves the issue for the test code listed in the first post above and partially solves the issue for the epoll brute-force-leak-check. This epoll test is still leaking, but at a slower rate. Maybe there is another similar issue in nan, or I've messed something up in epoll.

kkoopa commented Aug 2, 2015

I'm quite confident that the Persistent I mentioned is leaking for < 0.12. I'll fix it by reimplementing the whole class tomorrow. I don't know about the rest.

kkoopa commented Aug 2, 2015

Try #392 and see if that helps with Node 0.10.

fivdi commented Aug 2, 2015

The following was performed with both Node.js v0.10.40 and v0.12.7:

git clone https://github.com/kkoopa/nan.git
cd nan/
git checkout memleak
npm i
npm run-script rebuild-tests

memory-leak-test.js was created in nan/test/js:

const testRoot = require('path').resolve(__dirname, '..')
  , bindings = require('bindings')({ module_root: testRoot, bindings: 'objectwraphandle' });

var i, obj;

for (i = 1; i <= 1e9; i += 1) {
  obj = new bindings.MyObject(i);
  if (i % 1e6 === 0) {
    console.log(i);
  }
}

Here's the output when memory-leak-test.js is run using Node.js v0.10.40 (St9bad_alloc this time, not std::bad_alloc):

$ node memory-leak-test.js 
1000000
2000000
3000000
4000000
5000000
6000000
7000000
8000000
9000000
10000000
11000000
12000000
13000000
14000000
15000000
16000000
17000000
18000000
19000000
20000000
21000000
22000000
23000000
24000000
25000000
26000000
27000000
28000000
29000000
30000000
terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
$ 

Here's the output when memory-leak-test.js is run using Node.js v0.12.7:

$ node memory-leak-test.js 
node: ../../nan_object_wrap.h:100: static void Nan::ObjectWrap::WeakCallback(const Nan::WeakCallbackInfo<Nan::ObjectWrap>&): Assertion `wrap->handle_.IsNearDeath()' failed.
Aborted (core dumped)
$ 

The leak is still there with Node.js v0.10.40, and v0.12.7 has a new issue.

kkoopa commented Aug 2, 2015

There, I've added a commit which should take care of the problems. Thanks for the reproducible tests.

fivdi commented Aug 3, 2015

The epoll module was also leaking. Nan::HandleScope() was being used instead of Nan::HandleScope scope, and this appears to have caused the leaks. The epoll nan2 branch has been updated with fixes.
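
For anyone hitting the same thing, the distinction is easy to miss; here is a minimal sketch of the two forms (illustrative only, not the actual epoll code):

#include <nan.h>

void Broken() {
  // Temporary scope: constructed and destroyed within this one statement, so
  // it does not cover the handles created below -- they pile up in whatever
  // scope encloses the call (if any) instead of being released here.
  Nan::HandleScope();
  v8::Local<v8::Object> obj = Nan::New<v8::Object>();
  (void) obj;
}

void Fixed() {
  // Named local: stays alive until the end of the function and releases the
  // handles created inside it when it goes out of scope.
  Nan::HandleScope scope;
  v8::Local<v8::Object> obj = Nan::New<v8::Object>();
  (void) obj;
}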

Here's a summary of some test results (hopefully there are no mistakes here)

No.  Node.js   nan                        Test                                            Result
1    v0.12.7   nan v2.0.0                 memory-leak-test.js                             no leaks
2    v0.12.7   kkoopa nan memleak branch  memory-leak-test.js                             no leaks
3    v0.12.7   nan v2.0.0                 epoll nan2 branch - brute-force-leak-check.js   no leaks
4    v0.12.7   kkoopa nan memleak branch  epoll nan2 branch - brute-force-leak-check.js   LEAKS
5    v0.10.40  nan v2.0.0                 memory-leak-test.js                             leaks
6    v0.10.40  kkoopa nan memleak branch  memory-leak-test.js                             no leaks
7    v0.10.40  nan v2.0.0                 epoll nan2 branch - brute-force-leak-check.js   leaks
8    v0.10.40  kkoopa nan memleak branch  epoll nan2 branch - brute-force-leak-check.js   LEAKS

The leak in nan v2.0.0 with v0.10.40 (test no. 5) is no longer there if the kkoopa nan memleak branch is used (test no. 6). However, there appears to be a new source of leaks. The epoll nan2 branch doesn't leak with nan v2.0.0 and v0.12.7 (test no. 3), but it does leak with the kkoopa nan memleak branch and v0.12.7 (test no. 4).

I won't have access to a development computer for the next few weeks.

kkoopa commented Aug 3, 2015

@bnoordhuis Any suggestions? I can't think of anything more that could be leaking.

kkoopa commented Aug 3, 2015

@fivdi What do you base the leaking on in tests 4 and 8? I'm seeing the same memory growth with both NAN 2.0.0 and NAN 1.8.4. In the nan2 branch you have also increased the count from 1e6 to 1e9, making it run longer and consume more memory.

fivdi commented Aug 3, 2015

The leaking in tests 4 and 8 is based on the fact that memory usage in test 3 is extremely stable and doesn't increase on the test machine. In test 4 memory usage keeps increasing and increasing and then there is swapping. I didn't have time to wait and see if there was a bad_alloc error with test 4 as I had to leave for the airport. The only difference between tests 3 and 4 is the branch of nan being used. The count was increased from 1e6 to 1e9 to test longer.

kkoopa commented Aug 4, 2015

Yeah, the memory usage is increasing, but I don't know why. The output from valgrind is useless. It reports a lot as definitely leaking, but the memory does get deallocated upon garbage collection, yet memory usage still increases.

fivdi commented Aug 4, 2015

I'm not sure if everything that valgrind claims is a leak is really a leak when V8 is being used. AFAIK, when V8 knows that the process is about to terminate, destructors are purposely not called. I'm afraid I don't know why memory usage increases either.

Raynos commented Aug 5, 2015

I've also seen a production memory leak.

Downgrading nan@2.0.0 to nan@1.8.4 fixes my memory leak.

It leaks about 200 MB/hour for me.

kkoopa commented Aug 5, 2015

@Raynos is that with the commits from #392 which fix part of the leaks?

bnoordhuis (Member) commented:

Preliminary finding: it doesn't look like a persistent handle leak. I see the number of persistent objects go up and down over time, like you'd expect in a GC-managed environment. Destructors are called properly, too.

kkoopa commented Aug 5, 2015

Same as I saw. Diffing snapshots from heapdump also shows nothing, but growth does happen when monitoring the process with top.

bnoordhuis (Member) commented Aug 5, 2015

Had some more time to look into it. I can't reproduce the std::bad_alloc exception with v3.0.0. RSS and heap total stabilize after a while, heap used fluctuates but that's the expected rise-then-drop saw wave of the GC at work.

kkoopa commented Aug 5, 2015

Did you try 0.12.7? That's what I used. With NAN 1.8.4, memory usage of Node stabilizes at 5 % of max, but with 2.0.1 it keeps growing past 20 % and does not seem to stop.

fivdi commented Aug 5, 2015

@kkoopa what is 2.0.1? master? If so, I wouldn't expect to see any problems with v0.12.7. That would be almost the same as tests 1 and 3 in the table above. With v0.12.7, I'd only expect to see issues with the configuration from test 4.

kkoopa commented Aug 5, 2015

No, the memleak branch.

kkoopa commented Aug 5, 2015

I just tried with io.js 3.0 and the memleak branch. Memory grew until about 80 % usage and seemed to hover fairly stable around there.
Ran brute-force-leak-check.js to 24800000 instances created and events detected

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  7823 kkoopa    20   0 1847916 801952   2988 R 100.1 79.9   3:56.78 node

VIRT grew constantly though.

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  7823 kkoopa    20   0 2423684 818144   1092 R  99.8 81.5   6:48.21 node

But something is still strange, as Node 0.12.7 would stay at 5 % with NAN 1.8.4 but would exceed that with NAN 2.0.0.

Why does a bunch of memory get allocated but never used?

fivdi commented Aug 5, 2015

The figures I saw with top were similar. RES was 1.6GB rather than 800MB, but that was also about 80% on the test machine, if I remember correctly. It stayed at about 1.6GB while VIRT kept increasing and increasing. The increase in VIRT resulted in thrashing, %CPU dropped to way below 100%, and everything got really, really slow.

My interpretation is not that memory usage (RES/RAM) hovers at 80% and remains stable, but that it's stuck at 80% because no more memory is available. The process wants more RAM but isn't getting it. Instead, it gets VIRTual memory on the hard disk, starts thrashing, and gets really, really slow.

kkoopa commented Aug 6, 2015

You may be right about the analysis.

Assuming it is not handles that are leaking, I searched for references to malloc and new. Very few occurrences were found. The only one that could possibly leak is the uv_async_t at line 1570 of nan.h, but that one is not even in use in this scenario, so it cannot affect the test program. All the others seem to meet a delete at some point.

What is left that could be leaking?

bnoordhuis (Member) commented:

@fivdi What happens when you pass --max_old_space_size=256? If the problem is that the JS heap grows larger than available physical memory, limiting the size of the old space (the main JS heap) should ameliorate that.

kkoopa commented Aug 6, 2015

It just keeps growing as before.

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  8349 kkoopa    20   0 1147904 327176   7852 R  94.3 32.6   2:20.59 node

It's the heap of the node process that grows at an alarming rate, independent of the js heap.

kkoopa commented Aug 6, 2015

I ran 0.12.7 with memwatch now. JS heap is fine, but process heap grows unfettered.

           2000000 instances created and events detected
{ num_full_gc: 20,
  num_inc_gc: 60,
  heap_compactions: 20,
  usage_trend: 0,
  estimated_base: 4079950,
  current_base: 4083424,
  min: 3641096,
  max: 4462424 }

kkoopa commented Aug 6, 2015

Made some progress. Turns out the leak only happens when using the new Nan::ObjectWrap. If it is replaced with node::ObjectWrap, everything remains stable.

fivdi commented Aug 6, 2015

That's good news :)

kkoopa commented Aug 6, 2015

And it seems to be caused either by the Nan::Persistent itself or, more likely, by the weak callback and SetWeak.

kkoopa commented Aug 6, 2015

Further investigation shows that it is something with SetWeak and the weak callback that leaks.

kkoopa commented Aug 6, 2015

Seems like the WeakCallbackInfo created here only gets freed every other time from here. That is, every second time Nan::Persistent::SetWeak is run, the weak callback fails to execute.
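
For context, this is the general shape of the mechanism in question; a sketch of the weak-callback pattern only, using a hand-rolled Wrapper class for illustration, not nan's actual internals:

#include <nan.h>

// Sketch of the weak-callback pattern under discussion.
class Wrapper {
 public:
  explicit Wrapper(v8::Local<v8::Object> handle) {
    persistent_.Reset(handle);
    // Ask V8 to call WeakCallback once `handle` is only reachable through
    // this Persistent; the callback is the only place the C++ side is freed.
    persistent_.SetWeak(this, WeakCallback, Nan::WeakCallbackType::kParameter);
  }
  ~Wrapper() { persistent_.Reset(); }

 private:
  static void WeakCallback(const Nan::WeakCallbackInfo<Wrapper> &info) {
    delete info.GetParameter();  // if this is skipped, the Wrapper leaks
  }

  Nan::Persistent<v8::Object> persistent_;
};

If the weak callback only runs for every second instance, half the wrappers (and their WeakCallbackInfo bookkeeping) are never deleted, which would match the steady growth observed in the tests above.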

kkoopa commented Aug 6, 2015

I still cannot wrap my head around why this acts in such a strange way. Localizing the problem only made it weirder.

s-a commented Aug 7, 2015

ref: s-a/iron-node#44

addaleax added a commit to addaleax/lzma-native that referenced this issue Aug 9, 2015
nan 2.x.x support currently blocks on nodejs/nan#389.
Until then, don’t test iojs v3.0.0 (for which the build fails
with nan 1.x.x)
kkoopa added a commit that referenced this issue Aug 10, 2015
kkoopa commented Aug 10, 2015

Fixed in 98d38c1

kkoopa closed this as completed Aug 10, 2015
fivdi commented Aug 24, 2015

@kkoopa good work! memory usage is nice and stable now.
