
Node 4.2 large memory spikes and timeouts on Heroku #3370

Closed

AndrewBarba opened this issue Oct 14, 2015 · 43 comments
Labels
memory: Issues and PRs related to memory management or memory footprint.
performance: Issues and PRs related to the performance of Node.js.

Comments

@AndrewBarba

Below is a screenshot of the past 12 hours, which have been a total disaster for us, to say the least. After updating to Node 4.2 (from 0.10) in production, we immediately exceeded all memory quotas and experienced a high volume of timeouts (even with no load and memory under the 1 GB limit).

First, I apologize if this is not the place for this. I am happy to move the discussion somewhere else, and we will help diagnose whatever you guys need. We went through this same parade with Node 0.12 and had to downgrade to 0.10.

Second, and I guess the real question here: is Heroku's 512 MB of RAM simply not enough to run Node 4.x? If that is the case, cool, but the memory constraints definitely need to be made clearer.

Timeline:

  • Tue, Oct. 13th, at 2pm EST we deployed Node 4.2.0 with cluster enabled, running 2 worker processes. Immediately hit the 512 MB memory limit, as seen in the picture below.
  • Tue, Oct. 13th, at 2:15pm EST we removed cluster completely. Hit memory limits 30 minutes later.
  • Tue, Oct. 13th, at 4:00pm EST we saw Node 4.2.1 was released and deployed 4.2.1. Continued to hit memory limits.
  • Tue, Oct. 13th, at 6:00pm EST we doubled memory to 1 GB.

In general you can see the memory is all over the place; maybe that is expected with newer versions of V8...

[Screenshot: Heroku memory usage graph, Oct 14, 2015, 12:08 PM]

Although I don't have a screenshot, you can see in the first part of the graph, running Node 0.10, that memory stays almost perfectly flat at 256 MB of RAM. Under any load, that was consistent.

For reference, here is a load test we did in a dev environment running Node 4.2.1, cluster forked to 4 processes, and about 5k requests per minute. It also immediately hit the higher 1 GB memory limit. We then dropped this down to 2 forked processes with the same result.

[Screenshot: load-test memory usage graph, Oct 14, 2015, 12:26 PM]

@jasnell
Member

jasnell commented Oct 14, 2015

@rvagg @trevnorris @Fishrock123 ... ideas?

@mscdex added the memory label on Oct 14, 2015
@bnoordhuis
Member

What happens when you start with (for example) node --max_old_space_size=256? The GC is lazy (lazier than v0.10) and the default upper limit is about 1.5 GB so you'll need to cap it if your machine has less memory.
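For example, a minimal way to apply that cap on a 512 MB dyno (server.js here is just a placeholder for your app's entry point) would be:

# Cap V8's old generation at 256 MB so the collector kicks in well before the
# dyno's 512 MB limit; server.js is a placeholder entry point.
node --max_old_space_size=256 server.js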

@AndrewBarba
Author

Okay, just tried running with --max_old_space_size=256 using cluster and 4 processes. Hit the limit immediately, so dropping down to 1 process again. Is the 256 MB per process?

2015-10-14T16:46:29.457954+00:00 heroku[web.1]: source=web.1 dyno=heroku.18327029.28d659a4-2e6e-4ee5-9fb9-18dc6d99e1de sample#memory_total=578.31MB sample#memory_rss=445.58MB sample#memory_cache=0.00MB sample#memory_swap=132.73MB sample#memory_pgpgin=338334pages sample#memory_pgpgout=225287pages

@bnoordhuis
Member

Is the 256mb per process?

Yes.

@AndrewBarba
Author

Dropping down to 1 process, --max_old_space_size=256 and --max_old_space_size=128 both returned VERY consistent results. Huge difference. Here is running at 128:
[Screenshot: memory usage graph with --max_old_space_size=128, Oct 14, 2015, 1:01 PM]
And here is running at --max_old_space_size=512:
[Screenshot: memory usage graph with --max_old_space_size=512, Oct 14, 2015, 12:59 PM]

@jasnell
Member

jasnell commented Oct 14, 2015

@bnoordhuis ... looking at this, we definitely want to make sure this is documented better.

@AndrewBarba
Author

Yes, very eye-opening for me at least. Fingers crossed, but I think you guys just saved us from having to revert everything again. Really appreciate the quick response. Going to leave this open through the end of the day while we do a few more tests on our end.

@bnoordhuis
Member

@jasnell I don't disagree but where would you put it? Maybe it's time we add a FAQ.

@mscdex
Contributor

mscdex commented Oct 14, 2015

FWIW, I just added a FAQ to the wiki here, and it is linked from the main wiki page. Perhaps it could also be linked somewhere on nodejs.org and other places?

@friism

friism commented Oct 15, 2015

/cc @hunterloftis

@jbergstroem
Member

Perhaps dig through a few old issues that shared the same type of characteristics, so Google/SEO/etc. can drive people in the right direction?

@hunterloftis

@friism I recommend --max_old_space_size when this occasionally pops up; the biggest downside is that when you specify that flag and then exceed the limit, your app will shut down hard. Heroku allows you to burst to 5x the memory limit, so it's not always best to crash an app instead of letting it temporarily exceed the soft limit.

This is probably documentation we should provide as well in a 'Memory-management with Node on Heroku' article.

@AndrewBarba
Author

@hunterloftis From what we've seen, the second we hit a memory limit, everything pretty much starts timing out. I'd actually rather have a hard kill at that point so we can get a fresh node up that can respond to requests again.

You mention that if we specify --max_old_space_size the node will shut down immediately if we hit that limit, but doesn't that flag relate to V8's heap and when it should start cleaning things up? I don't see how that relates to Heroku's memory limit and when a node is killed (other than by exceeding Heroku's limit, of course). In the tests above where we passed --max_old_space_size=128, the memory on the node was around 220 MB.

@hunterloftis

@AndrewBarba Yeah, it looks like your app is particularly hard-hit by lazy collection, and in that case the lesser evil is shutting down a process.

I'd be interested in hearing from @bnoordhuis on this but, from what I've seen, if your app requires more space than max_old_space_size allows (i.e., you're actually storing > X MB in objects, vs. storing < X MB in objects and needing more frequent sweeps to clear out things you don't store anymore)... then you'll get 'allocation failed' errors (hard shutdowns).

Also keep in mind that 'old space' is just one part of the application's memory footprint, so you can expect the app as a whole to use more than max_old_space_size.
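A quick way to see that gap (a minimal sketch; the flag value is arbitrary) is to compare V8's heap numbers with the process RSS:

# Print V8 heap usage next to the process RSS; RSS is normally well above
# heapTotal because of code, stacks, Buffers and other non-heap memory.
node --max_old_space_size=128 -e "console.log(process.memoryUsage())"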

@trevnorris
Contributor

You might try playing with flags like --gc_interval and/or --always_compact. They will slow down the application, but if there's a memory crunch it's worth giving them a go.

@hunterloftis

For a kitchen sink example with arbitrary values:

node --gc_global --optimize_for_size --max_old_space_size=960 --use_idle_notification --always_compact --max_executable_size=64 --gc_interval=100 --expose_gc server.js

@bnoordhuis
Member

@hunterloftis Yes, --max_old_space_size is a hard limit. --use_idle_notification is currently a no-op, by the way.

@claudiorodriguez
Contributor

Just weighing in to share my experience: I had this happen to me with 0.12.x and 4.0.x. Before I migrated everything to 4.0, Heroku Support recommended --max_old_space_size and it worked like a charm. I still get some instances of the error every now and then, but the affected dyno shuts down and no great harm is done, just a couple dozen timed-out requests once a week or so (out of thousands per minute).
Also, migrating to 4.0.x cut memory usage from an average of about 500 MB to 270 MB per dyno.

@AndrewBarba
Author

@fansworld-claudio That's great to hear. What size dyno are you running, and what did you end up setting max_old_space_size to? Also, have you used this setting with cluster?

@claudiorodriguez
Contributor

@AndrewBarba At first I had the Standard-2X (1 GB) dynos and max_old at 960; then, after a couple of weeks of stability, I scaled down to Standard-1X (512 MB) and max_old at 480. It has been stable for a couple of months. YMMV though; this is a REST API that does mostly I/O with Redis and Mongo.

@mike-zorn

It's kind of a pain to set this for all your Heroku apps (and to remember to vary it based on dyno size), so I made heroku-node, which wraps node and sets max_old_space_size based on the values of $WEB_MEMORY and $WEB_CONCURRENCY, which Heroku sets based on the size of the dyno you've chosen.
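The general idea is roughly the following (just a sketch of the approach, not the package's actual code or formula):

# Sketch of a wrapper that derives a per-process old-space cap from Heroku's
# WEB_MEMORY hint, leaving headroom for non-heap memory.
WEB_MEMORY="${WEB_MEMORY:-512}"
exec node --max_old_space_size=$((WEB_MEMORY / 2)) "$@"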

@AndrewBarba
Author

@ApeChimp I just gave that a go, and this showed up in the console immediately:

2015-10-31T14:51:14.352577+00:00 app[web.1]: <--- Last few GCs --->
2015-10-31T14:51:14.352577+00:00 app[web.1]:
2015-10-31T14:51:14.352579+00:00 app[web.1]:    14856 ms: Mark-sweep 48.0 (86.0) -> 48.0 (86.0) MB, 93.5 / 0 ms [allocation failure] [GC in old space requested].
2015-10-31T14:51:14.352580+00:00 app[web.1]:    14967 ms: Mark-sweep 48.0 (86.0) -> 48.0 (86.0) MB, 111.4 / 0 ms [allocation failure] [GC in old space requested].
2015-10-31T14:51:14.352581+00:00 app[web.1]:    15055 ms: Mark-sweep 48.0 (86.0) -> 48.0 (86.0) MB, 87.9 / 0 ms [last resort gc].
2015-10-31T14:51:14.352582+00:00 app[web.1]:    15178 ms: Mark-sweep 48.0 (86.0) -> 48.0 (86.0) MB, 122.7 / 0 ms [last resort gc].
2015-10-31T14:51:14.352582+00:00 app[web.1]:
2015-10-31T14:51:14.352583+00:00 app[web.1]:
2015-10-31T14:51:14.352583+00:00 app[web.1]: <--- JS stacktrace --->
2015-10-31T14:51:14.352584+00:00 app[web.1]:
2015-10-31T14:51:14.352585+00:00 app[web.1]: ==== JS stack trace =========================================

It looks like you are dividing WEB_MEMORY by some factor of concurrency, but WEB_MEMORY, as explained by the article you reference, is the recommended memory for each process. As is, running this on the Performance-M dyno would give you:

WEB_CONCURRENCY = 5;
WEB_MEMORY = 512;
WEB_MEMORY / (2 * WEB_CONCURRENCY) = 51.2MB

51.2 MB of RAM is not enough to even start up. I think you can just simplify to WEB_MEMORY / 2 or even WEB_MEMORY / 1.5. For us, 256 seems to be the sweet spot, so we'll stick with / 2.

@mike-zorn

@AndrewBarba, mea culpa. That's fixed as of [email protected].

@piranna
Contributor

piranna commented Nov 19, 2015

Couldn't the garbage collector also be executed when getting an out-of-memory exception? This would help in a more generic way on systems with really low memory constraints; for example, when executing npm on NodeOS it starts killing processes on QEMU with the default memory settings (128 MB), and the same would happen on a Raspberry Pi without swap...

@ChALkeR
Member

ChALkeR commented Nov 19, 2015

@piranna What exact exception do you mean? If you mean allocation failures, catching all of those in some generic way is a terrible idea; many things could go wrong from that. If you are speaking about the system OOM killer, intercepting it requires specific settings at the system level.

@piranna
Contributor

piranna commented Nov 19, 2015

@piranna What exact exception do you mean?

The error I get is an Out of Memory error, thrown at the process level:
[Screenshot: QEMU console output showing the kernel OOM killer]
It happens when calling malloc() and the system doesn't have enough free memory, or no contiguous free chunk big enough for the requested size. I propose that when creating a new JavaScript object, if reserving the memory fails, the C++ layer catches that failure, runs the garbage collector, and tries to reserve it again. Probably this is more of a V8 issue...

@ChALkeR
Member

ChALkeR commented Nov 19, 2015

@piranna That screenshot you have posted shows the OOM killer being triggered. It's not interceptable by default.

Also, it's not necessarily caused by the node process. Your process could allocate some memory at the beginning, then do nothing and still be killed by the OOM killer later (while doing nothing).

@piranna
Contributor

piranna commented Nov 19, 2015

@piranna That screenshot you have posted shows the OOM killer being triggered. It's not interceptable by default.

What a shame, I thought it would be a good feature :-(

Also, it's not necessarily caused by the node process. Your process could allocate some memory at the beginning, then do nothing and still be killed by the OOM killer later (while doing nothing).

Yeah, I know it kills random processes after that. In fact, in that screenshot the error was caused by slap, but nsh stopped working too :-/

@joanniclaborde

For what it's worth, here's the solution we're trying now (I sure wish I had found this conversation before!!):

if [ ! "$WEB_MEMORY" = "" ]; then
  if [ $WEB_MEMORY -le 512 ]; then
    NODE_FLAGS="--max_semi_space_size=2 --max_old_space_size=256 --max_executable_size=192"
  elif [ $WEB_MEMORY -le 768 ]; then
    NODE_FLAGS="--max_semi_space_size=8 --max_old_space_size=512 --max_executable_size=384"
  elif [ $WEB_MEMORY -le 1024 ]; then
    NODE_FLAGS="--max_semi_space_size=16 --max_old_space_size=1024 --max_executable_size=512"
  fi
fi

node $NODE_FLAGS "$@"

I'm getting the values for those flags from the V8 defaults, and it seems to be working great so far (Heroku / 512 MB).
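If that snippet lives in a small executable wrapper script (the start-node.sh name below is just an example), the Procfile entry stays simple, with the real entry point passed as the argument:

web: ./start-node.sh server.js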

@piranna
Contributor

piranna commented Dec 11, 2015

the default upper limit is about 1.5 GB

Would it be possible to lower it when the sum of physical memory + swap is smaller than this? It doesn't make sense to have such a big limit if it's impossible to reach it... This would help on memory-constrained systems.

@hunterloftis

@piranna see #3370 (comment) and #3370 (comment)

@piranna
Contributor

piranna commented Dec 11, 2015

@piranna see #3370 (comment) and #3370 (comment)

Yes, I've read them; what I'm asking for is for these values to be calculated and set automatically on start, instead of setting them with flags.

@joanniclaborde

Or even a single flag could be a nice option: node --physical_memory=$WEB_MEMORY

@joanniclaborde

I also added custom settings for modulus.io's 396 MB servos:

if [ $1 -le 396 ]; then
  NODE_FLAGS="--max_semi_space_size=1 --max_old_space_size=198 --max_executable_size=148"
elif ...

@piranna
Contributor

piranna commented Dec 11, 2015

Or even a single flag could be a nice option: node --physical_memory=$WEB_MEMORY

Isn't the max_old_space_size flag equivalent to this? Maybe it could be extended to accept auto, to calculate the limit from the current system memory...

@TylerBrock

Because setting max-old-space-size just crashes your application when the limit is reached, it isn't very helpful. I made this: https://github.com/HustleInc/regiment, which will seamlessly create and preemptively replace workers so that you never hit max-old-space-size.

@TylerBrock

Follow up: this is a pure win. I'm running several 2X dynos in production for a very leaky app with Regiment.middleware.MemoryFootprint(750), and I've had no dropped requests, no memory overuse, and my aggregate memory usage hovers around 700-750 MB/dyno.

[Screenshot: memory usage graph holding steady around 700-750 MB per dyno]

You do not want to just set --max-semi-space-size, max-old-space-size, max-executable-size, etc., because when those limits are reached your node process simply crashes and any currently running requests are cut short.

@abienkowski

Thank you for the detailed analysis.

Just spent all day debugging a critical infrastructure performance issue, and finally found that one of the services had been moved to v4.2.8. After rolling back to v0.12, everything returned to normal.

@ChALkeR added the performance label on Feb 16, 2016
@bjfletcher

@TylerBrock you mentioned that:

So because setting max-old-space-size just crashes your application when the limit is reached

I've been experimenting and noticed that if --max-old-space-size is used without --max_semi_space_size also being adjusted then, yes, the application will get killed. If --max_semi_space_size is set to 1, it usually doesn't get killed.
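To make that concrete, the combination looks something like this (the numbers are illustrative, not the experiment's exact values, and server.js is a placeholder entry point):

# Pair a capped old space with a small semi-space (new space); values here are
# illustrative and server.js is a placeholder entry point.
node --max-old-space-size=256 --max_semi_space_size=1 server.js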

If anyone's interested, I've written more about the experiment here. If you have any ideas on how I can improve the experiment & understanding, I'd really appreciate it.

@davibe

davibe commented Jul 12, 2016

I am testing --max-old-space-size=128 and I can confirm that when the application exceeds 128 MB of actually used memory, the process gets killed.

@damianmr

damianmr commented Jul 12, 2016

Hello, the info in this issue was very useful. I made a module based on @joanniclaborde's suggestion.
It is here: https://github.com/damianmr/heroku-node-settings

The good thing is that the node process never reaches its maximum available memory, so it never crashes on Heroku.

We have been using it in a real project with thousands of requests per minute and have had no crashes whatsoever.

I encourage you to try it out and report any problems you find.

@joanniclaborde

Nice!! We've been using those same settings for a few months now, and it's running fine. Good idea to turn that code into a module, @damianmr!

@Lwdthe1

Lwdthe1 commented Aug 12, 2018

🔥 I had a terrible memory leak somewhere: it turned out to be another node module that my system depends on. I took it out and my memory usage went from an average of 8 GB down to 4 GB. That still meant I had to pay for the $500-a-month plan to keep my app from exceeding the memory limit of the lower plans. I looked everywhere and haven't been able to find out why my app is still consuming so much memory.

💯 🙌 Then I came across this today and started using @damianmr's heroku-node-settings. Now I have more time to look into any potential leaks without having to fork over $500/month, since I've downgraded to the 2X plan and can actually run more than 1 dyno! Thanks, everyone, for the tips that led to the solution. Here's a pic that's worth sharing:
[Screenshot: Heroku memory metrics after switching to heroku-node-settings]

Strangely enough, my app's memory usage now shows as roughly constant rather than the exponential rise from before.
[Screenshot: memory usage graph, now roughly constant]
