
Server down/stalled... #944

Closed
Martii opened this issue Apr 5, 2016 · 59 comments
Labels
bug (You've guessed it... this means a bug is reported.), HOST (Usually the VPS.), stability (Important to operations.)

Comments

Martii (Member) commented Apr 5, 2016

I'm unable to get into the VPS to restart it, and the site just spins in a web browser. NOTE: This is purely a VPS issue with our provider, not the project nor the node configuration.

Messaged @sizzlemctwizzle Cc: @jonleibowitz


Last script update on local pro at 2016-04-05T12:07:05.214Z

Refs:

Martii added the expedite (Immediate and on the front burner.) label Apr 5, 2016
Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Apr 6, 2016
* Reinstate *toobusy-js*... at least one of their timers has been fixed on shutdown. See OpenUserJS#354, OpenUserJS#353, OpenUserJS#352 and the base issue OpenUserJS#345... loosely related to OpenUserJS#249. This also attempts to address OpenUserJS#944 with a work-around... the VPS should be faster than our old one, so perhaps the timers don't make as much of a difference. Start with our old default lag value... this may introduce too many 503's again, but hopefully not.
* Retested delete op
* Bug fixes, tests, and docs updates... please read their CHANGELOGS
* Shut down the server on SIGINT
* Modify db closure to not have dependents
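For reference, a minimal sketch of what the *toobusy-js* middleware plus SIGINT shutdown described in that commit might look like; the lag value, the 503 message, and the listen port are assumptions for illustration, not the project's exact code:

```js
'use strict';

var express = require('express');
var toobusy = require('toobusy-js');

var app = express();

// Start with the old default lag threshold in ms (assumed value).
toobusy.maxLag(70);

// Answer with a 503 while the event loop is lagging.
app.use(function (aReq, aRes, aNext) {
  if (toobusy()) {
    aRes.status(503).send('The server is too busy right now. Please try again later.');
    return;
  }
  aNext();
});

var server = app.listen(process.env.PORT || 8080);

// Shut down the server on SIGINT and clear toobusy's internal polling timer
// so the process can actually exit instead of hanging.
process.on('SIGINT', function () {
  server.close(function () {
    toobusy.shutdown();
    process.exit(0);
  });
});
```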
Martii mentioned this issue Apr 6, 2016
Martii (Member, Author) commented Apr 6, 2016

Found a break in the system... it's either a hiccup in whatever is causing this, or my distro update on the laptop (which may have a client compatible with the latest Debian), or sizzle... although I did try a Windows VM (virtual machine) and PM (physical machine), a Debian VM, an ArchLinux VM, an ArchLinux PM, and other Linux PMs too, and those failed... so I'm not entirely sure. (Too many internet issues everywhere today.)

I have already seen a 503 with toobusy-js on login that is probably GH's issue (still guessing here)... leaving this open for a more detailed investigation over the next few days. Apologies for this unscheduled outage... definitely out of my control at this time.

Btw dist-upgrade yielded no further updates. :\

Martii (Member, Author) commented Apr 6, 2016

Looks like it's still down.

Martii (Member, Author) commented Apr 6, 2016

No news yet.

Martii added the bug (You've guessed it... this means a bug is reported.) and HOST (Usually the VPS.) labels Apr 6, 2016
Martii (Member, Author) commented Apr 6, 2016

PENDING!... got 5 "too busy"s on login... but I now have access... we'll see how long this stays up on the VPS.

Martii (Member, Author) commented Apr 6, 2016

One server restart detected... investigating.

lelinhtinh commented Apr 6, 2016

503 ...

Martii (Member, Author) commented Apr 6, 2016

I know... it's going to take a bit to resolve this... something is chewing up memory and causing the VPS to crash... this was happening before the 503 addition of toobusy-js... I'm probably going to take the server down, do some recompilations, and see if that helps; that's why it's in PENDING status right now.

Patience please. :)

Martii self-assigned this Apr 6, 2016
Martii (Member, Author) commented Apr 6, 2016

Downgrading node didn't help... it seems that malloc (or whichever low-level lib is being used) isn't freeing up memory in the distro/VPS.

I'm going to try disabling script minification just to be sure, via an environment variable to be added... don't worry, I'll have it pass through to the unminified source so it doesn't break scripts.
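A rough sketch of the kind of env-gated pass-through being described; the DISABLE_SCRIPT_MINIFY name and the serveScriptSource helper are hypothetical, just to show the idea:

```js
'use strict';

// Hypothetical helper: return either minified or raw script source depending
// on an environment toggle, so installs still work when minification is off.
function serveScriptSource(aSource, aMinify) {
  if (process.env.DISABLE_SCRIPT_MINIFY === 'true') {
    return aSource; // pass through the unminified source
  }

  try {
    return aMinify(aSource);
  } catch (aErr) {
    // If the minifier chokes, fall back to the unminified source
    // rather than breaking the script install.
    return aSource;
  }
}

// Usage with any minifier of shape (code) => code:
// var out = serveScriptSource(rawScript, someMinifierFunction);
```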

Martii (Member, Author) commented Apr 6, 2016

Still losing memory, although more slowly, with script minification disabled... i.e. the VPS is going to crash again... I'm watching free memory right now go down, down, up, down, down, down, up, down, down, down, etc... until eventually there is zero free memory.

So I've systematically ruled out our project and reaffirmed that this is a distro/VPS issue. :\

Martii (Member, Author) commented Apr 6, 2016

And there it goes. :\

Martii (Member, Author) commented Apr 6, 2016

I have to be AFK for a few hours... I'll be back to try some other things as soon as I can. :\ Leaving the site OFFLINE for the moment.

Martii (Member, Author) commented Apr 7, 2016

@sizzlemctwizzle and anyone watching,
So I've put up a constant 503 on all routes at the moment... it's not very pretty, but it will at least let everyone know "we're busy... try again later" (better than nothing). This is hard-coded into app.js with a manual FORCE_BUSY='true' in the env and is not here on GH dev yet... still running some tests to see if this portion stays up. So far we are holding at a constant ~6% memory usage... I'll monitor this for a few hours, sleep in between, wake up, and see if Debian has an update that fixes this before I make more reports, etc.
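Roughly, that hard-coded gate looks something like the following sketch, assuming FORCE_BUSY is read straight from the environment near the top of app.js; the message text and port are illustrative only:

```js
'use strict';

var express = require('express');
var app = express();

// When FORCE_BUSY='true' is set in the environment, answer every route
// with a 503 before any of the normal middleware or routes are mounted.
if (process.env.FORCE_BUSY === 'true') {
  app.use(function (aReq, aRes) {
    aRes.status(503).send(
      'We are experiencing technical difficulties. Please try again later.');
  });
}

app.listen(process.env.PORT || 8080);
```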

I've tried many different versions of node; all result in the same issue with this kernel image on the VPS... i.e. memory gets eaten up. Using the precompiled *node*s, the server lasts less than 5 minutes... with a manual build from *node* source I can sometimes get about 45 minutes of uptime. NEITHER OPTION IS SUITABLE, as I can't babysit the server that constantly.

I've also looked into backing out/rolling back the last dist-upgrade, but of course the old packages aren't available in the official repos... so that will fail.

Only three options are left that I can think of...

  1. Run the kernel recovery, assuming that works, and see if it fixes things. There is exactly one snapshot (and only one snapshot total) of this bad VM... so any twiddling can be undone for now. Eventually there will be a decent snapshot that we can roll back to.
  2. Recreate the VM from scratch and see if it still has this memory leak... if it does, switch distros in a new VM. Some of this is beyond my access, and @sizzlemctwizzle did some configuration that I'm not aware of (yet?).
  3. Wait... (as for this option, adding tracking upstream... I'll have to create an issue on Debian first, then nodejs Cc: @mikeal ... after slumber though)

Just a side note... all script sources are intact as far as I can see in local pro, i.e. this is not a DB issue. (Also made the HOST label here on GH, as you might have noticed already.)

Martii added the tracking upstream (Waiting, watching, wanting.) label Apr 7, 2016
Martii pushed a commit to Martii/OpenUserJS.org that referenced this issue Apr 7, 2016
* `BUSY_LAG` environment var so this can be twiddled with later
* `FORCE_BUSY` environment var to indicate technical difficulties with styling
* `FORCE_BUSY_ABSOLUTE` environment var to indicate technical difficulties with no UI
* Change the messages to suit

**NOTE**
This is also a test to see where the memory leak is happening... *mu2* isn't leaking: since the hard-coded 503 has been in place, the average memory usage has held around ~6%

Applies to OpenUserJS#944, OpenUserJS#249 and OpenUserJS#37
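A loose sketch of how these three environment variables could fit together; the default lag, the inline HTML stand-in for the styled busy page, and the port are assumptions, not necessarily what the commit landed:

```js
'use strict';

var express = require('express');
var toobusy = require('toobusy-js');

var app = express();

// BUSY_LAG lets the toobusy-js lag threshold be twiddled without a code change.
toobusy.maxLag(parseInt(process.env.BUSY_LAG, 10) || 70);

app.use(function (aReq, aRes, aNext) {
  // FORCE_BUSY_ABSOLUTE: technical difficulties with no UI at all.
  if (process.env.FORCE_BUSY_ABSOLUTE === 'true') {
    aRes.status(503).send('Technical difficulties. Please try again later.');
    return;
  }

  // FORCE_BUSY: technical difficulties, shown with styling. toobusy() also
  // trips this branch when the event loop is lagging under real load.
  if (process.env.FORCE_BUSY === 'true' || toobusy()) {
    aRes.status(503).send(
      '<!DOCTYPE html><html><body><h1>We are busy</h1>' +
      '<p>Technical difficulties. Please try again later.</p></body></html>');
    return;
  }

  aNext();
});

app.listen(process.env.PORT || 8080);
```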
Martii (Member, Author) commented Apr 7, 2016

~6.5% peak memory usage with styling applied to 503's

Manually enabling /about routes to test stability


Reinstalled all deps, and their deps, and so on... no dist-upgrade available.
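A simplified illustration of the route-at-a-time bisection being run here; the ENABLED_ROUTES variable, the mountIfEnabled helper, and the ./routes/* module paths are hypothetical, just to show the allow-list idea:

```js
'use strict';

var express = require('express');
var app = express();

// e.g. ENABLED_ROUTES="/about,/users" while hunting for the leak.
var enabledRoutes = (process.env.ENABLED_ROUTES || '').split(',');

function mountIfEnabled(aPath, aRouter) {
  if (enabledRoutes.indexOf(aPath) !== -1) {
    app.use(aPath, aRouter);
    return;
  }

  // Anything not on the allow-list keeps answering 503 during the test run.
  app.use(aPath, function (aReq, aRes) {
    aRes.status(503).send('This route is temporarily disabled while we test stability.');
  });
}

mountIfEnabled('/about', require('./routes/about')); // hypothetical module
mountIfEnabled('/users', require('./routes/users')); // hypothetical module

app.listen(process.env.PORT || 8080);
```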

Martii (Member, Author) commented Apr 7, 2016

~6.5% nominal and ~15% peak memory usage with /about routes ... no leaks detected

Manually enabling /users route to test stability

Martii (Member, Author) commented Apr 7, 2016

~7.1% nominal and ~8.6% peak memory usage with /users route ... slightly slower to release memory on /users/username/comments ... this will be cumulative during testing.

Manually enabling /forum route to test stability

Martii (Member, Author) commented Apr 7, 2016

~6.4% nominal and ~7.4% peak memory usage with /forum route

Manually enabling all other discussion routes, except the /scripts route's issue discussions, to test stability

Martii (Member, Author) commented Apr 7, 2016

~7.1% nominal and ~8.8% peak memory usage for global discussions

Manually enabling /group route excluding api search to test stability

Martii (Member, Author) commented Apr 7, 2016

~7.5% nominal and ~8.9% peak memory usage for /groups route

Manually enabling /libs route excluding general / route ... this doesn't include script installations just yet but does show Source Code tab ... to test stability

Martii (Member, Author) commented Apr 7, 2016

~7.5% nominal and ~7.7% peak memory usage for /libs route

Manually enabling /scripts route excluding general / route ... this also doesn't include script installations just yet but does show Source Code tab... to test stability
