Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

graceful shutdown #60059

Merged
merged 35 commits into from
Jan 16, 2024
Merged

graceful shutdown #60059

merged 35 commits into from
Jan 16, 2024

Conversation

redbmk
Copy link
Contributor

@redbmk redbmk commented Dec 30, 2023

  • Both the standalone server and the startServer function it calls attempt to stop the server on SIGINT and SIGTERM in different ways. This lets server.js yield to startServer
  • The cleanup function in startServer was not waiting for the server to close before calling process.exit. This lets it wait for any in-flight requests to finish processing before exiting the process
  • Sends SIGKILL to the child process in next dev, which should have the same effect of immediately shutting down the server on SIGTERM or SIGINT

fixes: #53661
refs: #59551


Previously #59551 attempted to fix #53661, but had broken some tests in the process. It looks like the final commit was also missing an intended change to utils.ts. This should fix those issues as well as introduce a new set of tests for the graceful shutdown feature.

In the last PR I was squashing and force-pushing updates along the way but it made it difficult to track the changes. This time I'm pushing quite a few commits to make it easier to track the changes and refactors I've made, with the idea that this should be squashed before being merged.

This reverts commit 771705f, which
reverted 2c48b87 because it was
breaking some test cases.

Coming back to this as a starting point to try to fix the tests and add
new tests for the graceful shutdown feature.
- remove immediate process.exit from `next start`
- have handleSessionStop wait for the child to exit
- add a test that verifies long-running api routes are not killed
- send proper signal (e.g. SIGTERM or SIGINT) to child process from dev
  server
- wait for app to shut down in tests, with a short default timeout
- when a SIGINT or SIGTERM is sent to next dev, shutdown the server immediately
- create new test app for graceful-shutdown features
- add some more tests
- shouldn't accept requests while server is shutting down and there's
  no activity
- refactor some tests to use process.kill since it's faster and we
  aren't looking to kill the whole tree. makes killApp simpler too
- moved some js files to ts and fixed types
- fixed standalone server graceful-shutdown
- added new tests for production standalone mode
rather than having an integration test with dev and production modes,
and then a separate test for standalone-mode that copies all the same
tests, we can just follow the same pattern we had in integration but
include the standalone-mode tests in the same file.
@ijjk
Copy link
Member

ijjk commented Dec 30, 2023

Allow CI Workflow Run

  • approve CI run for commit: 4f74aee

Note: this should only be enabled once the PR is ready to go and can only be enabled by a maintainer

@ijjk
Copy link
Member

ijjk commented Dec 30, 2023

Allow CI Workflow Run

  • approve CI run for commit: f3c875d

Note: this should only be enabled once the PR is ready to go and can only be enabled by a maintainer

the tsconfig.json file was causing typescript issues when looking for e2e-utils and next-test-utils
@@ -108,12 +108,6 @@ if (process.env.NODE_ENV) {
;(process.env as any).NODE_ENV = process.env.NODE_ENV || defaultEnv
;(process.env as any).NEXT_RUNTIME = 'nodejs'

// Make sure commands gracefully respect termination signals (e.g. from Docker)
// Allow the graceful termination to be manually configurable
if (!process.env.NEXT_MANUAL_SIG_HANDLE && command !== 'dev') {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing this one is necessary for next dev and next start to gracefully shutdown. The env is still in start-server

@@ -2049,13 +2049,6 @@ const dir = path.join(__dirname)
process.env.NODE_ENV = 'production'
process.chdir(__dirname)
// Make sure commands gracefully respect termination signals (e.g. from Docker)
// Allow the graceful termination to be manually configurable
if (!process.env.NEXT_MANUAL_SIG_HANDLE) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing this one is necessary for standalone mode to gracefully shutdown. The env is still in start-server

exitResolve()
})
if (this.childProcess && !this.isStopping) {
const exitPromise = once(this.childProcess, 'exit')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had started making changes here and turned out they weren't needed, but I thought I'd keep this refactor since once seemed a little cleaner when I learned about it.

// Kill a launched app
export async function killApp(instance: ChildProcess) {
if (instance && instance.pid) {
if (instance?.pid && isAppRunning(instance)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to kill the app if it's already running.

expect(isAppRunning(app)).toBe(true)

// yield event loop to allow server to start the shutdown process
await waitFor(20)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was running into some issues with these tests. Waiting 20 ms seems to be the sweet spot but I'm a little worried there could be race conditions causing this to sometimes fail.

@@ -0,0 +1,6 @@
export const LONG_RUNNING_MS = 400
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had lowered this to 100 but seemed to run into less race conditions bumping it up higher. In practice, 400ms wouldn't really be a "long running" request but should be enough to simulate the issue in tests.

@redbmk
Copy link
Contributor Author

redbmk commented Dec 30, 2023

@huozhi, pinging you since you reviewed the last PR recently. One thing I noticed was that one of the tests that had failed in the previous PR I was able to get to fail randomly on the latest (as of a week or so ago) canary. #59551 (review)

If I ran it locally a few times it would sometimes pass and sometimes fail. Not sure if it's been fixed since then.

The other tests that were failing though I think I've fixed

@redbmk redbmk changed the title Graceful shutdown draft: Graceful shutdown Dec 31, 2023
@redbmk
Copy link
Contributor Author

redbmk commented Dec 31, 2023

I had marked this as a draft because I wanted to run a couple more manual tests (now complete and marked as ready):

  • make sure the previously failing tests from the last PR are passing
  • apply these changes locally and run the dev and prod servers to make sure the behavior feels correct
  • run the new tests a decent number of times to make sure they aren't occasionally failing (eg due to race conditions)

make sure the previously failing tests from the last PR are passing

✅ Tests are now passing locally It looks like there were two failing tests (the same test suite but with and without turbopack):

test turbopack integration (5/5)

  FAIL test/integration/telemetry/test/page-features.test.js (33.361 s)
    page features telemetry
      ✓ detects --turbo correctly for `next dev` (1598 ms)
      ✕ detects --turbo correctly for `next dev` stopped (31496 ms)

test integration (10/12)

  FAIL test/integration/telemetry/test/page-features.test.js (76.873 s)
    page features telemetry
      ✕ detects correctly for `next dev` stopped (no turbo) (33721 ms)
      production mode
        ✓ should detect app page counts (14472 ms)
        ✓ detects reportWebVitals with no _app correctly for `next build` (11362 ms)
        ✓ detect with reportWebVitals correctly for `next build` (8619 ms)
        ✓ detect without reportWebVitals correctly for `next build` (8401 ms)

These are both passing now (at least locally for me):

❯ pnpm testheadless test/integration/telemetry/test/page-features.test.js
...
 PASS  test/integration/telemetry/test/page-features.test.js (63.892 s)
  page features telemetry
    ✓ detects correctly for `next dev` stopped (no turbo) (4125 ms)
    production mode
      ✓ should detect app page counts (19577 ms)
      ✓ detects reportWebVitals with no _app correctly for `next build` (13498 ms)
      ✓ detect with reportWebVitals correctly for `next build` (12775 ms)
      ✓ detect without reportWebVitals correctly for `next build` (13393 ms)
      
❯ TURBOPACK=1 pnpm testheadless test/integration/telemetry/test/page-features.test.js
...
 PASS  test/integration/telemetry/test/page-features.test.js
  page features telemetry
    ✓ detects --turbo correctly for `next dev` (2121 ms)
    ✓ detects --turbo correctly for `next dev` stopped (2207 ms)

There was also one that seemed to be sporadically failing, but running 20 times with hyperfine passed each time:

❯ hyperfine --min-runs 20 'pnpm testheadless test/integration/config-output-export/test/index.test.ts -t "getStaticPaths and fallback false"'
Benchmark 1: pnpm testheadless test/integration/config-output-export/test/index.test.ts -t "getStaticPaths and fallback false"
  Time (mean ± σ):      7.014 s ±  0.744 s    [User: 5.217 s, System: 1.515 s]
  Range (min … max):    6.072 s …  8.697 s    20 runs

apply these changes locally and run the dev and prod servers to make sure the behavior feels correct

⚠️ Working with caveats. See TL;DR below or read full details

next dev

When I hit ctrl-c, the next dev instance shuts down immediately and the server stops running. This is what I would expect. It might be nice to see the server gracefully shut down, but if I hit ctrl-c in a dev server I probably want it to stop right away.

next start ⚠️

I actually see a pretty big variety of behavior depending on the way next start is invoked. The actual process that runs the server is renamed (using process.title) to next-server, but if invoked through a package manager, there end up being two processes (e.g. npm exec next start or bunx next start). In all cases, sending SIGTERM or SIGINT using kill to the next-server process works as expected. However, the behavior is different if sending to the parent process (if there is one).

  • ./node_modules/.bin/next start or node node_modules/.bin/next start
    • ctrl-c: ✅ in-flight requests will continue until they're done, then the service will stop
    • In docker: ✅ in-flight requests will continue until they're done, then the service will stop
  • npx next start ⚠️
    • ctrl-c: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGTERM $parent_pid: ⚠️ no effect on parent or child
    • kill -SIGINT $parent_pid: ⚠️ no effect on parent or child
    • kill -SIGKILL $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGTERM $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGINT $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGKILL $child_pid: ✅ child exits immediately. parent receives signal and also shuts down immediately
    • In docker: ⚠️ no response. container gets SIGKILL after grace period
  • pnpm exec next start ⚠️
    • ctrl-c: ⚠️ In-flight requests continue until they're done, but the parent process dies immediately, dropping you into the shell
    • kill -SIGTERM $parent_pid: ⚠️ In-flight requests continue until they're done, but the parent process dies immediately
    • kill -SIGINT $parent_pid: ⚠️ In-flight requests continue until they're done, but the parent process dies immediately
    • kill -SIGKILL $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGTERM $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGINT $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGKILL $child_pid: ✅ child exits immediately. parent receives signal and also shuts down immediately
    • In docker: ✅ in-flight requests will continue until they're done, then the service will stop
  • yarn exec next start
    • ctrl-c: ⚠️ In-flight requests continue until they're done, but the parent process dies immediately, dropping you into the shell
    • kill -SIGTERM $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGINT $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGKILL $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGTERM $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGINT $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGKILL $child_pid: ✅ child exits immediately. parent receives signal and also shuts down immediately
    • In docker: ⚠️ container shuts down immediately
  • bunx next start ⚠️
    • ctrl-c: ⚠️ In-flight requests continue until they're done, but the parent process dies immediately, dropping you into the shell
    • kill -SIGTERM $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGINT $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGKILL $parent_pid: ⚠️⚠️ parent exits immediately and next-server gets left behind as a zombie
    • kill -SIGTERM $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGINT $child_pid: ✅ in-flight requests will continue until they're done, then the service and parent will stop
    • kill -SIGKILL $child_pid: ✅ child exits immediately. parent receives signal and also shuts down immediately
    • In docker: ⚠️ no response. container gets SIGKILL after grace period

To summarize, if SIGINT or SIGTERM is sent to next-server, everything behaves as expected in all situations. When using a package manager, results vary when sending a signal to the parent/package mananger (e.g. bun, pnpm, npm, yarn), but it seems like that's more-or-less expected. If that behavior isn't expected, it seems more like an issue with the package manager than with next, especially since they all behave slightly differently.

In Docker, the only versions that behave as expected were using pnpm (which is surprising because it didn't behave as intended outside of Docker when sending SIGTERM), or running the next bin directly.

Standalone mode ✅

I've tested this in Docker both in docker-compose and kubernetes, which both send the SIGTERM signal. If I fire off a long-running request before sending the SIGTERM, the request will finish before the container exits in both environments. In docker-compose, it seems that new fetch requests from an already-loaded page will still go through, but refreshing the page or opening in another tab will not work. In kubernetes, no new requests will be accepted during that time. I suspect that the difference here is HTTP 1.1 and the Keep-Alive connection, vs HTTP 2.


run the new tests a decent number of times to make sure they aren't occasionally failing (eg due to race conditions)

✅ Ran tests 20 times with no errors ran 20 tests with `hyperfine` and didn't have any errors. not exactly fast because it has to artificially wait and also spin up a new server for each test
❯ hyperfine --min-runs=20 "pnpm testheadless test/production/graceful-shutdown"
Benchmark 1: pnpm testheadless test/production/graceful-shutdown
  Time (mean ± σ):     73.672 s ±  6.789 s    [User: 49.236 s, System: 18.610 s]
  Range (min … max):   60.592 s … 82.536 s    20 runs

TL;DR:

  • The new tests and previously failing tests are all passing locally.
  • Both next dev and Standalone modes work as intended.
  • Using next start works as intended when executing the bin directly (./node_modules/.bin/next start or node node_modules/.bin/next start)
  • The recommendation in Docker should probably be to use Standalone mode (this is consistent with most the examples I've seen)
  • ⚠️ Using next start via a package manager should be thought of as advanced usage and YMMV (at least in regards to graceful shutdowns). This is probably most often used locally to test a production build and most users won't notice any issues with this.

@redbmk redbmk marked this pull request as draft December 31, 2023 00:57
@huozhi
Copy link
Member

huozhi commented Jan 7, 2024

Previous was flaky, pnpm test test/integration/api-support/test/index.test.js seems failing after re-run the CI

@@ -279,11 +278,11 @@ export async function startServer(
// This is the render worker, we keep the process alive
console.error(err)
}
process.on('exit', (code) => cleanup(code))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we don't need to handle 'exit' anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exit event is sent right before the process exits and it can only execute synchronous code. I suppose we could capture it if we wanted to force the exit code to be 0, but I was thinking letting it send the code it got made sense.
https://nodejs.org/api/process.html#event-exit

There is no way to prevent the exiting of the event loop at this point, and once all 'exit' listeners have finished running the Node.js process will terminate.

Listener functions must only perform synchronous operations. The Node.js process will exit immediately after calling the 'exit' event listeners causing any additional work still queued in the event loop to be abandoned.

Since server.close is asychronous, I don't think calling it on the 'exit' event would do anything useful

https://nodejs.org/api/net.html#serverclosecallback

This function is asynchronous

I could see there maybe being a couple things that server.close does synchronously, but it definitely would exit without calling the callback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also just realized it was already using the original exit code and just defaulting to 0, so the behavior should still be the same without listening for the event

@redbmk
Copy link
Contributor Author

redbmk commented Jan 9, 2024

Previous was flaky, pnpm test test/integration/api-support/test/index.test.js seems failing after re-run the CI

I'm starting to think we have the killApp test util send a SIGKILL by default just to ensure the app is killed right away, and then if any tests want to test another signal like SIGINT (e.g. the new tests or some of the CLI tests) they can just pass that as an extra param to killApp. Kinda feels like there are some cases where it doesn't quite catch the exit signal for some reason and then it times out.

Also given how many times these tests have failed even though it's not the same ones every time, I'm kinda thinking once they all pass we wait a bit and merge canary into this a few times and make sure they pass every time, just to make sure all the tests passing isn't some fluke. Don't really want to merge this into canary only to have it reverted again 🙈

@redbmk
Copy link
Contributor Author

redbmk commented Jan 14, 2024

I'm trying to figure out why test prod (3/5) / build would have failed in e49fb6d, since it seems unrelated to the stuff I did.

When I re-ran the failed test locally, both in my branch and on `canary`, it fails the same way both times.
test/production/prerender-prefetch/index.test.ts

❯ pnpm testheadless test/production/prerender-prefetch/index.test.ts -t "unstable_skip"

 FAIL  test/production/prerender-prefetch/index.test.ts (133.785 s)
  Prerender prefetch
    with optimisticClientCache enabled
      ✓ should update cache using prefetch with unstable_skipClientCache (3556 ms)
      ✓ should update cache using router.push with unstable_skipClientCache (2268 ms)
      ○ skipped should not revalidate during prefetching
      ○ skipped should trigger revalidation after navigation
      ○ skipped should attempt cache update on link hover/touch start
      ○ skipped should handle failed data fetch and empty cache correctly
    with optimisticClientCache disabled
      ✓ should update cache using prefetch with unstable_skipClientCache (2391 ms)
      ✕ should update cache using router.push with unstable_skipClientCache (2276 ms)
      ○ skipped should not revalidate during prefetching
      ○ skipped should trigger revalidation after navigation
      ○ skipped should not attempt client cache update on link hover/touch start
      ○ skipped should handle failed data fetch and empty cache correctly

  ● Prerender prefetch › with optimisticClientCache disabled › should update cache using router.push with unstable_skipClientCache

    expect(received).not.toBe(expected) // Object.is equality

    Expected: not 1705179893593

      154 |         await browser.elementByCss('#props').text()
      155 |       ).now
    > 156 |       expect(newTime).not.toBe(startTime)
          |                           ^
      157 |       expect(isNaN(newTime)).toBe(false)
      158 |     })
      159 |

      at Object.toBe (production/prerender-prefetch/index.test.ts:156:27)

Test Suites: 1 failed, 1 total
Tests:       1 failed, 8 skipped, 3 passed, 12 total
Snapshots:   0 total
Time:        133.827 s, estimated 158 s
Ran all test suites matching /test\/production\/prerender-prefetch\/index.test.ts/i with tests matching "unstable_skip".

Is this a known flaky test? It looks like when I updated to the latest canary the tests are now passing, but that only included two extra commits.

I tried running that test locally on the latest canary 10 times in a row and of those 10, should update cache using router.push with unstable_skipClientCache failed every time with optimisticClientCache disabled, and 8/10 times for the tests where it's enabled.

@redbmk
Copy link
Contributor Author

redbmk commented Jan 14, 2024

I'm not sure this PR would affect that code, but I just want to be sure I'm not breaking anything. Once bitten, twice shy 😂

I'll try keeping this PR up to date with canary as new commits come in to see if any other failures start cropping up.

@redbmk
Copy link
Contributor Author

redbmk commented Jan 15, 2024

@huozhi what do you make of these tests suites? seems like the PR just goes back and forth between passing and failing whenever I pull in more commits from canary

they seem to be different tests every time too.

The latest one even includes one from the new tests I wrote, but I can't get it to fail running locally, and it hasn't failed in any of these previous attempts. Maybe there's something about the way it runs the tests concurrently in CI?

Copy link
Member

@huozhi huozhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@huozhi huozhi merged commit d08b3ff into vercel:canary Jan 16, 2024
66 checks passed
@redbmk redbmk deleted the graceful-shutdown branch January 16, 2024 21:10
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 31, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CI approved Approve running CI for fork locked type: next
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Close http server in cleanup
3 participants