-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
graceful shutdown #60059
graceful shutdown #60059
Conversation
- remove immediate process.exit from `next start` - have handleSessionStop wait for the child to exit - add a test that verifies long-running api routes are not killed - send proper signal (e.g. SIGTERM or SIGINT) to child process from dev server - wait for app to shut down in tests, with a short default timeout
- when a SIGINT or SIGTERM is sent to next dev, shutdown the server immediately - create new test app for graceful-shutdown features - add some more tests
- shouldn't accept requests while server is shutting down and there's no activity - refactor some tests to use process.kill since it's faster and we aren't looking to kill the whole tree. makes killApp simpler too
- moved some js files to ts and fixed types - fixed standalone server graceful-shutdown - added new tests for production standalone mode
rather than having an integration test with dev and production modes, and then a separate test for standalone-mode that copies all the same tests, we can just follow the same pattern we had in integration but include the standalone-mode tests in the same file.
Allow CI Workflow Run
Note: this should only be enabled once the PR is ready to go and can only be enabled by a maintainer |
Allow CI Workflow Run
Note: this should only be enabled once the PR is ready to go and can only be enabled by a maintainer |
the tsconfig.json file was causing typescript issues when looking for e2e-utils and next-test-utils
@@ -108,12 +108,6 @@ if (process.env.NODE_ENV) { | |||
;(process.env as any).NODE_ENV = process.env.NODE_ENV || defaultEnv | |||
;(process.env as any).NEXT_RUNTIME = 'nodejs' | |||
|
|||
// Make sure commands gracefully respect termination signals (e.g. from Docker) | |||
// Allow the graceful termination to be manually configurable | |||
if (!process.env.NEXT_MANUAL_SIG_HANDLE && command !== 'dev') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this one is necessary for next dev
and next start
to gracefully shutdown. The env is still in start-server
@@ -2049,13 +2049,6 @@ const dir = path.join(__dirname) | |||
process.env.NODE_ENV = 'production' | |||
process.chdir(__dirname) | |||
// Make sure commands gracefully respect termination signals (e.g. from Docker) | |||
// Allow the graceful termination to be manually configurable | |||
if (!process.env.NEXT_MANUAL_SIG_HANDLE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this one is necessary for standalone mode to gracefully shutdown. The env is still in start-server
exitResolve() | ||
}) | ||
if (this.childProcess && !this.isStopping) { | ||
const exitPromise = once(this.childProcess, 'exit') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had started making changes here and turned out they weren't needed, but I thought I'd keep this refactor since once
seemed a little cleaner when I learned about it.
test/lib/next-test-utils.ts
Outdated
// Kill a launched app | ||
export async function killApp(instance: ChildProcess) { | ||
if (instance && instance.pid) { | ||
if (instance?.pid && isAppRunning(instance)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to kill the app if it's already running.
expect(isAppRunning(app)).toBe(true) | ||
|
||
// yield event loop to allow server to start the shutdown process | ||
await waitFor(20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was running into some issues with these tests. Waiting 20 ms seems to be the sweet spot but I'm a little worried there could be race conditions causing this to sometimes fail.
@@ -0,0 +1,6 @@ | |||
export const LONG_RUNNING_MS = 400 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had lowered this to 100
but seemed to run into less race conditions bumping it up higher. In practice, 400ms
wouldn't really be a "long running" request but should be enough to simulate the issue in tests.
@huozhi, pinging you since you reviewed the last PR recently. One thing I noticed was that one of the tests that had failed in the previous PR I was able to get to fail randomly on the latest (as of a week or so ago) If I ran it locally a few times it would sometimes pass and sometimes fail. Not sure if it's been fixed since then. The other tests that were failing though I think I've fixed |
I had marked this as a draft because I wanted to run a couple more manual tests (now complete and marked as ready):
✅ Tests are now passing locallyIt looks like there were two failing tests (the same test suite but with and without turbopack):test turbopack integration (5/5)
These are both passing now (at least locally for me):
There was also one that seemed to be sporadically failing, but running 20 times with
|
Previous was flaky, |
@@ -279,11 +278,11 @@ export async function startServer( | |||
// This is the render worker, we keep the process alive | |||
console.error(err) | |||
} | |||
process.on('exit', (code) => cleanup(code)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we don't need to handle 'exit'
anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exit
event is sent right before the process exits and it can only execute synchronous code. I suppose we could capture it if we wanted to force the exit code to be 0
, but I was thinking letting it send the code it got made sense.
https://nodejs.org/api/process.html#event-exit
There is no way to prevent the exiting of the event loop at this point, and once all
'exit'
listeners have finished running the Node.js process will terminate.
Listener functions must only perform synchronous operations. The Node.js process will exit immediately after calling the
'exit'
event listeners causing any additional work still queued in the event loop to be abandoned.
Since server.close
is asychronous, I don't think calling it on the 'exit'
event would do anything useful
https://nodejs.org/api/net.html#serverclosecallback
This function is asynchronous
I could see there maybe being a couple things that server.close
does synchronously, but it definitely would exit without calling the callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also just realized it was already using the original exit code and just defaulting to 0, so the behavior should still be the same without listening for the event
I'm starting to think we have the Also given how many times these tests have failed even though it's not the same ones every time, I'm kinda thinking once they all pass we wait a bit and merge |
I'm trying to figure out why test prod (3/5) / build would have failed in e49fb6d, since it seems unrelated to the stuff I did. When I re-ran the failed test locally, both in my branch and on `canary`, it fails the same way both times.
Is this a known flaky test? It looks like when I updated to the latest canary the tests are now passing, but that only included two extra commits. I tried running that test locally on the latest |
I'm not sure this PR would affect that code, but I just want to be sure I'm not breaking anything. Once bitten, twice shy 😂 I'll try keeping this PR up to date with |
@huozhi what do you make of these tests suites? seems like the PR just goes back and forth between passing and failing whenever I pull in more commits from canary they seem to be different tests every time too. The latest one even includes one from the new tests I wrote, but I can't get it to fail running locally, and it hasn't failed in any of these previous attempts. Maybe there's something about the way it runs the tests concurrently in CI? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
startServer
function it calls attempt to stop the server onSIGINT
andSIGTERM
in different ways. This letsserver.js
yield tostartServer
startServer
was not waiting for the server to close before callingprocess.exit
. This lets it wait for any in-flight requests to finish processing before exiting the processSIGKILL
to the child process innext dev
, which should have the same effect of immediately shutting down the server onSIGTERM
orSIGINT
fixes: #53661
refs: #59551
Previously #59551 attempted to fix #53661, but had broken some tests in the process. It looks like the final commit was also missing an intended change to
utils.ts
. This should fix those issues as well as introduce a new set of tests for the graceful shutdown feature.In the last PR I was squashing and force-pushing updates along the way but it made it difficult to track the changes. This time I'm pushing quite a few commits to make it easier to track the changes and refactors I've made, with the idea that this should be squashed before being merged.