
Distinguish wait-shutdown command from standard k8s SIGTERM #6287

Open
rittneje opened this issue Oct 6, 2020 · 32 comments · May be fixed by #12397
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@rittneje

rittneje commented Oct 6, 2020

As of v0.26.0, we can specify /wait-shutdown as a pre-stop hook in our deployment spec. As currently implemented, it sends SIGTERM to the nginx-ingress-controller process.

err := exec.Command("bash", "-c", "pkill -SIGTERM -f nginx-ingress-controller").Run()

Since SIGTERM is also what k8s sends when shutting down a pod, there is no way to tell from the logs whether the pre-stop hook was properly configured. For this reason, it would be better if wait-shutdown used a different signal.
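
For illustration only, here is a minimal Go sketch of the different-signal idea. The choice of SIGQUIT and the handler shape below are assumptions for the sake of example, not the controller's current code:

// Hypothetical sketch: wait-shutdown sends SIGQUIT instead of SIGTERM, so the
// controller can log which path triggered the shutdown.
package main

import (
	"os"
	"os/signal"
	"syscall"

	"k8s.io/klog/v2"
)

func handleShutdownSignals(stop func()) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM, syscall.SIGQUIT)

	switch <-signalChan {
	case syscall.SIGQUIT:
		// In this sketch, SIGQUIT is only ever sent by the /wait-shutdown pre-stop hook.
		klog.InfoS("Received SIGQUIT from pre-stop hook, shutting down")
	default:
		// SIGTERM comes straight from the kubelet when the pod is terminated.
		klog.InfoS("Received SIGTERM from kubelet, shutting down")
	}
	stop()
}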

/kind feature

@rittneje rittneje added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 6, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@rittneje
Author

rittneje commented Jan 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2021
@rittneje
Author

rittneje commented Apr 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2021
@rittneje
Author

rittneje commented Jul 3, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2021
@strongjz
Member

If we add logging about the shutdown, would that be enough?

@iamNoah1
Contributor

@rittneje reminder

@rittneje
Author

@strongjz Yes, assuming those logs would be sent to the main container's stderr/stdout.
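
As a rough illustration of the logging idea (a sketch only: the message wording is assumed, and whether output from the pre-stop hook binary actually reaches the container's stdout/stderr is exactly the caveat above):

// Hypothetical sketch: emit a distinctive log line before wait-shutdown
// signals the controller, so the pre-stop path can be identified in the logs.
package main

import (
	"os"
	"os/exec"

	"k8s.io/klog/v2"
)

func waitShutdown() {
	// This line (wording assumed) would only appear when the pre-stop hook ran,
	// distinguishing it from a plain kubelet SIGTERM.
	klog.InfoS("Shutdown requested via /wait-shutdown pre-stop hook")

	if err := exec.Command("bash", "-c", "pkill -SIGTERM -f nginx-ingress-controller").Run(); err != nil {
		klog.ErrorS(err, "terminating ingress controller")
		os.Exit(1)
	}
}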

@iamNoah1
Contributor

/triage accepted
/lifecycle active
/priority backlog

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Aug 26, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Nov 24, 2021
@iamNoah1
Contributor

/lifecycle active
/good-first-issue

@k8s-ci-robot
Contributor

@iamNoah1:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/lifecycle active
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 24, 2021
@deepakmega

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Apr 18, 2022
@rittneje
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2022
@rittneje
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2022
@rittneje
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2022
@loveRhythm1990
Contributor

@deepakmega This is why I had originally suggested using a different signal. :)

Also one thing I am currently confused about is what the point of /wait-shutdown is. All it does is send SIGTERM to nginx-ingress-controller, and then wait for nginx to stop.

err := exec.Command("bash", "-c", "pkill -SIGTERM -f nginx-ingress-controller").Run()
if err != nil {
	klog.ErrorS(err, "terminating ingress controller")
	os.Exit(1)
}
// wait for the NGINX process to terminate
timer := time.NewTicker(time.Second * 1)
for range timer.C {
	if !nginx.IsRunning() {
		timer.Stop()
		break
	}
}

But nginx-ingress-controller will also wait for nginx to stop before it exits.

// wait for the NGINX process to terminate
timer := time.NewTicker(time.Second * 1)
for range timer.C {
	if !nginx.IsRunning() {
		klog.InfoS("NGINX process has stopped")
		timer.Stop()
		break
	}
}

Consequently, it seems that just letting k8s terminate the pod normally (without any pre-stop hook) will have the same effect, in which case, why was /wait-shutdown added? The PR that introduced it (#4487) doesn't explain.

I am also confused by the wait-shutdown preStop hook and don't see its point. During a normal exit, the container also receives SIGTERM and does the same thing; wait-shutdown just makes things more complicated.
Should we delete this preStop hook?

@gmiam

gmiam commented Mar 8, 2023

Hi 👋

OK, maybe I'm completely wrong here, so sorry in advance if that's the case.

I was investigating a potential problem we have (a service receiving some 504 errors, apparently linked to an ingress-controller downscaling event in another cluster), so I was looking at how nginx handles SIGTERM. Apparently it handles it like a SIGKILL (a fast shutdown) and uses SIGQUIT for graceful shutdown.

At the end of 2020 the nginx project changed all their Docker images to use SIGQUIT instead of SIGTERM as the stop signal (see commit).
I'm wondering whether this wait-shutdown command is meant to handle this in the k8s context, but maybe there was some confusion at some point and it ended up using SIGTERM as well.

Is a graceful shutdown the goal for the nginx controller, as Kubernetes intends by default?

@hanikesn

wait-shutdown just makes things more complicated.
Should we delete this preStop hook?

We had to disable the preStop hook to get proper shutdown behavior with long-running gRPC traffic, and we have been running successfully for months.

@Ghilteras

Ghilteras commented Jul 13, 2023

We have the same issue here: with this wait-shutdown script, nginx gets killed right away while the pod is still getting traffic from upstream, and we cause thousands of gRPC errors.

cc @iamNoah1 @rittneje

@nvtkaszpir
Member

nvtkaszpir commented Jul 13, 2023

(Related to graceful shutdown and not the signal processing; sorry for the long post with updates.)

I guess the easiest way is to use --shutdown-grace-period 10

shutdownGracePeriod = flags.Int("shutdown-grace-period", 0, "Seconds to wait after receiving the shutdown signal, before stopping the nginx process.")
or a longer value plus a long terminationGracePeriod, so that the endpoint is unregistered from k8s (and thus stops getting new connections) and existing gRPC connections are gracefully closed.

Other options:
I guess it should be something more like:

sleep 10; nginx -s quit

We had that as a shell command for months with plain nginx without any issues.
The sleep at the beginning ensures the endpoint is unregistered from the k8s Service so it will not get new traffic, and then graceful termination of the active connections can proceed.

So wait-shutdown could maybe just send pkill -SIGQUIT -f nginx-ingress-controller?
Or just add a sleep before the pkill (but see the top of this comment for a better alternative).

Also, nginx has supported graceful quit for over 3 years now.

Also see #6928
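
For illustration only, a minimal Go sketch of the "sleep, then graceful quit" idea from this comment (the 10-second delay, the target process, and the use of SIGQUIT are assumptions taken from the suggestion above, not the controller's actual behavior):

// Hypothetical variant of wait-shutdown: wait for endpoint deregistration,
// then request a graceful quit. Whether the controller handles SIGQUIT at all
// is exactly the open question in this thread.
package main

import (
	"os"
	"os/exec"
	"time"

	"k8s.io/klog/v2"
)

func waitShutdownGraceful() {
	// Give the Service / load balancer time to stop sending new traffic here.
	time.Sleep(10 * time.Second)

	if err := exec.Command("bash", "-c", "pkill -SIGQUIT -f nginx-ingress-controller").Run(); err != nil {
		klog.ErrorS(err, "signaling graceful shutdown")
		os.Exit(1)
	}
}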

@rittneje
Author

rittneje commented Jul 14, 2023

@nvtkaszpir Please note that SIGTERM is only sent to nginx-ingress-controller (i.e., pid 1), not to nginx. It handles this signal by first sleeping for shutdown-grace-period seconds (default 0), invoking nginx -s quit and then sleeping for post-shutdown-grace-period seconds (default 10).

func HandleSigterm(ngx ProcessController, delay int, exit exiter) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM)
	<-signalChan
	klog.InfoS("Received SIGTERM, shutting down")
	exitCode := 0
	if err := ngx.Stop(); err != nil {
		klog.Warningf("Error during shutdown: %v", err)
		exitCode = 1
	}
	klog.Infof("Handled quit, delaying controller exit for %d seconds", delay)
	time.Sleep(time.Duration(delay) * time.Second)
	klog.InfoS("Exiting", "code", exitCode)
	exit(exitCode)
}

func (n *NGINXController) Stop() error {
	n.isShuttingDown = true
	n.stopLock.Lock()
	defer n.stopLock.Unlock()
	if n.syncQueue.IsShuttingDown() {
		return fmt.Errorf("shutdown already in progress")
	}
	time.Sleep(time.Duration(n.cfg.ShutdownGracePeriod) * time.Second)
	klog.InfoS("Shutting down controller queues")
	close(n.stopCh)
	go n.syncQueue.Shutdown()
	if n.syncStatus != nil {
		n.syncStatus.Shutdown()
	}
	if n.validationWebhookServer != nil {
		klog.InfoS("Stopping admission controller")
		err := n.validationWebhookServer.Close()
		if err != nil {
			return err
		}
	}
	// send stop signal to NGINX
	klog.InfoS("Stopping NGINX process")
	cmd := n.command.ExecCommand("-s", "quit")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	if err != nil {
		return err
	}
	// wait for the NGINX process to terminate
	timer := time.NewTicker(time.Second * 1)
	for range timer.C {
		if !nginx.IsRunning() {
			klog.InfoS("NGINX process has stopped")
			timer.Stop()
			break
		}
	}
	return nil
}

In practice, 0 is a bad default value for shutdown-grace-period. On AWS we have it set to 3 minutes because it takes that long for the pod to get de-registered from the NLB. I'm not sure what the point of post-shutdown-grace-period is (see #8095).

@nvtkaszpir
Member

You're right, SIGTERM goes to the controller, which then calls quit in nginx:

cmd := n.command.ExecCommand("-s", "quit")

@Ghilteras

I just tested a --shutdown-grace-period of 240s and a --post-shutdown-grace-period of 30s, and I actually get many more errors on scale-down events compared to just using a sleep preStop, so we are clearly missing something here. Currently there is no way for us to make the nginx ingress controller deployment stop gracefully when clients have long-lived gRPC connections: they get a bunch of these errors, and the ingress will not sever the connection to force the clients over to a non-Terminating pod instead.

rpc error: code = Unavailable desc = error reading from server: EOF 
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:9099: connect: connection refused"  
rpc error: code = Unavailable desc = error reading from server: unexpected EOF  
rpc error: code = Unavailable desc = the connection is draining 
rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR
rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: unexpected EOF", received prior goaway: code: NO_ERROR
rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: EOF"
rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read tcp 100.112.59.120:42704->10.28.7.194:443: read: connection reset by peer"
rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: read tcp 127.0.0.1:42956->127.0.0.1:9099: read: connection reset by peer"  
rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: read tcp 127.0.0.1:54274->127.0.0.1:9099: read: connection reset by peer"

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 17, 2024
motoki317 added a commit to motoki317/ingress-nginx that referenced this issue Nov 21, 2024
The /wait-shutdown preStop script's only job is to send SIGTERM to nginx-ingress-controller,
which is PID 1, so the behavior is the same with or without it in Kubernetes environments.

See kubernetes#6287 for discussion.
@motoki317 motoki317 linked a pull request Nov 21, 2024 that will close this issue