
Distinguish wait-shutdown command from standard k8s SIGTERM #6287

Open
rittneje opened this issue Oct 6, 2020 · 32 comments · May be fixed by #12397
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@rittneje

rittneje commented Oct 6, 2020

As of v0.26.0, we can specify /wait-shutdown as a pre-stop hook in our deployment spec. As currently implemented, it sends SIGTERM to the nginx-ingress-controller process.

err := exec.Command("bash", "-c", "pkill -SIGTERM -f nginx-ingress-controller").Run()

Since SIGTERM is also what k8s sends when shutting down a pod, there is no way to tell from the logs whether the pre-stop hook was properly configured. For this reason, it would be better if wait-shutdown used a different signal.
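
For illustration only, here is a minimal Go sketch of the different-signal idea. The choice of SIGQUIT and the handler shape below are assumptions for the sake of example, not the controller's current code:

// Hypothetical sketch: wait-shutdown sends SIGQUIT instead of SIGTERM, so the
// controller can log which path triggered the shutdown.
package main

import (
	"os"
	"os/signal"
	"syscall"

	"k8s.io/klog/v2"
)

func handleShutdownSignals(stop func()) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM, syscall.SIGQUIT)

	switch <-signalChan {
	case syscall.SIGQUIT:
		// In this sketch, SIGQUIT is only ever sent by the /wait-shutdown pre-stop hook.
		klog.InfoS("Received SIGQUIT from pre-stop hook, shutting down")
	default:
		// SIGTERM comes straight from the kubelet when the pod is terminated.
		klog.InfoS("Received SIGTERM from kubelet, shutting down")
	}
	stop()
}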

/kind feature

@rittneje rittneje added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 6, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@rittneje
Author

rittneje commented Jan 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2021
@rittneje
Author

rittneje commented Apr 4, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2021
@rittneje
Author

rittneje commented Jul 3, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2021
@strongjz
Member

If we add logging about the shutdown, would that be enough?

@iamNoah1
Contributor

@rittneje reminder

@rittneje
Author

@strongjz Yes, assuming those logs would be sent to the main container's stderr/stdout.
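
As a rough illustration of the logging idea (a sketch only: the message wording is assumed, and whether output from the pre-stop hook binary actually reaches the container's stdout/stderr is exactly the caveat above):

// Hypothetical sketch: emit a distinctive log line before wait-shutdown
// signals the controller, so the pre-stop path can be identified in the logs.
package main

import (
	"os"
	"os/exec"

	"k8s.io/klog/v2"
)

func waitShutdown() {
	// This line (wording assumed) would only appear when the pre-stop hook ran,
	// distinguishing it from a plain kubelet SIGTERM.
	klog.InfoS("Shutdown requested via /wait-shutdown pre-stop hook")

	if err := exec.Command("bash", "-c", "pkill -SIGTERM -f nginx-ingress-controller").Run(); err != nil {
		klog.ErrorS(err, "terminating ingress controller")
		os.Exit(1)
	}
}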

@iamNoah1
Contributor

/triage accepted
/lifecycle active
/priority backlog

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Aug 26, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Nov 24, 2021
@iamNoah1
Contributor

/lifecycle active
/good-first-issue

@k8s-ci-robot
Contributor

@iamNoah1:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/lifecycle active
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 24, 2021
@deepakmega

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Apr 18, 2022
@rittneje
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2022
@rittneje
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2022
@rittneje
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2022
@loveRhythm1990
Contributor

@deepakmega This is why I had originally suggested using a different signal. :)

Also one thing I am currently confused about is what the point of /wait-shutdown is. All it does is send SIGTERM to nginx-ingress-controller, and then wait for nginx to stop.

err := exec.Command("bash", "-c", "pkill -SIGTERM -f nginx-ingress-controller").Run()
if err != nil {
	klog.ErrorS(err, "terminating ingress controller")
	os.Exit(1)
}
// wait for the NGINX process to terminate
timer := time.NewTicker(time.Second * 1)
for range timer.C {
	if !nginx.IsRunning() {
		timer.Stop()
		break
	}
}

But nginx-ingress-controller will also wait for nginx to stop before it exits.

// wait for the NGINX process to terminate
timer := time.NewTicker(time.Second * 1)
for range timer.C {
	if !nginx.IsRunning() {
		klog.InfoS("NGINX process has stopped")
		timer.Stop()
		break
	}
}

Consequently, it seems that just letting k8s terminate the pod normally (without any pre-stop hook) will have the same effect, in which case, why was /wait-shutdown added? The PR that introduced it (#4487) doesn't explain.

I am also confused by the wait-shutdown preStop hook and don't see its point. During a normal exit, the container also receives SIGTERM and does the same thing; wait-shutdown just makes things more complicated.
Should we delete this preStop hook?

@gmiam

gmiam commented Mar 8, 2023

Hi 👋

OK, maybe I'm completely wrong here, so sorry in advance if that's the case.

I was investigating a potential problem we have (a service receiving some 504 errors, apparently linked to an ingress-controller downscaling event in another cluster), so I was looking at how nginx handles SIGTERM. Apparently it handles it like a SIGKILL (a fast shutdown) and uses SIGQUIT for graceful shutdown.

At the end of 2020 the nginx project changed all their Docker images to use SIGQUIT instead of SIGTERM as the stop signal (see commit).
I'm wondering whether this wait-shutdown command is meant to handle this in the k8s context, but maybe there was some confusion at some point and it ended up using SIGTERM as well.

Is a graceful shutdown the goal for the nginx controller, as Kubernetes intends by default?

@hanikesn

wait-shutdown just makes things more complicated.
Should we delete this preStop hook?

We had to disable the preStop hook to get proper shutdown behavior with long-running gRPC traffic, and we have been running successfully for months.

@Ghilteras

Ghilteras commented Jul 13, 2023

We have the same issue here: with this wait-shutdown script, nginx gets killed right away while the pod is still getting traffic from upstream, and we cause thousands of gRPC errors.

cc @iamNoah1 @rittneje

@nvtkaszpir
Member

nvtkaszpir commented Jul 13, 2023

(Related to graceful shutdown and not the signal processing; sorry for the long post with updates.)

I guess the easiest way is to use --shutdown-grace-period 10

shutdownGracePeriod = flags.Int("shutdown-grace-period", 0, "Seconds to wait after receiving the shutdown signal, before stopping the nginx process.")
or a longer value plus a long terminationGracePeriod, so that the endpoint is unregistered from k8s (and thus stops getting new connections) and existing gRPC connections are gracefully closed.

Other options:
I guess it should be something more like:

sleep 10; nginx -s quit

We had that as a shell command for months with plain nginx without any issues.
The sleep at the beginning ensures the endpoint is unregistered from the k8s Service so it will not get new traffic, and then graceful termination of the active connections can proceed.

So wait-shutdown could maybe just send pkill -SIGQUIT -f nginx-ingress-controller?
Or just add a sleep before the pkill (but see the top of this comment for a better alternative).

Also, nginx has supported graceful quit for over 3 years now.

Also see #6928
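
For illustration only, a minimal Go sketch of the "sleep, then graceful quit" idea from this comment (the 10-second delay, the target process, and the use of SIGQUIT are assumptions taken from the suggestion above, not the controller's actual behavior):

// Hypothetical variant of wait-shutdown: wait for endpoint deregistration,
// then request a graceful quit. Whether the controller handles SIGQUIT at all
// is exactly the open question in this thread.
package main

import (
	"os"
	"os/exec"
	"time"

	"k8s.io/klog/v2"
)

func waitShutdownGraceful() {
	// Give the Service / load balancer time to stop sending new traffic here.
	time.Sleep(10 * time.Second)

	if err := exec.Command("bash", "-c", "pkill -SIGQUIT -f nginx-ingress-controller").Run(); err != nil {
		klog.ErrorS(err, "signaling graceful shutdown")
		os.Exit(1)
	}
}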

@rittneje
Author

rittneje commented Jul 14, 2023

@nvtkaszpir Please note that SIGTERM is only sent to nginx-ingress-controller (i.e., pid 1), not to nginx. It handles this signal by first sleeping for shutdown-grace-period seconds (default 0), invoking nginx -s quit and then sleeping for post-shutdown-grace-period seconds (default 10).

func HandleSigterm(ngx ProcessController, delay int, exit exiter) {
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGTERM)
	<-signalChan
	klog.InfoS("Received SIGTERM, shutting down")
	exitCode := 0
	if err := ngx.Stop(); err != nil {
		klog.Warningf("Error during shutdown: %v", err)
		exitCode = 1
	}
	klog.Infof("Handled quit, delaying controller exit for %d seconds", delay)
	time.Sleep(time.Duration(delay) * time.Second)
	klog.InfoS("Exiting", "code", exitCode)
	exit(exitCode)
}

func (n *NGINXController) Stop() error {
	n.isShuttingDown = true
	n.stopLock.Lock()
	defer n.stopLock.Unlock()
	if n.syncQueue.IsShuttingDown() {
		return fmt.Errorf("shutdown already in progress")
	}
	time.Sleep(time.Duration(n.cfg.ShutdownGracePeriod) * time.Second)
	klog.InfoS("Shutting down controller queues")
	close(n.stopCh)
	go n.syncQueue.Shutdown()
	if n.syncStatus != nil {
		n.syncStatus.Shutdown()
	}
	if n.validationWebhookServer != nil {
		klog.InfoS("Stopping admission controller")
		err := n.validationWebhookServer.Close()
		if err != nil {
			return err
		}
	}
	// send stop signal to NGINX
	klog.InfoS("Stopping NGINX process")
	cmd := n.command.ExecCommand("-s", "quit")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	if err != nil {
		return err
	}
	// wait for the NGINX process to terminate
	timer := time.NewTicker(time.Second * 1)
	for range timer.C {
		if !nginx.IsRunning() {
			klog.InfoS("NGINX process has stopped")
			timer.Stop()
			break
		}
	}
	return nil
}

In practice, 0 is a bad default value for shutdown-grace-period. On AWS we have it set to 3 minutes because it takes that long for the pod to get de-registered from the NLB. I'm not sure what the point of post-shutdown-grace-period is (see #8095).

@nvtkaszpir
Member

You're right, SIGTERM goes to the controller, which then calls quit in nginx:

cmd := n.command.ExecCommand("-s", "quit")

@Ghilteras

I just tested a --shutdown-grace-period of 240s and a --post-shutdown-grace-period of 30s, and I actually get many more errors on scale-down events compared to just using a sleep preStop, so we are clearly missing something here. Currently there is no way for us to make the nginx ingress controller deployment stop gracefully when clients have long-lived gRPC connections: they get a bunch of these errors, and the ingress will not sever the connection to force the clients over to a non-Terminating pod instead.

rpc error: code = Unavailable desc = error reading from server: EOF 
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:9099: connect: connection refused"  
rpc error: code = Unavailable desc = error reading from server: unexpected EOF  
rpc error: code = Unavailable desc = the connection is draining 
rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR
rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: unexpected EOF", received prior goaway: code: NO_ERROR
rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: EOF"
rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: read tcp 100.112.59.120:42704->10.28.7.194:443: read: connection reset by peer"
rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: read tcp 127.0.0.1:42956->127.0.0.1:9099: read: connection reset by peer"  
rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: read tcp 127.0.0.1:54274->127.0.0.1:9099: read: connection reset by peer"

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 17, 2024
motoki317 added a commit to motoki317/ingress-nginx that referenced this issue Nov 21, 2024
The /wait-shutdown preStop script's only job is to send SIGTERM to nginx-ingress-controller,
which is PID 1, so the behavior is the same with or without it in Kubernetes environments.

See kubernetes#6287 for discussion.
@motoki317 motoki317 linked a pull request Nov 21, 2024 that will close this issue