
add shutdown_delay option to executor & gRPC GracefulStop #3711

Merged
merged 3 commits into SeldonIO:master
Nov 2, 2021

Conversation

asobrien
Contributor

What this PR does / why we need it:

Several changes are introduced in this PR; the individual commits reflect the changes and are described in more detail below. In summary, and in order of importance, the changes are:

  1. (3b098ea) Introduces shutdown_delay option in executor and adds graceful termination to gRPC server.
  2. (33b04da) Fixes maxprocs logging.
  3. (b9122dc) Return error if ENGINE_PREDICTOR env var is not found when getPredictorFromEnv() is called.

gRPC graceful shutdown

The executor has a graceful_timeout (default: 15s) option, which allows for graceful termination of the http (REST) server through the net/http Server's Shutdown() method. When there are no connections to the http server, Shutdown() doesn't block and os.Exit() is called immediately. The gRPC server, which runs in the main goroutine, is abruptly terminated without calling its associated GracefulStop() method to drain active connections.

The changes introduced here run both the http and gRPC servers in their own goroutines and use per-server channels to signal shutdown. The context.WithTimeout() created from the graceful_timeout option is moved out of runHttpServer and is used within the waitForShutdown() function. While the http server can now block for a period longer than the graceful timeout, the server will still be stopped when the program exits. The gRPC server does not use a context in its graceful shutdown method: the graceful timeout expiry is handled within waitForShutdown().
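
For reference, the overall pattern looks roughly like the sketch below. This is not the executor's actual code: the ports, the hard-coded timeout, and the error handling are illustrative only.

package main

import (
	"context"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"google.golang.org/grpc"
)

func main() {
	gracefulTimeout := 15 * time.Second // stands in for the graceful_timeout option

	httpServer := &http.Server{Addr: ":8000"}
	grpcServer := grpc.NewServer()

	// Run both servers in their own goroutines instead of blocking main on gRPC.
	go func() { _ = httpServer.ListenAndServe() }()
	go func() {
		lis, err := net.Listen("tcp", ":5000")
		if err != nil {
			os.Exit(1)
		}
		_ = grpcServer.Serve(lis)
	}()

	// Block until a termination signal arrives.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Drain gRPC connections in the background; GracefulStop takes no context.
	grpcDone := make(chan struct{})
	go func() {
		grpcServer.GracefulStop()
		close(grpcDone)
	}()

	// The http server honours the graceful timeout via a context.
	ctx, cancel := context.WithTimeout(context.Background(), gracefulTimeout)
	defer cancel()
	_ = httpServer.Shutdown(ctx)

	// Force-stop gRPC if draining outlives the graceful timeout.
	select {
	case <-grpcDone:
	case <-ctx.Done():
		grpcServer.Stop()
	}
}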

server shutdown_delay

We run several Seldon models on a Kubernetes cluster and have configured Seldon to use Ambassador as a gateway. When we do rolling updates (redeployments) of the model, we see a significant increase in error rate, as measured by an upstream service. The errors look something like this:

<_Rendezvous of RPC that terminated with:	
status = StatusCode.UNAVAILABLE	
details = "upstream connect error or disconnect/reset before headers. reset reason: local reset"	
debug_error_string = "
{
  "created":"@1635250379.345962743",
  "description":"Error received from peer ipv4:xxx.xxx.xxx.xxx:80",
  "file":"src/core/lib/surface/call.cc",
  "file_line":1041,
  "grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: local reset",
  "grpc_status":14
}"
>

These errors are seen in our upstream service whenever one of the pods (running a seldon-container-engine) is deleted. Some simple testing revealed that traffic is still directed at the terminating pod for a period after it has already received a SIGTERM and shut down, which leads to the errors above. We deploy the seldondeployment CRD with a seldon.io/headless-svc: "true" annotation. What we think is happening is that Envoy (via Ambassador) continues to keep the terminated pod's IP in rotation even after the pod has been removed. Some rudimentary measurements suggest that it can take over 10 seconds for a terminated pod to stop showing up in the DNS records for the service. Additionally, by default, Envoy refreshes from DNS every 5 seconds.

We found that we could delete a pod without seeing gRPC errors if we utilized a shutdown_delay of 30s. By default this new option is set to 0, so there is no change in existing behavior.
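
Conceptually the new option just inserts a wait between receiving the termination signal and starting the graceful shutdown described above. A hedged sketch (the names below are illustrative, not the executor's actual identifiers):

// delayedShutdown waits for a termination signal, sleeps for shutdown_delay
// (0 by default, so existing behaviour is unchanged), and only then kicks off
// the graceful shutdown. During the delay the pod keeps serving traffic while
// Envoy/DNS catch up and stop routing to it.
func delayedShutdown(stop <-chan os.Signal, shutdownDelay time.Duration, beginShutdown func()) {
	<-stop // e.g. SIGTERM from the kubelet during a rolling update
	if shutdownDelay > 0 {
		time.Sleep(shutdownDelay)
	}
	beginShutdown() // drain the http and gRPC servers as in the sketch above
}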

maxprocs logging

The logger.Info passed into maxprocs' Set() method does not behave like fmt.Printf(), which leads to panic-level errors in the logs:

{"level":"dpanic","ts":1635457222.747586,"logger":"entrypoint","msg":"odd number of arguments passed as key-value pairs for logging","ignored key":"1","stacktrace":"go.uber.org/automaxprocs/maxprocs.(*config).log\n\t/src/go/pkg/mod/go.uber.org/[email protected]/maxprocs/maxprocs.go:47\ngo.uber.org/automaxprocs/maxprocs.Set\n\t/src/go/pkg/mod/go.uber.org/[email protected]/maxprocs/maxprocs.go:101\nmain.main\n\t/src/seldon-core/executor/cmd/executor/main.go:310\nruntime.main\n\t/opt/local/lib/go/src/runtime/proc.go:255"}

This is because, after the first (message) argument, logr expects additional arguments as key/value pairs. The change here adds an inline function so that only a single argument (a formatted string) is passed to logger.Info, and the above error is no longer generated. With this change, the above log line now appears as:

{"level":"info","ts":1635467960.469512,"logger":"entrypoint.maxprocs","msg":"maxprocs: Honoring GOMAXPROCS=\"1\" as set in environment"}

missing ENGINE_PREDICTOR env var

The getPredictorFromEnv() function is modified to error in a similar manner to getPredictorFromFile(). If a predictor can't be loaded from a file (because the file is missing or unsupported), that function returns an error. If no filename is specified, a predictor will be loaded from the environment. However, if the ENGINE_PREDICTOR env var is missing, a nil pointer is returned, which leads to a subsequent panic. This change returns an error if getPredictorFromEnv() is called and no corresponding ENGINE_PREDICTOR env var has been set.
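
The guard boils down to an explicit environment lookup; a hypothetical sketch (predictorFromEnv and its string return type are stand-ins for illustration; the real function decodes and returns a predictor spec):

// Return an error, rather than a nil value, when ENGINE_PREDICTOR is unset.
func predictorFromEnv() (string, error) {
	val, found := os.LookupEnv("ENGINE_PREDICTOR")
	if !found {
		return "", fmt.Errorf("ENGINE_PREDICTOR environment variable not found")
	}
	return val, nil // the executor would go on to decode this into a predictor spec
}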

Which issue(s) this PR fixes: None.

The issue is described within this PR.

Special notes for your reviewer: None.

Does this PR introduce a user-facing change?: Yes.

Adds shutdown_delay option to executor

@seldondev
Collaborator

Hi @asobrien. Thanks for your PR.

I'm waiting for a SeldonIO or todo member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository.

@asobrien
Contributor Author

/assign @adriangonz

@asobrien changed the title from "add shutdown_delay option to exector & gRPC GracefulStop" to "add shutdown_delay option to executor & gRPC GracefulStop" on Oct 29, 2021
@ukclivecox
Contributor

/test integration

@ukclivecox
Contributor

/test notebooks

@ukclivecox
Contributor

@asobrien Thanks for these improvements! We will run integration and notebook tests and have a look. @axsaucedo

@ukclivecox
Contributor

/test integration

@ukclivecox
Contributor

/test notebooks

@asobrien
Contributor Author

/retest

@seldondev
Collaborator

@asobrien: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository.

@axsaucedo
Contributor

/test integration

@axsaucedo
Contributor

/test notebooks

@seldondev
Collaborator

@asobrien: The following tests failed, say /retest to rerun them all:

Test name    Commit   Details  Rerun command
integration  b9122dc  link     /test integration
notebooks    b9122dc  link     /test notebooks

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository. I understand the commands that are listed here.

@axsaucedo
Contributor

All integration tests pass; only two tests failed and they are confirmed flaky, so merging. Thanks for the contribution @asobrien
/approve

@axsaucedo axsaucedo merged commit 3a9e469 into SeldonIO:master Nov 2, 2021
@seldondev
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: axsaucedo

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stephen37 pushed a commit to stephen37/seldon-core that referenced this pull request Dec 21, 2021

* shutdown delay & gRPC graceful shutdown

* fix maxprocs logger

* getPredictorFromEnv missing key error & tests