Proposal: Last Call mode for Envoy [AKA graceful drain for drain manager on admin endpoint] #10592

Closed
auni53 opened this issue Mar 31, 2020 · 6 comments
Labels
area/connection · design proposal (Needs design doc/proposal before implementation) · stale (stalebot believes this issue/PR has not been touched recently)

Comments

@auni53
Contributor

auni53 commented Mar 31, 2020

Last Call mode

Description

In our load balancer, each host should go through a graceful shutdown process. The process should discourage new and hanging connections, eventually stop accepting new connections, and prepare the process to shut down with as few hard disconnects as possible. I propose a Last Call mode, in which different parts of the Envoy server know that the server is shutting down and change their behavior accordingly.

Design doc

Comment on this doc or on the thread. It should be openly commentable, but let me know if there are permission issues.

Proposed steps

  • Read the design doc, provide feedback on the idea of an Envoy last call mode
  • Update the design doc if needed, then start on the feature work. Some of the sub-features may be complicated enough to warrant separate design docs.

Related issues

@mattklein123
Member

@auni53 can you make the design doc world commentable? I can't open it.

Also, FYI, we already have the "drain manager" concept which already does some of this, so curious how that fits into this proposal.

@auni53
Contributor Author

auni53 commented Mar 31, 2020

Ah, had it org-commentable rather than global-commentable. Hopefully fixed.

Understanding the existing draining infrastructure has been a little confusing: different parts of the docs and code refer to draining in different parts of the system. Part of the Last Call mode would be to clarify how the API interacts with things like the drain manager, or potentially to make enhancements to it. But that gets into implementing the Last Call feature requests, beyond the existence of the mode itself.

@mattklein123 mattklein123 added design proposal Needs design doc/proposal before implementation area/connection labels Mar 31, 2020
@mattklein123
Member

Thanks @auni53. I left some comments on the doc. Much of the proposal is already implemented either as part of hot restart or as part of /healthcheck/fail. I think we need to circle back and better clarify what is already implemented vs. what is new. Happy to discuss in a short meeting also if needed. Thank you!
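
For readers following along, the `/healthcheck/fail` mechanism mentioned above can be driven from a small script. This is only a sketch under stated assumptions: the admin address `127.0.0.1:9901` and the helper names are illustrative and not taken from this thread, and the endpoint only affects health checking when the HTTP health check filter is configured on the listener.

```python
import urllib.request

# Assumption: the Envoy admin interface listens here. Adjust to your bootstrap config.
ADMIN = "http://127.0.0.1:9901"


def admin_post(path, base=ADMIN, opener=urllib.request.urlopen, timeout=5.0):
    """POST an empty body to an Envoy admin path and return the HTTP status code."""
    req = urllib.request.Request(base + path, data=b"", method="POST")
    with opener(req, timeout=timeout) as resp:
        return resp.status


def fail_health_checks(**kwargs):
    """Ask Envoy to start failing inbound health checks (admin /healthcheck/fail)."""
    return admin_post("/healthcheck/fail", **kwargs)


def restore_health_checks(**kwargs):
    """Undo the above (admin /healthcheck/ok), e.g. if a shutdown is cancelled."""
    return admin_post("/healthcheck/ok", **kwargs)
```

The `opener` parameter exists only so the helpers can be exercised without a running Envoy; in normal use the defaults are enough, for example `fail_health_checks()` at the start of a shutdown script.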

@auni53
Contributor Author

auni53 commented Mar 31, 2020

Thanks Matt. Yeah, I didn't make that clear, but part of why I wrote the doc was to clarify what our requirements were, because I was having a hard time telling what was or wasn't currently available. Your comments with references to specific mechanisms are really helpful. I'll investigate those and circle back.

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 30, 2020
@auni53 auni53 closed this as completed May 1, 2020
@auni53 auni53 changed the title Proposal: Last Call mode for Envoy Proposal: Last Call mode for Envoy [AKA graceful drain for drain manager on admin endpoint] Mar 16, 2021
@nikita2206

Hey there, was looking at this issue recently to understand how we should do graceful shutdowns in our org. My understanding is that there’s still one missing piece in Envoy, which is described in the doc:

Terminate idle connections that are in the header phase.
Basically, for the duration of Last Call, find idle connections that are still in the header phase and terminate them

Our situation is running Envoy as a sidecar in each Pod for service-to-service communication. Services talk plain HTTP and tend to keep TCP connections open for a while (HTTP keep-alive).
My goal is to get graceful shutdown working while either avoiding arbitrary sleeps in the preStop hooks or at least minimizing the time we have to sleep there, since it slows down rolling deployments.

@auni53 did you find an alternative solution here for idle TCP connections? I know that Envoy now implements GOAWAY, but I'm not sure we should switch to cleartext HTTP/2 (h2c) for service-to-service comms just for clean shutdowns.

Ideally we want to be able to shut down all idle TCP connections in the earliest phase, while letting the non-idle connections finish their in-flight HTTP requests.

@mattklein123 do you know if some of this was already implemented?
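
As a rough illustration of replacing a fixed preStop sleep with a bounded wait, the sequence below fails health checks, starts a graceful listener drain, and then polls a connection gauge until it reaches zero or a deadline passes. It is a sketch only: the admin address, the function names, the 30 second deadline, and the choice of `server.total_connections` as the gauge to watch are all assumptions, not something settled in this thread. It also does not close idle keep-alive connections, which is the gap described above; it only bounds how long the hook waits.

```python
import time
import urllib.request

# Assumption: the Envoy admin interface listens here.
ADMIN = "http://127.0.0.1:9901"
GAUGE = "server.total_connections"  # assumption: the gauge worth watching


def _post(path):
    """POST an empty body to an admin path and return the HTTP status code."""
    req = urllib.request.Request(ADMIN + path, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


def parse_gauge(stats_text, name):
    """Find 'name: value' in plain-text admin stats output; None if absent."""
    for line in stats_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return int(value.strip())
    return None


def _active_connections():
    """Read the gauge from the admin /stats endpoint."""
    url = ADMIN + "/stats?filter=" + GAUGE
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_gauge(resp.read().decode(), GAUGE)


def last_call(post=_post, active=_active_connections, deadline_s=30.0,
              poll_s=1.0, sleep=time.sleep, clock=time.monotonic):
    """Run the drain sequence; return True if connections reached zero in time."""
    post("/healthcheck/fail")                      # discourage new traffic
    post("/drain_listeners?graceful&inboundonly")  # begin graceful inbound drain
    end = clock() + deadline_s
    while clock() < end:
        if active() == 0:
            return True
        sleep(poll_s)
    return False  # deadline hit; the caller decides whether to exit anyway
```

A preStop hook could call `last_call()` and exit regardless of the result, so the deadline caps the delay added to a rolling deployment instead of a sleep that always runs to completion.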

3 participants