
Progressive traffic increase for new Pods (slow start mode) #11050

Closed
costimuraru opened this issue May 4, 2020 · 12 comments · Fixed by #13176
@costimuraru

Title: Support for progressive traffic increase for new Pods (slow start mode)

Description:
TL;DR: It would be useful to have a slow start mode that allows us to add new pods without overwhelming them with a flood of requests. Similar to this feature from AWS: https://aws.amazon.com/about-aws/whats-new/2018/05/application-load-balancer-announces-slow-start-support/

We have a JVM-based web app behind Contour/Envoy/NLB, with horizontal pod autoscaling in place.
When a new pod gets created due to autoscaling, Contour/Envoy directs a proportional amount of traffic to that new pod. However, the app that has just started is overwhelmed by the flood of requests, and we see consistent timeouts until it warms up (a couple of minutes). Because of this, whenever we scale out our app, we lose data. While discussing this with other teams inside Adobe, we've noticed this is a common problem with JVM-based apps.


[graph omitted]

(as you can see in the graph above, whenever a new pod gets created, requests start failing for a couple of minutes)

We tried the same scenario using a Service of type LoadBalancer in EKS (with an Elastic Load Balancer in front), and we don't see the issue. The ELB progressively increases traffic to the new pod, as seen in the graph below.

[graph omitted]

(in the graph above, you can see the number of requests received by the new pod from the ELB, which is gradually increasing)
@mattklein123
Member

This is something that I have wanted to add for quite some time. I think the easiest implementation would be to keep track of host addition and ramp time, and if this option is enabled, scale the host picks for round robin (RR) and least request (LR) by some amount during the warm-up period. cc @snowp @tonya11en
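The idea above can be sketched in a few lines: scale each host's effective weight by how long it has been in service, so the load balancer picks freshly added hosts less often until a warm-up window elapses. This is only an illustration of the approach, not Envoy's actual implementation; the function name, the `aggression` knob, and the exact ramp formula are assumptions for the sketch.

```python
def slow_start_weight(base_weight, host_added_at, now,
                      window_s=60.0, aggression=1.0):
    """Return a host's effective load-balancing weight, ramped by the
    time it has been in service.

    Hosts newer than `window_s` seconds get a proportionally reduced
    weight, so RR/least-request picks them less often. `aggression > 1`
    front-loads the ramp (traffic grows faster early on). All names and
    the formula are illustrative, not Envoy's exact code.
    """
    in_service = max(now - host_added_at, 0.0)
    if in_service >= window_s:
        return base_weight  # warm-up finished: full weight
    # Ramp factor in (0, 1]; floor it so a brand-new host is never at 0.
    factor = max(in_service / window_s, 0.01) ** (1.0 / aggression)
    return base_weight * factor
```

With a linear ramp (`aggression=1.0`), a host halfway through a 60-second window would receive half its configured weight.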

@nezdolik
Member

@mattklein123 I would like to help with this.

@mattklein123
Member

@nezdolik awesome, sounds great. Do you want to put together a short design doc on an implementation proposal?

@nezdolik
Member

@mattklein123 will do

@nezdolik
Member

@costimuraru @mattklein123 @snowp @tonya11en please take a look at RFC: https://docs.google.com/document/d/1NiG1X0gbfFChjl1aL-EE1hdfYxKErjJ2688wJZaj5a0/edit?usp=sharing

@mattklein123
Member

Thanks @nezdolik for working on this! Overall looks great. There are a few comment threads to work through in the doc but very excited to see this being worked on.

@wbpcode
Member

wbpcode commented Aug 21, 2020

Is there any progress on this work? I very much hope it will be completed soon. If needed, perhaps I can help as well.

@Stono

Stono commented Nov 4, 2020

As an organisation that is 75% Java, we'd love this.

@nightmareze1

+1

@ejc3

ejc3 commented Nov 6, 2020

This would be a great feature!

In the meantime, what we've done for some apps is run a small load test from within the pod to warm it up. Even with slow start, a per-pod warm-up would still be useful when all Pods/VMs behind Envoy are restarted at the same time, to ensure they can properly serve traffic and won't be instantly overwhelmed when put into service.
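A per-pod warm-up like the one described above can be as simple as a script that fires a burst of requests at the app on localhost before it is added to the load balancer, forcing the JVM to JIT-compile its hot paths. The URL, endpoint, and request count below are placeholders for your own service, not anything from this thread.

```python
import urllib.request

def warm_up(url="http://127.0.0.1:8080/healthz", requests_to_send=200):
    """Send a burst of local requests to warm the app up.

    Returns the number of requests that came back with HTTP 200.
    Connection errors are tolerated because the app may still be
    booting when the warm-up starts.
    """
    ok = 0
    for _ in range(requests_to_send):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                ok += resp.status == 200
        except OSError:
            pass  # app not ready yet (refused/timeout); keep going
    return ok
```

In Kubernetes this could run as a startup probe or in an init step, so the pod only passes readiness once the warm-up burst succeeds.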

@costimuraru
Author

costimuraru commented Feb 26, 2021

The RFC looks great, @nezdolik. Is there anything preventing us from implementing it?

@nezdolik
Member

nezdolik commented Mar 8, 2021

@costimuraru there is an in-progress PR (#13176); "slow start" is slowly moving forward.
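For readers finding this thread later: the work referenced above eventually landed as a per-cluster slow start option in Envoy. A cluster enabling it might look roughly like this; the field names follow the Envoy documentation at time of writing, but verify them against your Envoy version, and the cluster name and window values here are just examples:

```yaml
clusters:
- name: my_jvm_service        # example name, not from this thread
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  round_robin_lb_config:
    slow_start_config:
      slow_start_window: 60s  # ramp traffic to new hosts over one minute
      aggression:
        default_value: 1.0    # 1.0 = linear ramp; >1 front-loads traffic
        runtime_key: slow_start.aggression
```

Least-request load balancing supports the same `slow_start_config` block under `least_request_lb_config`.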


7 participants