
Progressive traffic increase for new Pods #2296

Closed
costimuraru opened this issue Feb 28, 2020 · 17 comments · Fixed by #4772
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@costimuraru

costimuraru commented Feb 28, 2020

We have a JVM-based web app behind Contour/Envoy/NLB, with horizontal pod auto scaling in place.
When a new pod gets created due to auto scaling, Contour/Envoy directs a proportional amount of traffic to that new pod. However, because the app is cold, we're seeing consistent timeouts until it warms up.

Screenshot 2020-02-28 17 36 47

We tried the same scenario using a Service of type LoadBalancer in EKS (with an Elastic Load Balancer in front), and we don't see the same issue there. This seems to be because the ELB does a progressive traffic increase on the new pod, as seen in the graph below.
Screenshot 2020-02-28 17 34 56

Is there any plan to support something similar in Contour? I see we have the possibility to set weights for different services in an IngressRoute. Would it be something to consider to set weights at the pod level for a given service, based on pod age? (Or is something like this available today?)

@youngnick
Member

Thanks for logging this issue.

This sounds like a case where health checks from Contour or readiness checks from Kubernetes would help.

Kubernetes supports pod readiness checks, and Contour supports endpoint health checks, both of which can keep traffic away from an instance until it has warmed up, as long as your application can somehow indicate that it's ready.

Contour's endpoint health checks are only available in the HTTPProxy object (and the now-deprecated IngressRoute), however. Pod readiness checks are available in any recent version of Kubernetes.
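
As a rough sketch, a Kubernetes readiness probe along these lines could hold traffic back until the app reports ready. Everything specific below (image name, `/ready` path, port, and timings) is an illustrative assumption, not taken from this thread:

```yaml
# Hypothetical Deployment snippet: the probe only marks the pod Ready
# once the app answers 200 on /ready, so Contour/Envoy won't route to it before then.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jvm-web-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: example/jvm-web-app:latest
          readinessProbe:
            httpGet:
              path: /ready        # app should return 200 only once warmed
              port: 8080
            initialDelaySeconds: 30  # skip probing while the JVM boots
            periodSeconds: 5
            failureThreshold: 3
```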

@costimuraru
Author

Thanks, @youngnick.
This sounds like we need to warm up the new pods ourselves. The issue was asking whether this could be handled by Contour/Envoy itself, by doing a progressive traffic increase on the new pod(s), hence warming up the instance.

@stevesloka
Member

I agree with what @youngnick suggested. You could have your readiness probe call an endpoint that triggers the app to warm up, and set an initial delay that matches the time your app needs to spin up.

Additionally, you could look at adding a retry policy to the requests, so if a request does fail, it gets retried by Envoy.
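
A retry policy of that kind can be expressed on an HTTPProxy route. The following is a sketch only; the FQDN, service name, port, and retry values are assumptions:

```yaml
# Sketch: retry failed requests up to 3 times, with a per-attempt timeout,
# so a cold pod's transient failures are retried (possibly against another endpoint).
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: jvm-web-app
spec:
  virtualhost:
    fqdn: app.example.com
  routes:
    - services:
        - name: jvm-web-app
          port: 80
      retryPolicy:
        count: 3
        perTryTimeout: 500ms
```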

I'm going to close this out, but please re-open if you have further questions on this @costimuraru !

@costimuraru
Author

costimuraru commented Mar 2, 2020

Thanks for the response, @stevesloka

have your readiness probe call an endpoint which would trigger the app to warm up

I think we might not be on the same page regarding the warm up. The warm up is not related to the application being slow to start or anything like that. This is about the app warming up by processing (real) HTTP requests.

The scenario right now with Contour is:

  1. The app starts on the new pod and is ready to handle requests (this happens quite fast).
  2. Contour throws a lot of requests at the new pod.
  3. The app can't handle that many requests at once while still in a cold state, and crashes.

This problem is known and other load balancers have implemented algorithms to mitigate it. For example see this from the Application Load Balancer from AWS: https://aws.amazon.com/about-aws/whats-new/2018/05/application-load-balancer-announces-slow-start-support/

Application Load Balancers now support a slow start mode that allows you to add new targets without overwhelming them with a flood of requests. With the slow start mode, targets warm up before accepting their fair share of requests based on a ramp-up period that you specify.

This issue is related exactly to this kind of behavior, where Contour would be able to support a slow start mode and not overwhelm new pods with requests.

@costimuraru
Author

Hey, @youngnick, @stevesloka,

Any thoughts on the above?

Appreciate the feedback.

@youngnick
Member

Hi @costimuraru, currently, Contour does minimal configuration of Envoy aside from what it's directed to do by Kubernetes objects.

If I understand what you're asking for - having Contour detect new endpoint pods and gradually shift traffic to them - that would be a very large departure from Contour's current model of using Envoy: Contour would have to track the health of every endpoint of the service and gradually adjust each endpoint's weight over a given period.

I will speak to the team about this idea; we'll need to double-check whether Envoy has any feature that would make adding this to Contour easier.

@youngnick
Member

In addition, I think what @stevesloka and I were trying to suggest earlier is having the readiness check do some common requests to the app itself to warm the caches before marking the pod as ready for traffic.

@costimuraru
Author

costimuraru commented Mar 16, 2020

Thanks for the detailed answer, @youngnick!

In addition, I think what @stevesloka and I were trying to suggest earlier is having the readiness check do some common requests to the app itself to warm the caches before marking the pod as ready for traffic.

We tried this, but the number of requests is just too low to do any real warming (we're trying to warm up from 0 to ~4000 requests per second, for each pod). We also tried adding a PostStart lifecycle hook on the Pod, where we'd run an HTTP generator process to send requests to the app (via localhost), but this is also problematic. The warm-up takes quite a bit of time (e.g. ~2 minutes), during which the pod is not actually receiving any external traffic. Even if we add tens of pods due to a spike, we are not able to process the extra requests, because we have to wait for this warm-up period to finish (so we're back in the VM world, where it takes minutes to spin up a new machine).
It's also quite hard to generate requests that map to real-life use cases, as these are frequently updated. All in all, these warm-up workarounds add quite a lot of work and don't yield the best results.
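
For reference, the PostStart workaround described above might be wired up roughly like this. The load-generator tool (`hey`), its flags, the port, and the URL are assumptions; any tool that can generate localhost traffic would do:

```yaml
# Hypothetical container snippet: run a local load generator for 2 minutes
# after the container starts, to warm the app over localhost.
# Note the drawback described above: the pod serves no external traffic meanwhile.
lifecycle:
  postStart:
    exec:
      command:
        - /bin/sh
        - -c
        - "hey -z 2m -q 100 http://localhost:8080/ || true"
```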

@lrouquette

@costimuraru - this is more an Envoy issue in my mind (Contour could leverage that feature of course, once implemented in Envoy). Have you considered filing the issue in the Envoy project instead?

@costimuraru
Author

Thanks, @lrouquette. Created the issue in Envoy: envoyproxy/envoy#11050

@stevesloka stevesloka reopened this Dec 13, 2021
@stevesloka
Member

This is available in Envoy now so Contour could adopt the feature!

From slack convo:

We'd just need to plan out the API for how to implement it. We'd probably need to add the slow-start configuration to the services struct: https://github.com/projectcontour/contour/blob/main/apis/projectcontour/v1/httpproxy.go#L627

@skriss
Member

skriss commented Dec 16, 2021

cc @CrossingTheRiverPeole

@skriss skriss added kind/feature Categorizes issue or PR as related to a new feature. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Dec 16, 2021
@skriss
Member

skriss commented Dec 16, 2021

Added the help wanted label here if anyone is interested in picking up this issue!

@costimuraru
Author

It would be very useful for us to have support for this new Envoy feature in Contour.

@tsaarni tsaarni self-assigned this Sep 20, 2022
@skriss skriss removed the lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. label Oct 4, 2022
@skriss skriss added this to Contour Oct 4, 2022
@skriss skriss added this to the 1.23.0 milestone Oct 4, 2022
@skriss skriss moved this to In Progress in Contour Oct 4, 2022
Repository owner moved this from In Progress to Done in Contour Oct 6, 2022
@tailrecur

Thanks a lot for this !!

@tailrecur

@skriss If I understand the compatibility matrix correctly, this means this change would get rolled into the next release (1.23.0?), and the minimum supported K8s version for that release will be 1.23. Is this correct?

@sunjayBhatia
Member

yes that is correct
