☂️ Improve VPA Recommendations #47

Closed · 4 of 18 tasks

vlerenc opened this issue May 27, 2020 · 22 comments
@vlerenc (Member) commented May 27, 2020

What would you like to be added:
We would need better VPA recommendations for various cases of CPU and memory usage.

Why is this needed:
We see frequent outages of components such as the gardenlet when load increases suddenly (spikes), because VPA bases its recommendations only on the history rather than on the actual current usage.

Discussion:
See: https://sap-cp.slack.com/archives/GBVUBHM5K/p1590564020106000?thread_ts=1590540922.098400&cid=GBVUBHM5K

Roadmap

  • Include memory samples in TotalSamplesCount to react better to OOMKills #101
    • Bug: there is a TotalSamplesCount variable which is updated every time a usage sample is added to the VPA recommender. However, the recommender increments it only for CPU samples; memory samples are not counted. During an OOM scenario the CPU usage sample remains zero, so TotalSamplesCount stays 0. Hence, even though the recommender records memory samples, it does not give any recommendation, because the code flow checks for TotalSamplesCount > 0 before recommending (see the TotalSamplesCount sketch after this list)
    • This is a prerequisite for some of the other improvements mentioned below
    • Unit tests
  • Add a Grafana dashboard to monitor OOMKill and CPU throttling behavior and the resulting recommendations
  • Configurable OOMBumpUpRatio and OOMMinBumpUp values, which are currently hardcoded (see the flag sketch after this list)
    • While the VPA is generally able to recover a pod from an OOMKill, the time it takes or the number of restarts is sometimes unacceptable. We can reduce the number of restarts, and consequently the time taken to recover, by making these values customizable, and maybe even dynamically calculated
    • The new ratio can either be defined as a static configuration in the VPA resource, or be computed dynamically from the samples of the last N minutes/samples before the OOMKill (with very high weight on the most recent samples) to especially handle spikes. The dynamic method ensures that the VPA CRD remains untouched, but will require some simulations and tests to come up with a good algorithm
    • Unit tests
  • In cases of container errors/restarts, do not recommend values lower than last utilization
    • To avoid resource starvation for the container once it recovers from error
    • For instance, recommendation = max(recommendation, last utilization or max(last N utilization snapshots)) (see the windowed-floor sketch after this list)
    • Or maybe also wait for N minutes/hours before deciding to recommend lower values
    • This behaviour can become better with predictive scaling, as discussed later
  • Incorporate CPU throttling information, which is currently ignored
    • The VPA recommender already has special behaviour to observe and record OOMKills with RecordOOM, via the oom_observer package. A similar package, e.g. a cputhrottling_observer, needs to be built to handle CPU spike and throttling cases
    • The source for this metric must be configurable and optional, to retain backward compatibility
    • Unit tests
  • Make all/most hardcoded values configurable (via the VPA resource or at least as CLI flags)
  • Scale requests and limits separately
    • Proportional scaling works for most but not all cases
    • Find out in which cases it doesn’t work well (pod initialization, burst usage, etc.)
    • We don’t want ridiculously high limits like 80GB of memory, but we also don’t want values so small that pod initialization itself fails
    • For instance, limits recommendation = max or P99 of the limits observations in the last N samples or in a given time window (the windowed-floor sketch after this list illustrates the max/percentile computation)
    • Unit tests
  • Explore horizontal scaling using request rates
    • We can then use VPA and HPA separately, without the need for HVPA
    • This can be done only if VPA by itself is capable of handling all the special cases that HVPA can handle, and if it is as configurable as HVPA
    • Unit tests
  • Individual pod scaling instead of entire controller resources (deployment/statefulset/replicaset)
    • Only scale the pods that are being stressed
    • With in-place resource scaling, can also be applied to container-level scaling
    • Unit tests
  • Predictive scaling of resources using machine learning or deep learning
    • Allows VPA Recommender to be proactive instead of reactive
    • Resource-specific models would understand patterns better and provide better scaling recommendations
    • For instance, the daily hibernation schedule of Gardener shoots puts high load on the clusters' control planes at a fixed time every day. With predictive scaling, VPA would be able to scale up the resources of these components a few minutes before they get stressed, so that potential downtime can be avoided
    • Tests
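
A minimal sketch (in Go, with simplified types that are not the actual VPA recommender code paths) of the TotalSamplesCount item above: a hypothetical aggregate state that counts memory samples as well as CPU samples, so a recommendation can be produced even when only memory samples were recorded (e.g. during an OOM):

```go
// Sketch only: a trimmed-down, hypothetical aggregate container state.
package main

import "fmt"

type ResourceName string

const (
	ResourceCPU    ResourceName = "cpu"
	ResourceMemory ResourceName = "memory"
)

type AggregateState struct {
	TotalSamplesCount int
	cpuSamples        []float64 // cores
	memorySamples     []float64 // bytes
}

// AddSample records a usage sample. The point of the sketch: increment
// TotalSamplesCount for memory samples too, not only for CPU samples.
func (s *AggregateState) AddSample(resource ResourceName, value float64) {
	switch resource {
	case ResourceCPU:
		s.cpuSamples = append(s.cpuSamples, value)
		s.TotalSamplesCount++
	case ResourceMemory:
		s.memorySamples = append(s.memorySamples, value)
		s.TotalSamplesCount++ // previously missing: memory-only samples kept the count at 0
	}
}

// HasRecommendation mirrors the "TotalSamplesCount > 0" gate described above.
func (s *AggregateState) HasRecommendation() bool {
	return s.TotalSamplesCount > 0
}

func main() {
	state := &AggregateState{}
	// OOM scenario: only a memory sample is recorded, CPU usage stays zero.
	state.AddSample(ResourceMemory, 512*1024*1024)
	fmt.Println("can recommend:", state.HasRecommendation()) // true with the fix
}
```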
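A sketch of the configurable OOMBumpUpRatio/OOMMinBumpUp item above; the flag names, defaults, and exact bump semantics here are illustrative assumptions, not the upstream VPA flags:

```go
// Sketch only: expose the OOM bump-up behaviour via CLI flags instead of
// hardcoded constants.
package main

import (
	"flag"
	"fmt"
)

var (
	oomBumpUpRatio = flag.Float64("oom-bump-up-ratio", 1.2,
		"Factor by which memory is multiplied after an OOMKill.")
	oomMinBumpUp = flag.Float64("oom-min-bump-up-bytes", 100*1024*1024,
		"Minimum absolute memory increase (bytes) after an OOMKill.")
)

// bumpedMemory returns the memory to record for an OOM sample: at least
// ratio * usage, and at least usage + minBump, whichever is larger.
func bumpedMemory(memoryUsedBytes float64) float64 {
	byRatio := memoryUsedBytes * *oomBumpUpRatio
	byMinBump := memoryUsedBytes + *oomMinBumpUp
	if byRatio > byMinBump {
		return byRatio
	}
	return byMinBump
}

func main() {
	flag.Parse()
	fmt.Printf("OOM at 200Mi -> record %.0f bytes\n", bumpedMemory(200*1024*1024))
}
```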
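A sketch of the "never recommend lower than recent utilization" and "max or P99 over the last N samples" ideas above; the helper name and percentile handling are hypothetical:

```go
// Sketch only: clamp a candidate recommendation to a floor derived from the
// last N usage snapshots (max or a high percentile such as P99).
package main

import (
	"fmt"
	"sort"
)

// floorByRecentUsage never returns less than the chosen aggregate of the
// recent usage snapshots; percentile = 1.0 selects the maximum.
func floorByRecentUsage(candidate float64, recentUsage []float64, percentile float64) float64 {
	if len(recentUsage) == 0 {
		return candidate
	}
	sorted := append([]float64(nil), recentUsage...)
	sort.Float64s(sorted)
	idx := int(percentile*float64(len(sorted))) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	floor := sorted[idx]
	if candidate < floor {
		return floor
	}
	return candidate
}

func main() {
	lastN := []float64{0.2, 0.25, 0.9, 0.3} // cores used in the last N snapshots
	fmt.Println(floorByRecentUsage(0.25, lastN, 1.0))  // 0.9: never below the recent max
	fmt.Println(floorByRecentUsage(0.25, lastN, 0.75)) // 0.3 with a P75 floor
}
```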
vlerenc added the area/robustness, kind/enhancement, and area/auto-scaling labels on May 27, 2020
@hardikdr (Member) commented Jun 1, 2020

Hi Vedran, is this issue really meant for the cluster-autoscaler? Or does it perhaps fit better in hvpa?

@amshuman-kr commented:

@hardikdr This is really for VPA. But there is no good place for filing issues for VPA. I think @vlerenc created it here because this is a fork of autoscaler and that technically includes VPA (though really we are forking only cluster-autoscaler).

It is currently open whether we go for a custom VPA recommender of our own or enhance the existing VPA recommender.

@gardener-robot commented:

It is meant for VPA as in https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler, which we don't have. ;-) I wrote in the Slack channel that I don't know where to open this one. @amshuman-kr was part of this discussion in Slack. HVPA doesn't fit, I think.

@hardikdr (Member) commented Jun 1, 2020

I see, thanks @amshuman-kr and @gardener-robot for the clarification .. :D

@vlerenc (Member, Author) commented Jun 1, 2020

Yeah, I am sometimes incognito. ;-)

gardener-robot added the lifecycle/stale label on Aug 1, 2020
@amshuman-kr commented:

/assign @kallurbsk

@amshuman-kr commented:

Somehow "/assign" didn't work and I do not have rights to assign directly either. @prashanth26 @hardikdr @AxiomSamarth could you please help?

@amshuman-kr commented:

Adding a brief description of goals here because I don't have the rights to edit the issue description.

Goals

  • Pluggable in place of VPA recommender.
  • React better to CPU and memory usage spikes (especially, avoid recommending lower than resource usage)
  • React better to crashloops and possibly achieve automatic recovery if the crashes are triggered by lack of resources
  • Scale down faster if usage is consistently lower for some period of time (for the time being configurable globally via CLI flags, because we are re-using the VPA CRD)
  • Optionally support recommendations in annotations, so that this recommender can be used in parallel with the VPA recommender and compared, by enhancing vpa-exporter to additionally export the recommendations from the annotations.

@hardikdr (Member) commented Sep 21, 2020

@kallurbsk needs to be added in the gardener org first, before assigning the issues to him.

@timebertt (Member) commented:

Will this be contributed upstream in VPA or will we maintain another fork / gardener specific implementation?

In a discussion w/ @rfranzke @vpnachev and some others, we were wondering if it would already help to make OOMBumpUpRatio and OOMMinBumpUp configurable.
Are you planning on doing so? Or is this not reasonable?

@amshuman-kr commented:

Will this be contributed upstream in VPA or will we maintain another fork / gardener specific implementation?

I am leaning towards our own implementation.

  • Considering that CLI flags for VPA recommender aggregations config. kubernetes/autoscaler#2905 (needed to reduce costs in our seeds) has still not been released since May 2020, I think we have no option but to go for a fork at least.
  • But in fact, I have a much simpler from-scratch implementation in mind (instead of a fork), with a bias towards the present rather than the history and with a stabilization period for scale down (configurable via CLI flag, for the time being), which can make it work both for stable load and for spikes.

@amshuman-kr commented:

/assign @kallurbsk

gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 6, 2020
@vlerenc (Member, Author) commented Jan 27, 2021

I think the link to the document (WiP at the time of this writing) has not been shared yet: https://github.com/kallurbsk/autoscaler/blob/master/vertical-pod-autoscaler/docs/proposals/crash_loop_handling/README.md

@vlerenc (Member, Author) commented Jan 28, 2021

@kallurbsk @amshuman-kr Some thoughts after reading the proposal:

  • If we double, won't we possibly start seeing flapping behaviour?
  • It would be good to run a test/simulation of how the new recommender logic would fare in comparison to the current one on some real recorded issue(s): does it give more resources/the necessary resources when needed? Does it use more, or potentially even less, resources overall, because of the more aggressive scale-up resp. faster scale-down?
  • The idea of using no_of_crashes to determine when to double CPU and memory is an interesting/pragmatic one. Nice. How is the counter reset?
  • Would it be possible to put the configuration knobs and switches into the VPA resources (instead of, or in addition to (as overrides), the global (then default) settings) to drive the behaviour of the recommender more individually (e.g. as annotations, if nothing else/for the time being, until maybe the autoscaler community shows interest in adopting this recommender)?

@kallurbsk commented:

Thanks for your comments, @vlerenc, and for taking the pain to read through the “not so clean” commit.
To quickly address the 2nd point: I will be submitting a simulation that we are running manually of how the new recommender would handle a CrashLoopBackOff caused by peaking resource contention. I will also address the other comments once I upload the latest version from my side within a day.

@amshuman-kr commented Jan 28, 2021

If we double, won't we start seeing possibly a flapping behaviour?

It is really a trade-off between cost saving (by the recommendation trying to closely hug the usage curve) and reducing scaling-based disturbance. If we keep the scale-down window short (less than 1 or 2 hours), we save cost but risk multiple scalings during the day if there are multiple load spikes. If we keep it longer, we reduce disturbance but over-provision.

I think 1 or 2 hours is a good heuristic to start with. If there are multiple spikes during the day that last less than an hour, can we really afford to over-provision for the spikes even during long periods of low load?

These are just mental heuristics. We clearly have to learn which time windows make sense, of course.

  • It would be good to run a test/simulation of how the new recommender logic would fare in comparison to the current one on some real recorded issue(s) (is it giving more resources/the necessary resources when needed?; is it using more or potentially even less resources, because of the more aggressive scale-up resp. faster scale-down?)

The first deliverable, in my mind, is that we should deploy the new custom recommender in a read-only mode in parallel to the upstream VPA recommender, i.e. the custom recommender does not update the recommendation in the VPA status but in an annotation on the VPA resource, which is then exported as a different metric by the vpa-exporter (also scraped and retained by Prometheus). This way we can compare the recommendations and assess if anything needs to be improved. Once we are confident in the recommendations, we deploy it in a more active way by updating the recommendation in the VPA status (and disabling the upstream VPA recommender).
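
A sketch of this read-only/parallel mode, assuming a hypothetical annotation key and simplified types (not the VPA API):

```go
// Sketch only: write the custom recommendation into an annotation instead of
// the VPA status, so the upstream recommender stays in charge and an exporter
// can expose the annotation as a separate metric for comparison.
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical annotation key, not part of any existing API.
const recommendationAnnotation = "autoscaling.gardener.cloud/experimental-recommendation"

// ContainerRecommendation is a simplified stand-in for the VPA recommendation type.
type ContainerRecommendation struct {
	ContainerName string `json:"containerName"`
	CPU           string `json:"cpu"`
	Memory        string `json:"memory"`
}

// annotateRecommendation serializes the recommendations into the annotations
// map instead of touching the status.
func annotateRecommendation(annotations map[string]string, recs []ContainerRecommendation) error {
	raw, err := json.Marshal(recs)
	if err != nil {
		return err
	}
	annotations[recommendationAnnotation] = string(raw)
	return nil
}

func main() {
	annotations := map[string]string{}
	recs := []ContainerRecommendation{{ContainerName: "gardenlet", CPU: "500m", Memory: "1Gi"}}
	if err := annotateRecommendation(annotations, recs); err != nil {
		panic(err)
	}
	fmt.Println(annotations[recommendationAnnotation])
}
```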

@kallurbsk Can you please make this point (parallel deployment and assessment) explicit in the proposal?

The idea with no_of_crashes to determine when to double CPU and memory is an interesting idea/pragmatic one. Nice. How is the counter reset?

@kallurbsk Can you please update the proposal with details about the criteria to increment as well as to reset these counters? Also about what happens to this state if the recommender restarts in the meantime (it is ok to say that such state is lost if the recommender restarts :-)), but being explicit helps.

Would it be possible to put the configuration knobs and switches (instead or in addition (overrides) to the global (then default) settings) into the VPA resources to drive the behaviour of the recommender more individually (e.g. as annotations if nothing else/for the time being and until maybe the autoscaler community shows interest in adopting this recommender)?

Configurability at the VPA resource level is very much desirable. For the time being, the only way we can do it is by using annotations, which is ugly but workable. Another alternative is to continue to stick to global flags for now and use HVPA to control the scale-down update policy, like we do for etcd right now.

@amshuman-kr commented:

If we double, won't we start seeing possibly a flapping behaviour?

In general, I think there are two parts to the overall auto-scaling problem.

  1. How good is the recommendation in terms of tracking usage and auto-healing.
  2. How to avoid disturbance due to too frequent scaling for workloads that are sensitive to restarts.

I think 1 is better addressed in this proposal and 2 is better addressed in HVPA (in fact, it already is in v2). Mixing the two in one solution may not be a good thing.

If we agree on this, it is also worth stating explicitly in the proposal.

@kallurbsk commented Jan 28, 2021

Also adding to the 1st point that Amshu mentioned: while scale-up is primarily based on current usage in the newer approach, scale-down still has a time-window parameter called scale_down_monitor_time_window. It can be defaulted to 1 hour, meaning that scale-down kicks in only based on the local maxima of the previous one hour. It is also configurable, which makes it possible to set a different scale-down window for applications based on their resource-consumption pattern (a small sketch of this gate follows below).
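
A sketch of such a scale-down gate, assuming hypothetical data structures around scale_down_monitor_time_window:

```go
// Sketch only: scale-down is allowed only down to the local maximum of the
// usage observed within the configurable monitor window.
package main

import (
	"fmt"
	"time"
)

type usageSample struct {
	at    time.Time
	value float64 // e.g. CPU cores or memory bytes
}

// scaleDownTarget returns the lowest value we are allowed to scale down to:
// the maximum usage seen within the monitor window ending at 'now'.
func scaleDownTarget(samples []usageSample, now time.Time, window time.Duration) float64 {
	cutoff := now.Add(-window)
	localMax := 0.0
	for _, s := range samples {
		if s.at.Before(cutoff) {
			continue // outside the monitor window
		}
		if s.value > localMax {
			localMax = s.value
		}
	}
	return localMax
}

func main() {
	now := time.Now()
	samples := []usageSample{
		{now.Add(-90 * time.Minute), 1.8}, // old spike, outside a 1h window
		{now.Add(-40 * time.Minute), 0.9},
		{now.Add(-5 * time.Minute), 0.4},
	}
	window := time.Hour // scale_down_monitor_time_window, defaulted to 1 hour
	fmt.Printf("do not scale below %.1f cores\n", scaleDownTarget(samples, now, window))
}
```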

@kallurbsk Can you please make this point (parallel deployment and assessment) explicit in the proposal?

Sure @amshuman-kr will cover that.

@kallurbsk Can you please update the proposal with details about the criteria to increment as well as reset these counters? Also about what happens to this state if the recommender restarts in the mean time (it is ok to say that such state is lost of the recommender restarts :-)) but being explicit helps.

Agreed. As of now the intention is to keep incrementing it until the VPA issues a "double increment on both CPU and memory resources". Also, this state is subject to restarts of the VPA recommender itself; if it restarts, we need to reset the counter to its default value. I will add this point to the document too.

@vlerenc (Member, Author) commented Jan 28, 2021

Thanks @kallurbsk and @amshuman-kr. Yes, it's a trade-off, and I forgot about the recommender in read-only mode. When we pull out the data, we might get some indication of whether the recommendations would have helped. Since they were not applied, though, it will be difficult to assess whether they would have helped in the hours after as well (and what else would have happened, had we applied them...). Still, it doesn't have to be perfect/fully simulated; having some data to challenge the assumptions and later the implementation may help us, otherwise tuning may become difficult. Thanks!

@amshuman-kr commented:

Since they were not applied though, it will be difficult to assess whether they would have helped in the hours after, also (and what else would have happened, had we applied them...).

Yes. It may not be perfect data, but I think we can still learn many things. For example,

  1. If the new recommendations would have helped auto-heal from a crashloop.
  2. If the new recommendations lead to more or less frequent scaling.
  3. If the new recommendations track the usage curve more or less closely.

@kallurbsk commented Feb 17, 2021

Tasks split for new VPA Recommender

  • Define proposal document for new VPA recommender - Proposal

  • First level code skeleton to design new VPA recommender - Code

  • v0 version complete of the new VPA recommender code. - Code

  • Unit testing and code coverage to cover all new methods written as a part of VPA recommender
    Code
    UT

  • Code refactoring based on the review comments (This will be an ongoing item but I will be mentioning this item here so that it can be tracked)

  • Including component level threshold parameters within VPA CRD annotations in the code.

  • Test v0 version of new VPA recommender in developer landscape shoot cluster with progrium/stress
    Manual Test Results and Analysis

  • Test v0 version of new VPA recommender along with existing VPA recommender code in suggest only mode in developer landscape. - issue2

  • Integration test suite for automation of manual functional tests completed in the above point. - issue3

  • Use new VPA recommender in write mode and observe changes in the behavior from older VPA recommender in developer landscape - issue4

  • Use new VPA recommender in read only mode along with existing VPA recommender in live landscape - issue5

  • Use new VPA recommender in write mode in live landscape

amshuman-kr added this to the 2021-Q1 milestone on Jun 7, 2021
amshuman-kr modified the milestones: 2021-Q1 → 2021-Q3 on Jun 9, 2021
shreyas-s-rao changed the title from "Improve CPU Spike Handling with Spike-Aware VPA Recommender" to "Improve VPA Recommender" on Oct 4, 2021
shreyas-s-rao modified the milestones: 2021-Q3 → 2022-Q1 on Oct 6, 2021
shreyas-s-rao changed the title from "Improve VPA Recommender" to "☂️ Improve VPA Recommender" on Oct 6, 2021
shreyas-s-rao changed the title from "☂️ Improve VPA Recommender" to "☂️ Improve VPA Recommendations" on Oct 6, 2021
@vlerenc (Member, Author) commented Mar 14, 2022

/close as we now have another ticket (internally) that lists the issues in another format.
cc @voelzmo

vlerenc removed this from the 2022-Q1 milestone on Mar 14, 2022
gardener-robot added the kind/roadmap label on Mar 23, 2023