☂️ Improve VPA Recommendations #47

Closed · 4 of 18 tasks

vlerenc opened this issue May 27, 2020 · 22 comments
@vlerenc (Member) commented May 27, 2020

What would you like to be added:
We would need better VPA recommendations for various cases of CPU and memory usage.

Why is this needed:
We see frequent outages of components such as the gardenlet when load increases suddenly (spikes), because VPA bases its recommendations only on the history rather than on the actual current usage.

Discussion:
See: https://sap-cp.slack.com/archives/GBVUBHM5K/p1590564020106000?thread_ts=1590540922.098400&cid=GBVUBHM5K

Roadmap

  • Include memory samples in TotalSamplesCount to react better to OOMKills #101
    • Bug: there is a TotalSamplesCount variable which is updated every time a usage sample is added to the VPA recommender. However, the recommender increments it only for CPU samples; memory samples are not counted. During an OOM scenario the CPU usage sample remains zero, so TotalSamplesCount stays 0. Hence, even though the recommender records memory samples, it does not give any recommendation, because the code flow checks for TotalSamplesCount > 0 before recommending (see the TotalSamplesCount sketch after this list)
    • This is a prerequisite for some of the other improvements mentioned below
    • Unit tests
  • Add a Grafana dashboard to monitor OOMKill and CPU throttling behavior and the resulting recommendations
  • Configurable OOMBumpUpRatio and OOMMinBumpUp values, which are currently hardcoded (see the flag sketch after this list)
    • While the VPA is generally able to recover a pod from an OOMKill, the time it takes or the number of restarts is sometimes unacceptable. We can reduce the number of restarts, and consequently the time taken to recover, by making these values customizable, and maybe even dynamically calculated
    • The new ratio can either be defined as a static configuration in the VPA resource, or be computed dynamically from the samples of the last N minutes/samples before the OOMKill (with very high weight on the most recent samples) to especially handle spikes. The dynamic method ensures that the VPA CRD remains untouched, but will require some simulations and tests to come up with a good algorithm
    • Unit tests
  • In cases of container errors/restarts, do not recommend values lower than last utilization
    • To avoid resource starvation for the container once it recovers from error
    • For instance, recommendation = max(recommendation, last utilization or max(last N utilization snapshots)) (see the windowed-floor sketch after this list)
    • Or maybe also wait for N minutes/hours before deciding to recommend lower values
    • This behaviour can become better with predictive scaling, as discussed later
  • Incorporate CPU throttling information, which is currently ignored
    • The VPA recommender already has special behaviour to observe and record OOMKills with RecordOOM, via the oom_observer package. A similar package, e.g. a cputhrottling_observer, needs to be built to handle CPU spike and throttling cases
    • The source for this metric must be configurable and optional, to retain backward compatibility
    • Unit tests
  • Make all/most hardcoded values configurable (via the VPA resource or at least as CLI flags)
  • Scale requests and limits separately
    • Proportional scaling works for most but not all cases
    • Find out in which cases it doesn’t work well (pod initialization, burst usage, etc.)
    • We don’t want ridiculously high limits like 80GB of memory, but we also don’t want values so small that pod initialization itself fails
    • For instance, limits recommendation = max or P99 of the limits observations in the last N samples or in a given time window (the windowed-floor sketch after this list illustrates the max/percentile computation)
    • Unit tests
  • Explore horizontal scaling using request rates
    • We can then use VPA and HPA separately, without the need for HVPA
    • This can be done only if VPA by itself is capable of handling all the special cases that HVPA can handle, and if it is as configurable as HVPA
    • Unit tests
  • Individual pod scaling instead of entire controller resources (deployment/statefulset/replicaset)
    • Only scale the pods that are being stressed
    • With in-place resource scaling, can also be applied to container-level scaling
    • Unit tests
  • Predictive scaling of resources using machine learning or deep learning
    • Allows VPA Recommender to be proactive instead of reactive
    • Resource-specific models would understand patterns better and provide better scaling recommendations
    • For instance, the daily hibernation schedule of Gardener shoots puts high load on the clusters' control planes at a fixed time every day. With predictive scaling, VPA would be able to scale up the resources of these components a few minutes before they get stressed, so that potential downtime can be avoided
    • Tests
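
A minimal sketch (in Go, with simplified types that are not the actual VPA recommender code paths) of the TotalSamplesCount item above: a hypothetical aggregate state that counts memory samples as well as CPU samples, so a recommendation can be produced even when only memory samples were recorded (e.g. during an OOM):

```go
// Sketch only: a trimmed-down, hypothetical aggregate container state.
package main

import "fmt"

type ResourceName string

const (
	ResourceCPU    ResourceName = "cpu"
	ResourceMemory ResourceName = "memory"
)

type AggregateState struct {
	TotalSamplesCount int
	cpuSamples        []float64 // cores
	memorySamples     []float64 // bytes
}

// AddSample records a usage sample. The point of the sketch: increment
// TotalSamplesCount for memory samples too, not only for CPU samples.
func (s *AggregateState) AddSample(resource ResourceName, value float64) {
	switch resource {
	case ResourceCPU:
		s.cpuSamples = append(s.cpuSamples, value)
		s.TotalSamplesCount++
	case ResourceMemory:
		s.memorySamples = append(s.memorySamples, value)
		s.TotalSamplesCount++ // previously missing: memory-only samples kept the count at 0
	}
}

// HasRecommendation mirrors the "TotalSamplesCount > 0" gate described above.
func (s *AggregateState) HasRecommendation() bool {
	return s.TotalSamplesCount > 0
}

func main() {
	state := &AggregateState{}
	// OOM scenario: only a memory sample is recorded, CPU usage stays zero.
	state.AddSample(ResourceMemory, 512*1024*1024)
	fmt.Println("can recommend:", state.HasRecommendation()) // true with the fix
}
```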
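A sketch of the configurable OOMBumpUpRatio/OOMMinBumpUp item above; the flag names, defaults, and exact bump semantics here are illustrative assumptions, not the upstream VPA flags:

```go
// Sketch only: expose the OOM bump-up behaviour via CLI flags instead of
// hardcoded constants.
package main

import (
	"flag"
	"fmt"
)

var (
	oomBumpUpRatio = flag.Float64("oom-bump-up-ratio", 1.2,
		"Factor by which memory is multiplied after an OOMKill.")
	oomMinBumpUp = flag.Float64("oom-min-bump-up-bytes", 100*1024*1024,
		"Minimum absolute memory increase (bytes) after an OOMKill.")
)

// bumpedMemory returns the memory to record for an OOM sample: at least
// ratio * usage, and at least usage + minBump, whichever is larger.
func bumpedMemory(memoryUsedBytes float64) float64 {
	byRatio := memoryUsedBytes * *oomBumpUpRatio
	byMinBump := memoryUsedBytes + *oomMinBumpUp
	if byRatio > byMinBump {
		return byRatio
	}
	return byMinBump
}

func main() {
	flag.Parse()
	fmt.Printf("OOM at 200Mi -> record %.0f bytes\n", bumpedMemory(200*1024*1024))
}
```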
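A sketch of the "never recommend lower than recent utilization" and "max or P99 over the last N samples" ideas above; the helper name and percentile handling are hypothetical:

```go
// Sketch only: clamp a candidate recommendation to a floor derived from the
// last N usage snapshots (max or a high percentile such as P99).
package main

import (
	"fmt"
	"sort"
)

// floorByRecentUsage never returns less than the chosen aggregate of the
// recent usage snapshots; percentile = 1.0 selects the maximum.
func floorByRecentUsage(candidate float64, recentUsage []float64, percentile float64) float64 {
	if len(recentUsage) == 0 {
		return candidate
	}
	sorted := append([]float64(nil), recentUsage...)
	sort.Float64s(sorted)
	idx := int(percentile*float64(len(sorted))) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	floor := sorted[idx]
	if candidate < floor {
		return floor
	}
	return candidate
}

func main() {
	lastN := []float64{0.2, 0.25, 0.9, 0.3} // cores used in the last N snapshots
	fmt.Println(floorByRecentUsage(0.25, lastN, 1.0))  // 0.9: never below the recent max
	fmt.Println(floorByRecentUsage(0.25, lastN, 0.75)) // 0.3 with a P75 floor
}
```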
vlerenc added the area/robustness, kind/enhancement, and area/auto-scaling labels on May 27, 2020
@hardikdr (Member) commented Jun 1, 2020

Hi Vedran, is this issue really meant for the cluster-autoscaler? Or does it perhaps fit better in hvpa?

@amshuman-kr commented:

@hardikdr This is really for VPA. But there is no good place for filing issues for VPA. I think @vlerenc created it here because this is a fork of autoscaler and that technically includes VPA (though really we are forking only cluster-autoscaler).

It is currently open whether we go for a custom VPA recommender of our own or enhance the existing VPA recommender.

@gardener-robot commented:

It is meant for VPA as in https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler, which we don't have. ;-) I wrote in the Slack channel that I don't know where to open this one. @amshuman-kr was part of this discussion in Slack. HVPA doesn't fit, I think.

@hardikdr (Member) commented Jun 1, 2020

I see, thanks @amshuman-kr and @gardener-robot for the clarification .. :D

@vlerenc (Member, Author) commented Jun 1, 2020

Yeah, I am sometimes incognito. ;-)

gardener-robot added the lifecycle/stale label on Aug 1, 2020
@amshuman-kr commented:

/assign @kallurbsk

@amshuman-kr commented:

Somehow "/assign" didn't work and I do not have rights to assign directly either. @prashanth26 @hardikdr @AxiomSamarth could you please help?

@amshuman-kr commented:

Adding a brief description of goals here because I don't have the rights to edit the issue description.

Goals

  • Pluggable in place of VPA recommender.
  • React better to CPU and memory usage spikes (especially, avoid recommending lower than resource usage)
  • React better to crashloops and possibly achieve automatic recovery if the crashes are triggered by lack of resources
  • Scale down faster if usage is consistently lower for some period of time (for the time being configurable globally via CLI flags, because we are re-using the VPA CRD)
  • Optionally support recommendations in annotations, so that this recommender can be used in parallel with the VPA recommender and compared, by enhancing vpa-exporter to additionally export the recommendations from the annotations.

@hardikdr (Member) commented Sep 21, 2020

@kallurbsk needs to be added in the gardener org first, before assigning the issues to him.

@timebertt (Member) commented:

Will this be contributed upstream in VPA or will we maintain another fork / gardener specific implementation?

In a discussion w/ @rfranzke @vpnachev and some others, we were wondering if it would already help to make OOMBumpUpRatio and OOMMinBumpUp configurable.
Are you planning on doing so? Or is this not reasonable?

@amshuman-kr commented:

Will this be contributed upstream in VPA or will we maintain another fork / gardener specific implementation?

I am leaning towards our own implementation.

  • Considering that CLI flags for VPA recommender aggregations config. kubernetes/autoscaler#2905 (needed to reduce costs in our seeds) has still not been released since May 2020, I think we have no option but to go for a fork at least.
  • But in fact, I have a much simpler from-scratch implementation in mind (instead of a fork), with a bias towards the present rather than the history and with a stabilization period for scale down (configurable via CLI flag, for the time being), which can make it work both for stable load and for spikes.

@amshuman-kr commented:

/assign @kallurbsk

gardener-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 6, 2020
@vlerenc (Member, Author) commented Jan 27, 2021

I think the link to the document (WiP at the time of this writing) has not been shared yet: https://github.com/kallurbsk/autoscaler/blob/master/vertical-pod-autoscaler/docs/proposals/crash_loop_handling/README.md

@vlerenc (Member, Author) commented Jan 28, 2021

@kallurbsk @amshuman-kr Some thoughts after reading the proposal:

  • If we double, won't we possibly start seeing flapping behaviour?
  • It would be good to run a test/simulation of how the new recommender logic would fare in comparison to the current one on some real recorded issue(s): does it give more resources/the necessary resources when needed? Does it use more, or potentially even less, resources overall, because of the more aggressive scale-up resp. faster scale-down?
  • The idea of using no_of_crashes to determine when to double CPU and memory is an interesting/pragmatic one. Nice. How is the counter reset?
  • Would it be possible to put the configuration knobs and switches into the VPA resources (instead of, or in addition to (as overrides), the global (then default) settings) to drive the behaviour of the recommender more individually (e.g. as annotations, if nothing else/for the time being, until maybe the autoscaler community shows interest in adopting this recommender)?

@kallurbsk commented:

Thanks for your comments, @vlerenc, and for taking the pain to read through the “not so clean” commit.
To quickly address the 2nd point: I will be submitting a simulation that we are running manually of how the new recommender would handle a CrashLoopBackOff caused by peaking resource contention. I will also address the other comments once I upload the latest version from my side within a day.

@amshuman-kr commented Jan 28, 2021

If we double, won't we start seeing possibly a flapping behaviour?

It is really a trade-off between cost saving (by the recommendation trying to closely hug the usage curve) and reducing scaling-based disturbance. If we keep the scale-down window short (less than 1 or 2 hours), we save cost but risk multiple scalings during the day if there are multiple load spikes. If we keep it longer, we reduce disturbance but over-provision.

I think 1 or 2 hours is a good heuristic to start with. If there are multiple spikes during the day that last less than an hour, can we really afford to over-provision for the spikes even during long periods of low load?

These are just mental heuristics. We clearly have to learn which time windows make sense, of course.

  • It would be good to run a test/simulation of how the new recommender logic would fare in comparison to the current one on some real recorded issue(s) (is it giving more resources/the necessary resources when needed?; is it using more or potentially even less resources, because of the more aggressive scale-up resp. faster scale-down?)

The first deliverable, in my mind, is that we should deploy the new custom recommender in a read-only mode in parallel to the upstream VPA recommender, i.e. the custom recommender does not update the recommendation in the VPA status but in an annotation on the VPA resource, which is then exported as a different metric by the vpa-exporter (also scraped and retained by Prometheus). This way we can compare the recommendations and assess if anything needs to be improved. Once we are confident in the recommendations, we deploy it in a more active way by updating the recommendation in the VPA status (and disabling the upstream VPA recommender).
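
A sketch of this read-only/parallel mode, assuming a hypothetical annotation key and simplified types (not the VPA API):

```go
// Sketch only: write the custom recommendation into an annotation instead of
// the VPA status, so the upstream recommender stays in charge and an exporter
// can expose the annotation as a separate metric for comparison.
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical annotation key, not part of any existing API.
const recommendationAnnotation = "autoscaling.gardener.cloud/experimental-recommendation"

// ContainerRecommendation is a simplified stand-in for the VPA recommendation type.
type ContainerRecommendation struct {
	ContainerName string `json:"containerName"`
	CPU           string `json:"cpu"`
	Memory        string `json:"memory"`
}

// annotateRecommendation serializes the recommendations into the annotations
// map instead of touching the status.
func annotateRecommendation(annotations map[string]string, recs []ContainerRecommendation) error {
	raw, err := json.Marshal(recs)
	if err != nil {
		return err
	}
	annotations[recommendationAnnotation] = string(raw)
	return nil
}

func main() {
	annotations := map[string]string{}
	recs := []ContainerRecommendation{{ContainerName: "gardenlet", CPU: "500m", Memory: "1Gi"}}
	if err := annotateRecommendation(annotations, recs); err != nil {
		panic(err)
	}
	fmt.Println(annotations[recommendationAnnotation])
}
```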

@kallurbsk Can you please make this point (parallel deployment and assessment) explicit in the proposal?

The idea with no_of_crashes to determine when to double CPU and memory is an interesting idea/pragmatic one. Nice. How is the counter reset?

@kallurbsk Can you please update the proposal with details about the criteria to increment as well as to reset these counters? Also about what happens to this state if the recommender restarts in the meantime (it is ok to say that such state is lost if the recommender restarts :-)), but being explicit helps.

Would it be possible to put the configuration knobs and switches (instead or in addition (overrides) to the global (then default) settings) into the VPA resources to drive the behaviour of the recommender more individually (e.g. as annotations if nothing else/for the time being and until maybe the autoscaler community shows interest in adopting this recommender)?

Configurability at the VPA resource level is very much desirable. For the time being, the only way we can do it is by using annotations, which is ugly but workable. Another alternative is to continue to stick to global flags for now and use HVPA to control the scale-down update policy, like we do for etcd right now.

@amshuman-kr commented:

If we double, won't we start seeing possibly a flapping behaviour?

In general, I think there are two parts to the overall auto-scaling problem.

  1. How good is the recommendation in terms of tracking usage and auto-healing.
  2. How to avoid disturbance due to too frequent scaling for workloads that are sensitive to restarts.

I think 1 is better addressed in this proposal and 2 is better addressed in HVPA (in fact, it already is in v2). Mixing the two in one solution may not be a good thing.

If we agree on this, it is also worth stating explicitly in the proposal.

@kallurbsk commented Jan 28, 2021

Also adding to the 1st point that Amshu mentioned: while scale-up is primarily based on current usage in the newer approach, scale-down still has a time-window parameter called scale_down_monitor_time_window. It can be defaulted to 1 hour, meaning that scale-down kicks in only based on the local maxima of the previous one hour. It is also configurable, which makes it possible to set a different scale-down window for applications based on their resource-consumption pattern (a small sketch of this gate follows below).
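
A sketch of such a scale-down gate, assuming hypothetical data structures around scale_down_monitor_time_window:

```go
// Sketch only: scale-down is allowed only down to the local maximum of the
// usage observed within the configurable monitor window.
package main

import (
	"fmt"
	"time"
)

type usageSample struct {
	at    time.Time
	value float64 // e.g. CPU cores or memory bytes
}

// scaleDownTarget returns the lowest value we are allowed to scale down to:
// the maximum usage seen within the monitor window ending at 'now'.
func scaleDownTarget(samples []usageSample, now time.Time, window time.Duration) float64 {
	cutoff := now.Add(-window)
	localMax := 0.0
	for _, s := range samples {
		if s.at.Before(cutoff) {
			continue // outside the monitor window
		}
		if s.value > localMax {
			localMax = s.value
		}
	}
	return localMax
}

func main() {
	now := time.Now()
	samples := []usageSample{
		{now.Add(-90 * time.Minute), 1.8}, // old spike, outside a 1h window
		{now.Add(-40 * time.Minute), 0.9},
		{now.Add(-5 * time.Minute), 0.4},
	}
	window := time.Hour // scale_down_monitor_time_window, defaulted to 1 hour
	fmt.Printf("do not scale below %.1f cores\n", scaleDownTarget(samples, now, window))
}
```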

@kallurbsk Can you please make this point (parallel deployment and assessment) explicit in the proposal?

Sure @amshuman-kr will cover that.

@kallurbsk Can you please update the proposal with details about the criteria to increment as well as reset these counters? Also about what happens to this state if the recommender restarts in the mean time (it is ok to say that such state is lost of the recommender restarts :-)) but being explicit helps.

Agreed. As of now the intention is to keep incrementing it until the VPA issues a "double increment on both CPU and memory resources". Also, this state is subject to restarts of the VPA recommender itself; if it restarts, we need to reset the counter to its default value. I will add this point to the document too.

@vlerenc (Member, Author) commented Jan 28, 2021

Thanks @kallurbsk and @amshuman-kr. Yes, it's a trade-off, and I forgot about the recommender in read-only mode. When we pull out the data, we might get some indication of whether the recommendations would have helped. Since they were not applied, though, it will be difficult to assess whether they would have helped in the hours after as well (and what else would have happened, had we applied them...). Still, it doesn't have to be perfect/fully simulated; having some data to challenge the assumptions and later the implementation may help us, otherwise tuning may become difficult. Thanks!

@amshuman-kr commented:

Since they were not applied though, it will be difficult to assess whether they would have helped in the hours after, also (and what else would have happened, had we applied them...).

Yes. It may not be perfect data, but I think we can still learn many things. For example,

  1. If the new recommendations would have helped auto-heal from a crashloop.
  2. If the new recommendations lead to more or less frequent scaling.
  3. If the new recommendations track the usage curve more or less closely.

@kallurbsk commented Feb 17, 2021

Tasks split for new VPA Recommender

  • Define proposal document for new VPA recommender - Proposal

  • First level code skeleton to design new VPA recommender - Code

  • v0 version complete of the new VPA recommender code. - Code

  • Unit testing and code coverage to cover all new methods written as a part of VPA recommender
    Code
    UT

  • Code refactoring based on the review comments (This will be an ongoing item but I will be mentioning this item here so that it can be tracked)

  • Including component level threshold parameters within VPA CRD annotations in the code.

  • Test v0 version of new VPA recommender in developer landscape shoot cluster with progrium/stress
    Manual Test Results and Analysis

  • Test v0 version of new VPA recommender along with existing VPA recommender code in suggest only mode in developer landscape. - issue2

  • Integration test suite for automation of manual functional tests completed in the above point. - issue3

  • Use new VPA recommender in write mode and observe changes in the behavior from older VPA recommender in developer landscape - issue4

  • Use new VPA recommender in read only mode along with existing VPA recommender in live landscape - issue5

  • Use new VPA recommender in write mode in live landscape

amshuman-kr added this to the 2021-Q1 milestone on Jun 7, 2021
amshuman-kr modified the milestones: 2021-Q1 → 2021-Q3 on Jun 9, 2021
shreyas-s-rao changed the title from "Improve CPU Spike Handling with Spike-Aware VPA Recommender" to "Improve VPA Recommender" on Oct 4, 2021
shreyas-s-rao modified the milestones: 2021-Q3 → 2022-Q1 on Oct 6, 2021
shreyas-s-rao changed the title from "Improve VPA Recommender" to "☂️ Improve VPA Recommender" on Oct 6, 2021
shreyas-s-rao changed the title from "☂️ Improve VPA Recommender" to "☂️ Improve VPA Recommendations" on Oct 6, 2021
@vlerenc (Member, Author) commented Mar 14, 2022

/close as we now have another ticket (internally) that lists the issues in another format.
cc @voelzmo

vlerenc removed this from the 2022-Q1 milestone on Mar 14, 2022
gardener-robot added the kind/roadmap label on Mar 23, 2023