☂️ Improve VPA Recommendations #47
Comments
Hi Vedran, is this issue really meant to be for the cluster-autoscaler? Or does it perhaps fit better at hvpa? |
@hardikdr This is really for VPA. But there is no good place for filing issues for VPA. I think @vlerenc created it here because this is a fork of autoscaler and that technically includes VPA (though really we are forking only cluster-autoscaler). It is currently open whether we go for a custom VPA recommender of our own or enhance the existing VPA recommender. |
It is meant for VPA as in https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler, which we don't have. ;-) I wrote in the Slack channel that I don't know where to open this one. @amshuman-kr was part of this discussion in Slack. HVPA doesn't fit, I think. |
I see, thanks @amshuman-kr and @gardener-robot for the clarification .. :D |
Yeah, I am sometimes incognito. ;-) |
/assign @kallurbsk |
Somehow "/assign" didn't work and I do not have rights to assign directly either. @prashanth26 @hardikdr @AxiomSamarth could you please help? |
Adding a brief description of goals here because I don't have the rights to edit the issue description.
Goals
|
@kallurbsk needs to be added to the gardener org first before the issue can be assigned to him. |
Will this be contributed upstream in VPA or will we maintain another fork / gardener specific implementation? In a discussion w/ @rfranzke @vpnachev and some others, we were wondering if it would already help to make |
I am leaning towards our own implementation.
|
/assign @kallurbsk |
I think the link to the document hasn't been shared yet (WIP at the time of this writing): https://github.com/kallurbsk/autoscaler/blob/master/vertical-pod-autoscaler/docs/proposals/crash_loop_handling/README.md |
@kallurbsk @amshuman-kr Some thoughts after reading the proposal:
|
Thanks @vlerenc for sharing your comments and for taking the pains to read through the “not so clean” commit. |
It is really a trade-off between cost saving (by the recommendation trying to closely hug the usage curve) and reducing scaling-based disturbance. If we keep the scale-down window short (less than 1 or 2 hours), we save cost but risk multiple scalings during the day if there are multiple load spikes. If we keep it longer, we reduce disturbance but over-provision. I think 1 or 2 hours is a good heuristic to start with. If there are multiple spikes during the day that last less than an hour, can we really afford to over-provision for the spikes even during long periods of low load? These are just mental heuristics. We clearly have to learn what time windows make sense, of course.
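To make the heuristic a bit more concrete, here is a minimal sketch (in Go, with made-up names like scaleDownGate; nothing here is from the proposal) of how a scale-down window could gate the recommendation:

```go
package recommender

import "time"

// scaleDownGate tracks, per target container, when usage last came close to
// the current request. Scale-down is only allowed once usage has stayed low
// for the whole configured window (e.g. 1-2 hours as discussed above).
type scaleDownGate struct {
	scaleDownWindow time.Duration
	lastHighUsage   map[string]time.Time // key: namespace/pod/container
}

func newScaleDownGate(window time.Duration) *scaleDownGate {
	return &scaleDownGate{scaleDownWindow: window, lastHighUsage: map[string]time.Time{}}
}

// observe records a usage sample; "high" means usage is close enough to the
// current request that scaling down would be risky.
func (g *scaleDownGate) observe(key string, usage, request float64, now time.Time) {
	if usage >= 0.8*request { // 0.8 is an arbitrary illustrative threshold
		g.lastHighUsage[key] = now
	}
}

// allowScaleDown returns true only if usage has stayed low for the full window.
func (g *scaleDownGate) allowScaleDown(key string, now time.Time) bool {
	last, seen := g.lastHighUsage[key]
	if !seen {
		return true // never observed high usage
	}
	return now.Sub(last) >= g.scaleDownWindow
}
```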
The first deliverable, in my mind, is that we should deploy the new custom recommender in a read-only mode in parallel to the upstream VPA recommender. I.e. the custom recommender doesn't update the recommendation in the VPA status but in an annotation on the VPA resource, which is then exported as a different metric by the vpa-exporter (also scraped and retained by Prometheus). This way we can compare the recommendations and assess if anything needs to be improved. Once we are confident with the recommendation, we deploy it in a more active way by updating the recommendation in the VPA status (and disabling the upstream VPA recommender). @kallurbsk Can you please make this point (parallel deployment and assessment) explicit in the proposal?
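For illustration, a rough sketch of how such a read-only mode could publish its result via an annotation (the annotation key, the helper name, and the use of the dynamic client are assumptions, not part of the proposal; it assumes a recent client-go where Patch takes a context):

```go
package recommender

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// Hypothetical annotation under which the custom recommender publishes its
// recommendation in read-only mode; vpa-exporter would export it as a
// separate metric for comparison with the upstream recommendation.
const shadowRecommendationAnnotation = "autoscaling.gardener.cloud/shadow-recommendation"

var vpaGVR = schema.GroupVersionResource{
	Group:    "autoscaling.k8s.io",
	Version:  "v1",
	Resource: "verticalpodautoscalers",
}

// publishShadowRecommendation stores the custom recommendation as an
// annotation on the VPA object instead of updating .status.recommendation.
func publishShadowRecommendation(ctx context.Context, client dynamic.Interface,
	namespace, name string, recommendation interface{}) error {

	raw, err := json.Marshal(recommendation)
	if err != nil {
		return err
	}
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				shadowRecommendationAnnotation: string(raw),
			},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = client.Resource(vpaGVR).Namespace(namespace).
		Patch(ctx, name, types.MergePatchType, data, metav1.PatchOptions{})
	if err != nil {
		return fmt.Errorf("patching VPA %s/%s: %w", namespace, name, err)
	}
	return nil
}
```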
@kallurbsk Can you please update the proposal with details about the criteria to increment as well as reset these counters? Also about what happens to this state if the recommender restarts in the meantime (it is ok to say that such state is lost if the recommender restarts :-)), but being explicit helps.
Configurability on the VPA resource level is very much desirable. For the time being, the only way we can do it is by using annotations which is ugly but workable. Another alternative is to continue to stick to global flags for now and use HVPA to control scale down update policy like we do for etcd right now. |
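A small sketch of how such an annotation-based, per-VPA override could fall back to a global flag (the annotation key and helper are made up for illustration):

```go
package recommender

import "time"

// Hypothetical annotation allowing a single VPA object to override the
// recommender-wide scale-down window flag.
const scaleDownWindowAnnotation = "autoscaling.gardener.cloud/scale-down-window"

// effectiveScaleDownWindow returns the per-VPA override if present and valid,
// otherwise the global default configured via a recommender flag.
func effectiveScaleDownWindow(annotations map[string]string, globalDefault time.Duration) time.Duration {
	if raw, ok := annotations[scaleDownWindowAnnotation]; ok {
		if d, err := time.ParseDuration(raw); err == nil && d > 0 {
			return d
		}
	}
	return globalDefault
}
```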
In general, I think there are two parts to the overall auto-scaling problem.
I think 1 is better addressed in this proposal and 2 is better addressed in HVPA (in fact, already done so in v2). Mixing these two in one solution may not be a good thing. If we agree on this, this is also worth putting explicitly in the proposal. |
Also adding to the 1st point that Amshu mentioned, while scale up is primarily based on current usage in the newer approach, scale down still has a parameter for time window called
Sure @amshuman-kr will cover that.
Agreed. As of now the intention is to keep it incrementing till the VPA issues a "double increment on both CPU and memory resource". Also, this state is subject to restarts of the VPA recommender itself; if it restarts, we need to reset this to the default value. I will add this point to the document too. |
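For illustration only (the exact criteria are still to be spelled out in the document), the counter could be kept purely in memory along these lines:

```go
package recommender

import "sync"

// scalingCounter counts consecutive scale-ups per VPA target. It is purely
// in-memory, so the state is lost (and effectively reset) whenever the
// recommender restarts, as discussed above.
type scalingCounter struct {
	mu     sync.Mutex
	counts map[string]int // key: namespace/vpa-name
}

func newScalingCounter() *scalingCounter {
	return &scalingCounter{counts: map[string]int{}}
}

// recordScaleUp increments the counter for a target. If this scale-up doubled
// both the CPU and the memory recommendation, the counter is reset, matching
// the "double increment on both CPU and memory" criterion mentioned above.
func (c *scalingCounter) recordScaleUp(key string, cpuDoubled, memoryDoubled bool) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cpuDoubled && memoryDoubled {
		c.counts[key] = 0
		return 0
	}
	c.counts[key]++
	return c.counts[key]
}
```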
Thanks @kallurbsk and @amshuman-kr. Yes, it's a trade-off and I forgot about the recommender in read-only mode. When we pull out the data, we might get some indication of whether the recommendations would have helped. Since they were not applied though, it will be difficult to assess whether they would have helped in the hours after (and what else would have happened, had we applied them...). Still, it doesn't have to be perfect/fully simulated, but having some data to challenge the assumptions and the later implementation may help us; otherwise tuning may become difficult. Thanks! |
Yes. It may not be perfect data but I think we can still learn many things. For example,
|
Tasks split for new VPA Recommender
|
/close as we now have another ticket (internally) that lists the issues in another format.
What would you like to be added:
We would need better VPA recommendations for various cases of CPU and memory usage.
Why is this needed:
We see frequent outages of components such as the gardenlet when the load increases suddenly (spikes) and VPA only looks at the history instead of the actual usage.
Discussion:
See: https://sap-cp.slack.com/archives/GBVUBHM5K/p1590564020106000?thread_ts=1590540922.098400&cid=GBVUBHM5K
Roadmap
- OOMBumpUpRatio and OOMMinBumpUp values, which are currently hardcoded
- oom_observer package. A similar package like cputhrottling_observer needs to be built to handle CPU spike and throttling cases (a rough sketch follows below)
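For illustration, a very rough sketch of what such a CPU throttling observer could look like; the types, the names, and the metrics it would consume are assumptions, not existing VPA code:

```go
package cputhrottling

import "time"

// ThrottlingEvent describes a period in which a container was heavily CPU
// throttled; it would be derived from cAdvisor/cgroup throttling metrics
// (e.g. CFS throttled periods) by some metrics client.
type ThrottlingEvent struct {
	Namespace     string
	Pod           string
	Container     string
	ObservedAt    time.Time
	ThrottledRate float64 // fraction of CFS periods that were throttled
}

// Observer receives throttling events and forwards significant ones to the
// recommender so that a CPU spike can bump the CPU recommendation, similar
// in spirit to how the OOM observer bumps memory after an OOM kill.
type Observer struct {
	// Threshold above which throttling is considered severe enough to act on.
	Threshold float64
	// Events is consumed by the recommender's input loop.
	Events chan ThrottlingEvent
}

// OnSample filters incoming samples and emits only severe throttling events.
func (o *Observer) OnSample(ev ThrottlingEvent) {
	if ev.ThrottledRate >= o.Threshold {
		select {
		case o.Events <- ev:
		default: // drop if the consumer is lagging; illustrative only
		}
	}
}
```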