Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addressing drawbacks of existing VPA recommender #93

Closed
4 of 10 tasks
kallurbsk opened this issue Aug 4, 2021 · 3 comments
Closed
4 of 10 tasks

Addressing drawbacks of existing VPA recommender #93

kallurbsk opened this issue Aug 4, 2021 · 3 comments
Labels
kind/enhancement Enhancement, improvement, extension kind/epic Large multi-story topic priority/2 Priority (lower number equals higher priority)
Milestone

Comments

@kallurbsk
Copy link

kallurbsk commented Aug 4, 2021

How to categorize this issue?
/kind epic
/priority 2

What would you like to be added:

  • Address CPU samples only count logic in the existing VPA recommender to count both CPU and memory samples.
  • Fix/Address the behaviour of using a hardcoded value of OOMBumpUpRatio and OOMMinBumpUp.
  • Incorporate CPU throttling information to the VPA recommender
  • Incorporate variables to handle configuration of parameters used in VPA recommendation on per component basis.

Why is this needed:

  • There is a TotalSamplesCount variable which is calculated every time a usage sample is added to VPA recommender. The VPA recommender however updates it only for addition of CPU samples. Memory samples are not considered. During an OOM scenario, the cpu usage sample remains zero. This leads to TotalSamplesCount being 0. Hence the VPA recommender even while records memory samples, does not give any recommendation as the code flow checks for TotalSamplesCount > 0 to give recommendation.
  • OOMBumpUpRatio and OOMMinBumpUp are defined as hardcoded constants today. While most of the times the VPA is able to help recover a pod from an OOMKill, the time it takes or number of restarts is sometimes a bit significant. If we make the 2 parameters user defined and customizable, the number of restarts or time it takes to recover can be reduced.
  • CPU throttling information similar to RecordOOM is missing in the VPA recommender today. This does not address the problem of recovery of a component using VPA if the CPU suddenly surges. A similar to RecordOOM should be added for CPU. This will ensure the CPU is surge is also addressed as a sample while counting the list of samples for a given resource, before determining the recommendation value.

Decisions made:

  • Raise issue/enhancement requests in both gardener/autoscaler and there by kubernetes/autoscaler repositories.
  • Write set of Unit Testcases proving the hypothesis mentioned in the sections above.
  • Address the changes incrementally from the set of points mentioned in sections above.
  • Create a separate branch in gardener/autoscaler which will mirror the kubernetes/autoscaler:master. As gardener/autoscaler has customization w.r.t MCM in cluster-autoscaler, it is better to keep them both separate.
  • All changes to the kubernetes/autoscaler upstream can be proposed by the developer.
  • Create 2 parallel tracks of operation
    • Generating builds regularly for the enhancements made in VPA recommender within gardener project.
    • Proposing the upstream changes for every change going in gardener/autoscaler w.r.t VPA. (This is a long poll activity)

Future work
The following parameters are either hardcoded or globally set across a VPA recommender instance. If we can make changes to set them on a per component level basis (ex: kube-apiserver, etcd, node-exporter etc,) it will result in giving custom recommendations on component levels. This can help address the changes specifically w.r.t a component and tune the parameters accordingly for better recommendation.
* OOMBumpUpRatio, OOMMinBumpUp, TargetPercentile, memoryAggregationInterval, memoryAggregationIntervalCount, memoryHistogramDecayHalfLife, cpuHistogramDecayHalfLife
* UT for adding the customization of components (reduced priority for now).

Tasks:

  • Creating a custom VPA changes specific branch on gardener/autoscaler to mirror latest kubernetes/autoscaler:master
    • Make sure all latest changes in kubernetes/autoscaler:master is pulled into this branch. This branch will become the baseline for all checkins going into kubernetes/autoscaler:master eventually.
  • Fixing TotalSamplesCount
    • UT for issue with addressing TotalSamplesCount
    • Merge the fix to gardener/autoscaler and generate a custom image
    • Merge the fix to kubernetes/autoscaler upstream master
  • Addressing hardcoded OOMBumpUpRatio and OOMMinBumpUp
    • UT for issue with addressing hardcoded OOMBumpUpRatio and OOMMinBumpUp
    • Merge the fix to gardener/autoscaler and generate a custom image
    • Merge the fix to kubernetes/autoscaler upstream master
  • Adding CPU Throttling information into VPA code
    • UT for addressing CPU throttling information similar to RecordOOM
    • Merge the fix to gardener/autoscaler and generate a custom image
    • Merge the fix to kubernetes/autoscaler upstream master

References:

  1. ☂️ Improve VPA Recommendations #47
  2. [VPA] issues with OOM events  kubernetes/autoscaler#4152
  3. VPA : memory recommedation is lower than the usage kubernetes/autoscaler#3961
  4. Added changes to support alternative recommender kubernetes/autoscaler#4131
  5. VPA: Accuracy of the estimation of the target value kubernetes/autoscaler#2389
  6. Added changes to support alternative recommender kubernetes/autoscaler#4131
  7. https://kubernetes.slack.com/archives/C09R1LV8S/p1627275915339100
@kallurbsk kallurbsk added the kind/enhancement Enhancement, improvement, extension label Aug 4, 2021
@gardener-robot gardener-robot added kind/epic Large multi-story topic priority/2 Priority (lower number equals higher priority) labels Aug 4, 2021
@ashwani2k
Copy link

credits to @voelzmo for your Epic gardener/gardener#4110 which gave a good insight on how to structure this.

@kallurbsk
Copy link
Author

@voelzmo : Thanks for having this format. Could we please make this like a standard format for all set of epics and tasks if not already followed?

kallurbsk added a commit to kallurbsk/autoscaler that referenced this issue Aug 18, 2021
Tests:
TestTotalSamplesCountMemorySamplesOnly
TestRecordOOMSampleAndCountTotalSamples
TestRecommendationForZeroTotalSamplesCount
TestAddMemorySurgeWithOOMBumpUpRatio
TestAddMemorySurgeWithOOMMinBump
TestRecordCPUThrottlingInformation

The above test cases currently fail indicating the problem
with to be fixed as mentioned in gardener#93. They validate the
change required in the behavior for VPA recommendation.
@ashwani2k ashwani2k modified the milestones: 2021-Q4, 2022-Q1 Sep 7, 2021
@shreyas-s-rao
Copy link

Closing this issue in favour of #47 which will now act as the epic for all VPA-related improvements/fixes going forward. Thanks @kallurbsk for penning down the list of tasks and adding unit tests as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Enhancement, improvement, extension kind/epic Large multi-story topic priority/2 Priority (lower number equals higher priority)
Projects
None yet
Development

No branches or pull requests

4 participants