Addressing drawbacks of existing VPA recommender #93

kallurbsk · 2021-08-04T08:44:48Z

How to categorize this issue?
/kind epic
/priority 2

What would you like to be added:

Address CPU samples only count logic in the existing VPA recommender to count both CPU and memory samples.
Fix/Address the behaviour of using a hardcoded value of OOMBumpUpRatio and OOMMinBumpUp.
Incorporate CPU throttling information to the VPA recommender
Incorporate variables to handle configuration of parameters used in VPA recommendation on per component basis.

Why is this needed:

There is a TotalSamplesCount variable which is calculated every time a usage sample is added to VPA recommender. The VPA recommender however updates it only for addition of CPU samples. Memory samples are not considered. During an OOM scenario, the cpu usage sample remains zero. This leads to TotalSamplesCount being 0. Hence the VPA recommender even while records memory samples, does not give any recommendation as the code flow checks for TotalSamplesCount > 0 to give recommendation.
OOMBumpUpRatio and OOMMinBumpUp are defined as hardcoded constants today. While most of the times the VPA is able to help recover a pod from an OOMKill, the time it takes or number of restarts is sometimes a bit significant. If we make the 2 parameters user defined and customizable, the number of restarts or time it takes to recover can be reduced.
CPU throttling information similar to RecordOOM is missing in the VPA recommender today. This does not address the problem of recovery of a component using VPA if the CPU suddenly surges. A similar to RecordOOM should be added for CPU. This will ensure the CPU is surge is also addressed as a sample while counting the list of samples for a given resource, before determining the recommendation value.

Decisions made:

Raise issue/enhancement requests in both gardener/autoscaler and there by kubernetes/autoscaler repositories.
Write set of Unit Testcases proving the hypothesis mentioned in the sections above.
Address the changes incrementally from the set of points mentioned in sections above.
Create a separate branch in gardener/autoscaler which will mirror the kubernetes/autoscaler:master. As gardener/autoscaler has customization w.r.t MCM in cluster-autoscaler, it is better to keep them both separate.
All changes to the kubernetes/autoscaler upstream can be proposed by the developer.
Create 2 parallel tracks of operation
- Generating builds regularly for the enhancements made in VPA recommender within gardener project.
- Proposing the upstream changes for every change going in gardener/autoscaler w.r.t VPA. (This is a long poll activity)

Future work
The following parameters are either hardcoded or globally set across a VPA recommender instance. If we can make changes to set them on a per component level basis (ex: kube-apiserver, etcd, node-exporter etc,) it will result in giving custom recommendations on component levels. This can help address the changes specifically w.r.t a component and tune the parameters accordingly for better recommendation.
* OOMBumpUpRatio, OOMMinBumpUp, TargetPercentile, memoryAggregationInterval, memoryAggregationIntervalCount, memoryHistogramDecayHalfLife, cpuHistogramDecayHalfLife
* UT for adding the customization of components (reduced priority for now).

Tasks:

Creating a custom VPA changes specific branch on gardener/autoscaler to mirror latest kubernetes/autoscaler:master
- Make sure all latest changes in kubernetes/autoscaler:master is pulled into this branch. This branch will become the baseline for all checkins going into kubernetes/autoscaler:master eventually.
Fixing TotalSamplesCount
- UT for issue with addressing TotalSamplesCount
- Merge the fix to gardener/autoscaler and generate a custom image
- Merge the fix to kubernetes/autoscaler upstream master
Addressing hardcoded OOMBumpUpRatio and OOMMinBumpUp
- UT for issue with addressing hardcoded OOMBumpUpRatio and OOMMinBumpUp
- Merge the fix to gardener/autoscaler and generate a custom image
- Merge the fix to kubernetes/autoscaler upstream master
Adding CPU Throttling information into VPA code
- UT for addressing CPU throttling information similar to RecordOOM
- Merge the fix to gardener/autoscaler and generate a custom image
- Merge the fix to kubernetes/autoscaler upstream master

References:

The text was updated successfully, but these errors were encountered:

ashwani2k · 2021-08-04T08:57:29Z

credits to @voelzmo for your Epic gardener/gardener#4110 which gave a good insight on how to structure this.

kallurbsk · 2021-08-04T08:58:56Z

@voelzmo : Thanks for having this format. Could we please make this like a standard format for all set of epics and tasks if not already followed?

Tests: TestTotalSamplesCountMemorySamplesOnly TestRecordOOMSampleAndCountTotalSamples TestRecommendationForZeroTotalSamplesCount TestAddMemorySurgeWithOOMBumpUpRatio TestAddMemorySurgeWithOOMMinBump TestRecordCPUThrottlingInformation The above test cases currently fail indicating the problem with to be fixed as mentioned in gardener#93. They validate the change required in the behavior for VPA recommendation.

shreyas-s-rao · 2021-10-04T14:57:12Z

Closing this issue in favour of #47 which will now act as the epic for all VPA-related improvements/fixes going forward. Thanks @kallurbsk for penning down the list of tasks and adding unit tests as well.

kallurbsk added the kind/enhancement Enhancement, improvement, extension label Aug 4, 2021

gardener-robot added kind/epic Large multi-story topic priority/2 Priority (lower number equals higher priority) labels Aug 4, 2021

kallurbsk mentioned this issue Aug 9, 2021

Pulling latest changes in upstream master to gardener/autoscaler:vpa_enhanced #94

Closed

kallurbsk mentioned this issue Aug 18, 2021

Adding VPA modifications tests for the issue #93 in autoscaler #96

Closed

ashwani2k modified the milestones: 2021-Q4, 2022-Q1 Sep 7, 2021

shreyas-s-rao closed this as completed Oct 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Addressing drawbacks of existing VPA recommender #93

Addressing drawbacks of existing VPA recommender #93

kallurbsk commented Aug 4, 2021 •

edited

Loading

ashwani2k commented Aug 4, 2021

kallurbsk commented Aug 4, 2021

shreyas-s-rao commented Oct 4, 2021

Addressing drawbacks of existing VPA recommender #93

Addressing drawbacks of existing VPA recommender #93

Comments

kallurbsk commented Aug 4, 2021 • edited Loading

ashwani2k commented Aug 4, 2021

kallurbsk commented Aug 4, 2021

shreyas-s-rao commented Oct 4, 2021

kallurbsk commented Aug 4, 2021 •

edited

Loading