
VPA - Pods do not get updated requests/limits #2633

Closed

wyb1 opened this issue Dec 10, 2019 · 8 comments

Comments

wyb1 (Contributor) commented Dec 10, 2019

I've noticed that sometimes a cluster gets into a state where the vpa-updater is constantly evicting pods because their requests/limits are never updated. According to the logs of the vpa-admission-controller, the pods should be updated:

recommendation_provider.go:113] updating requirements for pod pod-xyz.

However, the pod is not updated with the new requests. Deleting the vpa-admission-controller solves the problem, presumably because the vpa-webhook endpoint is reconciled.

Is this behavior known, and can it be prevented? I'm running version 0.6.3.
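
For reference, one way to tell whether pod-create requests are even reaching the admission controller is to check that the webhook registration and its Service endpoints still exist. A minimal client-go sketch; the names "vpa-webhook-config" and "vpa-webhook" in kube-system are what a typical VPA install uses and may need adjusting for other setups:

```go
// Minimal diagnostic sketch using client-go. The object names are the ones
// a typical VPA install uses; adjust if your deployment differs.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Is the mutating webhook still registered?
	if _, err := client.AdmissionregistrationV1().MutatingWebhookConfigurations().
		Get(ctx, "vpa-webhook-config", metav1.GetOptions{}); err != nil {
		fmt.Println("webhook configuration problem:", err)
	}

	// Does the webhook Service have any endpoints, i.e. is the admission
	// controller actually reachable from the API server?
	ep, err := client.CoreV1().Endpoints("kube-system").Get(ctx, "vpa-webhook", metav1.GetOptions{})
	if err != nil {
		fmt.Println("webhook service problem:", err)
	} else if len(ep.Subsets) == 0 {
		fmt.Println("vpa-webhook service has no endpoints")
	}
}
```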

bskiba (Member) commented Dec 27, 2019

I have not seen this behaviour reported yet, though there is currently no mechanism to guard against repetitive evictions by the updater if the admission controller is not applying recommendations.

If the admission controller applies recommendations to pods without reporting errors, but the created pods end up with different requests, I would expect that another webhook is overriding the changes made by VPA. Is that a possible explanation in your setup? The order in which mutating webhooks are installed is unfortunately very important.
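
If another webhook is suspected, a rough way to see what else could be rewriting pod resources is to list every registered mutating webhook whose rules cover pods. A sketch assuming client-go, with the clientset construction omitted (same as in the earlier snippet):

```go
// Rough sketch, assuming client-go. Prints every mutating webhook whose
// rules cover pods.
package diag

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listPodMutators(ctx context.Context, client kubernetes.Interface) error {
	configs, err := client.AdmissionregistrationV1().MutatingWebhookConfigurations().
		List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, whc := range configs.Items {
		for _, wh := range whc.Webhooks {
			for _, rule := range wh.Rules {
				for _, res := range rule.Resources {
					if res == "pods" || res == "*" {
						// Anything listed here besides the VPA webhook is a
						// candidate for overriding the patched resources.
						fmt.Printf("%s / %s matches pods\n", whc.Name, wh.Name)
					}
				}
			}
		}
	}
	return nil
}
```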

wyb1 (Contributor, Author) commented Jan 7, 2020

I don't believe another webhook interfered, because we have many clusters with the same configuration and the problem has only occurred a few times, on different clusters. Unfortunately, since this is a rare occurrence, I have not been able to reproduce it.

Are there any plans to implement a safety mechanism that prevents repetitive evictions if the recommendations are not applied?

bskiba (Member) commented Jan 7, 2020

Hm, tricky. When this happens, is it a permanent state until you redeploy the admission controller? If you run into this again, it would be really helpful to try and root-cause it.

Yeah, implementing the safety mechanism is definitely on the list, though we are not working on it right now. Let me file an issue.

wyb1 (Contributor, Author) commented Jan 13, 2020

Thanks for opening another issue for the safety mechanism.

> When this happens, is it a permanent state until you redeploy the admission controller?

Yes, the resources in the pod spec cannot be changed via the webhook until the vpa-admission-controller is restarted. If this happens again, I will try to find the root cause.

wyb1 (Contributor, Author) commented Jan 14, 2020

@bskiba the issue came up again on a different cluster, and this time I was able to find the root cause: somehow the mutating webhook got deleted (presumably by a person). Maybe it would make sense for the admission controller to check whether the webhook exists and, if it doesn't, recreate it.
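
This is not how the VPA admission controller currently behaves; just a sketch of the suggestion, assuming client-go and a `desired` configuration that the controller would register at startup anyway:

```go
// Sketch only, assuming client-go; "desired" is whatever configuration the
// admission controller would register at startup.
package admission

import (
	"context"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureWebhook re-creates the MutatingWebhookConfiguration if it has been
// deleted, so a manual deletion does not silently disable pod mutation.
// It could be called periodically from the admission controller.
func ensureWebhook(ctx context.Context, client kubernetes.Interface,
	desired *admissionregistrationv1.MutatingWebhookConfiguration) error {
	_, err := client.AdmissionregistrationV1().MutatingWebhookConfigurations().
		Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// The registration is gone (e.g. deleted by a person); put it back.
		_, err = client.AdmissionregistrationV1().MutatingWebhookConfigurations().
			Create(ctx, desired, metav1.CreateOptions{})
	}
	return err
}
```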

bskiba (Member) commented Jan 14, 2020

Thanks for root-causing this. My gut feeling is that reinstalling the webhook every time it is deleted might not be desirable, but I would like to include this case in the safety mechanism, so that when it happens the infinite eviction loop does not occur (i.e. the updater stops evicting pods). WDYT?
/cc @krzysied FYI
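
One possible shape for that guard, purely illustrative and not the actual implementation (the annotation key is made up): the admission controller stamps an annotation on every pod it patches, and the updater refuses to evict pods that lack it, so a missing or broken webhook stops the eviction loop instead of fueling it.

```go
// Illustrative only; the annotation key is hypothetical.
package updater

import corev1 "k8s.io/api/core/v1"

// Hypothetical annotation written by the admission controller whenever it
// applies a recommendation to a pod.
const processedAnnotation = "vpa.example.com/resources-patched"

// safeToEvict returns false when there is no evidence that the admission
// controller is mutating pods; evicting such a pod would only recreate it
// with the same stale requests and restart the loop.
func safeToEvict(pod *corev1.Pod) bool {
	_, ok := pod.Annotations[processedAnnotation]
	return ok
}
```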

bskiba (Member) commented Jan 17, 2020

Since the root cause has been handled, I'd like to close this in favour of #2738.
/close

k8s-ci-robot (Contributor) commented

@bskiba: Closing this issue.

In response to this:

> Since the root cause has been handled, I'd like to close this in favour of #2738.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
