
VPA: Protect against runaway pods #2359

Closed · johanneswuerbach opened this issue Sep 18, 2019 · 8 comments

Labels: lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.)

@johanneswuerbach (Contributor)

After #2049, VPA adjusts the specified limit proportionally to the requested resources.

This becomes problematic when the recommended resources get fairly low. E.g. we ran a StatefulSet with limit: 1000m and request: 500m; due to low usage, the recommended target was eventually lowered to 25m, with a proportional limit of 50m. While this works under normal load, any spike in consumption now means that the pod is immediately throttled, the health check fails and the pod is recreated, yet this never seems to cause an increase of the recommended value.
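To make the arithmetic concrete, a minimal sketch of the container resources before and after such an update; VPA preserves the 2:1 limit-to-request ratio, so the limit shrinks along with the request:

```yaml
# Before: limit is 2x the request.
resources:
  requests:
    cpu: 500m
  limits:
    cpu: 1000m

# After a 25m recommendation: the 2:1 ratio is kept,
# leaving almost no headroom for spikes before throttling.
resources:
  requests:
    cpu: 25m
  limits:
    cpu: 50m
```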

I would see three ways to prevent this:

  1. set a min target for each VPA resource, which seems a bit like anti-pattern as the magic of VPA is that we have to worry less about resources.
  2. remove the limits & requests from the statefulset, which works, but also means a runaway pod could affect the entire node as CPU is no longer throttled.
  3. modify VPA to set the MaxAllowed as pod limits if no other limit is specified, which would allow sudden resource changes in certain boundaries

Overall I like 3) the most, would you accept a PR implementing this change? Is there anything I'm missing?
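A rough sketch of what 3) could look like on the VPA object. The maxAllowed bound already exists in resourcePolicy; the proposed behaviour (hypothetical, not implemented) is that it would also be written as the container limit whenever the pod spec defines none. Object names and resource values are placeholders:

```yaml
apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: example-statefulset
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        # Proposal 3): when the container spec sets no limit,
        # use this bound as the limit instead of scaling the
        # limit proportionally to the request.
        maxAllowed:
          cpu: 1000m
          memory: 1Gi
```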

@johanneswuerbach (Contributor, Author)

/assign bskiba

@bskiba (Member) commented Sep 20, 2019

3 seems a bit problematic, as it means that VPA would effectively start changing the pod's QoS class, which we are explicitly trying to avoid.

My recommendation is a variation of 2: remove the CPU resource limit on the pod.
In general we don't recommend setting limits on CPU (except perhaps in very specific cases), since with CPU a pod will run away only if there are free CPU resources on the node. If there are other pods requesting CPU on the same node, the pod will be throttled to what it actually requested, so it will never cannibalize the node's resources.
With the limit gone, the pod will be throttled only if the node is very tightly packed; but if the pod has any free room to burst over its request, VPA should notice and bump the resource request up accordingly.

We do recommend limits on memory, but with memory you will get out-of-memory errors in the situation you described, which will cause VPA to bump the memory recommendation up.
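Applied to the container spec, that recommendation would look roughly like this minimal sketch (CPU value from the original report; memory values are placeholders):

```yaml
resources:
  requests:
    cpu: 500m        # still managed by VPA
    memory: 256Mi
  limits:
    # No CPU limit: the pod may burst into free node capacity,
    # while CFS shares still guarantee other pods their requests.
    memory: 512Mi    # kept, so a leak OOM-kills the pod, not the node
```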

@phsiao (Contributor) commented Sep 20, 2019

I have been trying to understand the implications of letting VPA manage limits, and I have come to a similar conclusion: since CPU is "elastic" and cgroups provide effective isolation, removing the CPU limit so it can't be changed in a surprising way seems to make sense. On the other hand, runaway memory usage can cause the entire system to OOM, trigger an expensive reaping process, and cause other issues, so limits on memory must remain, at least to reduce that risk.

However, I also want to suggest that managing limits should be made optional, if it is not already. I think the main use case of VPA is to bring the requested resources closer to actual usage, so that scheduling and load balancing can be efficient. Limits do not play a role there, AFAIK, and carry complexity that users with only the main use case may not want.
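One way such an opt-out could be expressed per container, as a minimal sketch; the controlledValues field below is hypothetical in the context of this discussion and did not exist in the VPA API at the time:

```yaml
spec:
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        # Hypothetical opt-out: VPA updates requests only
        # and leaves any limits in the pod spec untouched.
        controlledValues: RequestsOnly
```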

@johanneswuerbach (Contributor, Author)

Instead of 3), I would also be happy if the proportional scaling of requests & limits could be disabled whenever limit != request.

I see this as definitely useful when both are equal, to remain inside the same QoS class, but what is the benefit of changing the limit proportionally in other cases? I found that fairly surprising.
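To make the QoS point concrete, a minimal sketch assuming a 100m to 200m CPU recommendation (memory values are placeholders): with request == limit the pod is in the Guaranteed QoS class, and scaling both by the same factor keeps it there:

```yaml
# Guaranteed QoS requires requests == limits for every resource.
# Proportional scaling preserves the equality, e.g. after a
# 100m -> 200m recommendation:
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 200m
    memory: 256Mi
```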

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 19, 2019
@bskiba (Member) commented Jan 17, 2020

Closing this in favour of #2387, which seems to be asking for the same thing (making it configurable whether limits should be modified).

@bskiba (Member) commented Jan 17, 2020

/close

@k8s-ci-robot (Contributor)

@bskiba: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
