feat(helm chart): Make env vars configurable and auto configure go runtime #1412

sslavic · 2024-01-26T12:02:11Z

What this PR does / why we need it: This PR extends the metrics-server Helm chart with support for configuring environment variables, and it automatically configures the Go runtime (GOMAXPROCS and GOMEMLIMIT) so that the runtime is aware of the resources assigned to the metrics-server and addon-resizer containers. This reduces the likelihood of CPUThrottlingHigh paging and OOM crashes.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

k8s-ci-robot · 2024-01-26T12:02:20Z

Welcome @sslavic!

It looks like this is your first PR to kubernetes-sigs/metrics-server 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/metrics-server has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2024-01-26T12:02:21Z

Hi @sslavic. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

stevehipwell

Thanks for the PR @sslavic. I think a boolean value to set the GOMAXPROCS & GOMEMLIMIT API might be better, and it could be a single value for the whole chart. Adding extraEnvVars to each container would then make this infinitely customisable for anyone who wants an alternate implementation.

sslavic · 2024-01-29T22:03:13Z

@stevehipwell PTAL

stevehipwell · 2024-01-30T10:22:26Z

Thanks @sslavic, this looks good to me in principle.

@serathius can you see any issue with setting GOMAXPROCS & GOMEMLIMIT by default?

/ok-to-test

asychev · 2024-02-07T14:27:38Z

Could we please move this forward to unblock the 0.7.0 chart release?

dgrisonnet · 2024-02-08T18:00:53Z

/triage accepted
/assign @stevehipwell

k8s-ci-robot · 2024-02-08T19:21:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sslavic
Once this PR has been reviewed and has the lgtm label, please ask for approval from stevehipwell. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

charts/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stevehipwell · 2024-02-14T10:54:47Z

@dgrisonnet I'm happy from the Helm perspective but I'd like a second opinion from one of the core maintainers.

@sslavic could you add an entry under the [UNRELEASED] section in the Helm chart CHANGELOG covering what you've done?


sslavic · 2024-02-14T12:44:25Z

Chart CHANGELOG has been updated, @stevehipwell PTAL

serathius · 2024-02-15T09:13:45Z

> @serathius can you see any issue with setting GOMAXPROCS & GOMEMLIMIT by default?

Hard to say. I haven't seen those variables used anywhere in the Kubernetes ecosystem, maybe because of a lack of awareness, maybe because they don't bring any tangible benefit. I would not recommend making this a default setting without prior testing.

serathius · 2024-02-15T09:19:21Z

Looking at golang/go#33803, it proposes setting GOMAXPROCS=max(1, floor(cpu_quota)), implying that what PR proposes is not very good.
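
For reference, a minimal Go sketch of the rule quoted above, GOMAXPROCS=max(1, floor(cpu_quota)). The function name `maxProcsFromQuota` and the sample quotas are illustrative assumptions, not code from golang/go#33803 or from this PR, and `cpu_quota` is assumed to be expressed in cores.

```go
package main

import (
	"fmt"
	"math"
)

// maxProcsFromQuota applies max(1, floor(cpuQuota)).
// cpuQuota is assumed to be in cores, e.g. 2.5 for a 2500m quota.
// The name is hypothetical and only used for this illustration.
func maxProcsFromQuota(cpuQuota float64) int {
	n := int(math.Floor(cpuQuota))
	if n < 1 {
		// Never go below one P, even for fractional quotas.
		return 1
	}
	return n
}

func main() {
	// Sample quotas chosen for illustration only.
	for _, q := range []float64{0.1, 1.0, 2.5, 4.0} {
		fmt.Printf("quota=%.1f cores -> GOMAXPROCS=%d\n", q, maxProcsFromQuota(q))
	}
}
```

Note that the rule is defined on a CPU quota (a limit), so it does not say anything by itself about containers that only set CPU requests.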

sslavic · 2024-02-15T10:44:34Z

> what PR proposes is not very good

@serathius can you please expand on this? Please also weigh the tradeoffs against the current state, where the Go runtime for metrics-server (and the sidecar) is left at its defaults, which e.g. for the GKE-managed metrics-server results in a lot of CPUThrottlingHigh paging.

This PR is a workaround for Go runtime issue golang/go#33803.

Many projects use https://github.com/uber-go/automaxprocs as a workaround.

automaxprocs uses CPU limits. CPU limits are not typically set on containers in k8s (and rightfully so, e.g. see https://home.robusta.dev/blog/stop-using-cpu-limits). The resources assigned to metrics-server by default don't set limits either. Therefore, this PR auto-configures the Go runtime based on CPU requests.

Using automaxprocs would be even more invasive than the approach this PR proposes.

GOMEMLIMIT is very useful too, but relatively new, so we can't expect many projects to be using it at this point.

The new defaults can be tuned, or opted out of completely, by

adjusting CPU requests / memory limits, and/or by
disabling automatic tuning and configuring additional environment variables to their liking.

IMO these new defaults make metrics-server better out of the box, reducing the chance of CPUThrottlingHigh paging. The hope is that the GKE-managed metrics-server will have this change propagated to it too 🤞🏻
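
To make the request-based idea concrete, here is a rough Go sketch of deriving the two values from a container's CPU request and memory limit. This is not the chart's template: the helper names, the choice to round up to whole cores, and the 90% memory headroom are all assumptions made for illustration, and the sample inputs are hypothetical.

```go
package main

import "fmt"

// goMaxProcs rounds a CPU request given in millicores up to whole cores,
// with a minimum of 1. Rounding up is an assumption of this sketch.
func goMaxProcs(cpuRequestMilli int64) int64 {
	n := (cpuRequestMilli + 999) / 1000
	if n < 1 {
		return 1
	}
	return n
}

// goMemLimit returns the given percentage of a memory limit as a plain
// byte count, a format the GOMEMLIMIT env var accepts.
// The percentage is a hypothetical headroom knob, not a chart value.
func goMemLimit(memLimitBytes, percent int64) string {
	return fmt.Sprintf("%d", memLimitBytes*percent/100)
}

func main() {
	// Hypothetical container resources: 100m CPU request, 200Mi memory limit.
	cpuRequestMilli := int64(100)
	memLimitBytes := int64(200 * 1024 * 1024)

	fmt.Printf("GOMAXPROCS=%d\n", goMaxProcs(cpuRequestMilli))
	fmt.Printf("GOMEMLIMIT=%s\n", goMemLimit(memLimitBytes, 90))
}
```

Rounding up rather than down is only one possible choice. For any request of one core or less, it gives the same result (1) as the floor-based rule from golang/go#33803; the two differ for fractional requests above one core.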

stevehipwell · 2024-02-15T19:46:39Z

@serathius is there a reason why MS couldn't use uber-go/automaxprocs?

sslavic · 2024-02-16T13:55:55Z

automaxprocs is based on CPU limits, so for metrics-server, which has no limits by default, it wouldn't change anything. That is good from a backward compatibility perspective, but it doesn't serve the goal, which is to reduce the chance of CPUThrottlingHigh out of the box. Btw there are articles devoted to this issue, see https://github.com/robusta-dev/alert-explanations/wiki/CPUThrottlingHigh-on-metrics-server-(Prometheus-alert) by @aantn. IMO it's not good that e.g. on GKE the only option is to silence the alert and live with metrics-server misbehaving because its Go runtime is not configured.

Using automaxprocs has another downside compared to the solution proposed in this PR: it can't be opted out of as easily, and we'd need at least extra env vars support for that. Even then it wouldn't be as effective, e.g. when it comes to propagating the CPU throttling fix by default to managed metrics-server services like the one on GKE.

sslavic · 2024-02-16T14:10:58Z

Uber may open-source an automaxprocs equivalent for GOMEMLIMIT, see uber-go/automaxprocs#56 (comment)

I still think env vars calculated in the infra code from the assigned resources are more transparent, lighter weight, less invasive, and more flexible than using the libraries.
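
One way to check what the runtime actually picked up from such env vars is to ask it at startup. Below is a small sketch using only the standard library, assuming Go 1.19 or newer (where GOMEMLIMIT and `debug.SetMemoryLimit` exist). The `effectiveSettings` helper is a hypothetical name, not something from metrics-server.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// effectiveSettings reports the runtime's current GOMAXPROCS and memory
// limit without changing either of them.
func effectiveSettings() (procs int, memLimit int64) {
	// An argument of 0 queries GOMAXPROCS without modifying it.
	procs = runtime.GOMAXPROCS(0)
	// A negative argument queries the memory limit without modifying it.
	memLimit = debug.SetMemoryLimit(-1)
	return procs, memLimit
}

func main() {
	procs, memLimit := effectiveSettings()
	fmt.Printf("GOMAXPROCS=%d GOMEMLIMIT=%d bytes\n", procs, memLimit)
}
```

Running it with GOMAXPROCS and GOMEMLIMIT set in the environment should report those values. With GOMEMLIMIT unset, the limit is reported as math.MaxInt64, which means effectively no limit.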

serathius · 2024-02-16T15:54:19Z

> @serathius is there a reason why MS couldn't use uber-go/automaxprocs?

No, just someone needs to test it, compile results, show improvement, and send a PR.

> @serathius can you please expand on this?

Just that the proposed solution is not a complete fix, and without tests showing an improvement we should not enable it by default.

My suggestion would be to keep MS components and Helm releases consistent. If we want to add envs in Helm, I would recommend not making the Go envs the default, but waiting for the binary to test and adopt https://github.com/uber-go/automaxprocs

k8s-ci-robot · 2024-04-04T23:40:41Z

PR needs rebase.


jnoordsij · 2024-04-30T15:54:34Z

I ran into this while proposing similar changes in other Helm charts, inspired by traefik/traefik-helm-chart#1029. I'm definitely not the expert on whether or not these changes are actually beneficial, but on that PR there are a series of references that look quite promising to me. Maybe they can help in understanding the possible benefits of merging this?

k8s-triage-robot · 2024-07-29T16:34:37Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-08-28T16:56:37Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-09-27T17:37:23Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-09-27T17:37:28Z

@k8s-triage-robot: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2024

k8s-ci-robot requested review from stevehipwell and yangjunmyfm192085 January 26, 2024 12:02

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 26, 2024

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 26, 2024

sslavic force-pushed the helm-chart-env-go branch from 8c0f077 to 9a9d2a4 Compare January 26, 2024 12:34

stevehipwell suggested changes Jan 29, 2024


sslavic force-pushed the helm-chart-env-go branch from 9a9d2a4 to e8a8138 Compare January 29, 2024 22:01

sslavic requested a review from stevehipwell January 29, 2024 22:03

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 30, 2024

This was referenced Jan 30, 2024

feat(chart): Released v3.12.0 (v0.7.0) #1414

Merged

Release Helm chart for v0.7.0 #1409

Closed

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2024

k8s-ci-robot assigned stevehipwell Feb 8, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 8, 2024

sslavic force-pushed the helm-chart-env-go branch from e8a8138 to 20e4322 Compare February 8, 2024 19:21

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 9, 2024

feat(helm chart): Make env vars configurable and auto configure go ru…

755bec0

…ntime Signed-off-by: Stevo Slavic <[email protected]>

sslavic force-pushed the helm-chart-env-go branch from 20e4322 to 755bec0 Compare February 14, 2024 12:42

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 28, 2024

k8s-ci-robot closed this Sep 27, 2024
