feat(helm chart): Make env vars configurable and auto configure go runtime #1412

Closed
wants to merge 1 commit into from

Conversation

@sslavic sslavic commented Jan 26, 2024

What this PR does / why we need it: This PR extends the metrics-server Helm chart with support for configuring environment variables, and it automatically configures the Go runtime (GOMAXPROCS and GOMEMLIMIT) so that the runtime is aware of the resources assigned to the metrics-server and addon-resizer containers. This reduces the likelihood of CPUThrottlingHigh paging and OOM crashes.
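For readers who want to check what the Go runtime actually picked up inside a container, here is a minimal sketch using only the standard library. It is not part of the PR; the helper name `effectiveSettings` is made up for illustration. `runtime.GOMAXPROCS(0)` and `debug.SetMemoryLimit(-1)` only query the current values, and the memory-limit call needs Go 1.19 or newer.

```go
package main

import (
	"fmt"
	"math"
	"runtime"
	"runtime/debug"
)

// effectiveSettings reports the values the Go runtime is currently using.
// Passing 0 to GOMAXPROCS and a negative value to SetMemoryLimit only
// queries the setting; neither call changes anything.
func effectiveSettings() (procs int, memLimit int64) {
	return runtime.GOMAXPROCS(0), debug.SetMemoryLimit(-1)
}

func main() {
	procs, memLimit := effectiveSettings()
	fmt.Printf("GOMAXPROCS=%d (runtime.NumCPU: %d)\n", procs, runtime.NumCPU())
	if memLimit == math.MaxInt64 {
		// math.MaxInt64 is the runtime's way of saying "no soft limit".
		fmt.Println("GOMEMLIMIT: not set")
	} else {
		fmt.Printf("GOMEMLIMIT=%d bytes\n", memLimit)
	}
}
```

Running this once with and once without the GOMAXPROCS and GOMEMLIMIT environment variables set shows whether the values reached the process.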

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2024
@k8s-ci-robot
Contributor

Welcome @sslavic!

It looks like this is your first PR to kubernetes-sigs/metrics-server 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/metrics-server has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 26, 2024
@k8s-ci-robot
Contributor

Hi @sslavic. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 26, 2024
Contributor

@stevehipwell stevehipwell left a comment

Thanks for the PR @sslavic. I think a boolean value to set the GOMAXPROCS & GOMEMLIMIT API might be better, and it could be a single value for the whole chart. Adding extraEnvVars to each container would then make this infinitely customisable for anyone who wanted an alternative implementation.

@sslavic
Author

sslavic commented Jan 29, 2024

@stevehipwell PTAL

@sslavic sslavic requested a review from stevehipwell January 29, 2024 22:03
@stevehipwell
Contributor

Thanks @sslavic, this looks good to me in principle.

@serathius can you see any issue with setting GOMAXPROCS & GOMEMLIMIT by default?

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 30, 2024
@asychev

asychev commented Feb 7, 2024

Could we please move this forward to unblock 0.7.0 chart release?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2024
@dgrisonnet
Member

/triage accepted
/assign @stevehipwell

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 8, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sslavic
Once this PR has been reviewed and has the lgtm label, please ask for approval from stevehipwell. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 9, 2024
@stevehipwell
Contributor

@dgrisonnet I'm happy from the Helm perspective, but I'd like a second opinion from one of the core maintainers.

@sslavic could you add an entry under the [UNRELEASED] section in the Helm chart CHANGELOG covering what you've done?

@sslavic
Author

sslavic commented Feb 14, 2024

Chart CHANGELOG has been updated, @stevehipwell PTAL

@serathius
Contributor

@serathius can you see any issue with setting GOMAXPROCS & GOMEMLIMIT by default?

Hard to say. I haven't seen those variables used anywhere in the Kubernetes ecosystem, maybe because of a lack of awareness, maybe because they don't bring any tangible benefit. I would not recommend making this a default setting without prior testing.

@serathius
Contributor

Looking at golang/go#33803, it proposes to set GOMAXPROCS=max(1, floor(cpu_quota)). Implying that what PR proposes is not very good.
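As a small illustration of the formula quoted above, the sketch below applies max(1, floor(cpu_quota)) to a quota expressed in millicores, the unit Kubernetes uses. The function name `gomaxprocsFromQuota` is hypothetical and is not taken from the Go proposal or from this PR.

```go
package main

import "fmt"

// gomaxprocsFromQuota applies max(1, floor(cpu_quota)) to a CPU quota
// given in millicores (1000m = 1 CPU).
func gomaxprocsFromQuota(milliCPU int64) int64 {
	procs := milliCPU / 1000 // integer division floors for non-negative input
	if procs < 1 {
		return 1
	}
	return procs
}

func main() {
	for _, m := range []int64{100, 1000, 1500, 2900} {
		fmt.Printf("%dm -> GOMAXPROCS=%d\n", m, gomaxprocsFromQuota(m))
	}
}
```

Note that under this formula a quota of 1500m still yields GOMAXPROCS=1, since fractional CPUs are rounded down.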

@sslavic
Author

sslavic commented Feb 15, 2024

what PR proposes is not very good

@serathius can you please expand on this? Please also consider and compare the tradeoffs against the current state, where the Go runtime for metrics-server (and the sidecar) is left at its defaults, which e.g. for GKE-managed metrics-server results in a lot of CPUThrottlingHigh paging.

This PR is a workaround for the Go runtime issue golang/go#33803.

Many projects use https://github.com/uber-go/automaxprocs as workaround.

automaxprocs uses CPU limits. CPU limits are typically not set on containers in k8s (and rightfully so, e.g. see https://home.robusta.dev/blog/stop-using-cpu-limits). The default resources assigned to metrics-server don't set limits either. Therefore, this PR auto-configures the Go runtime based on CPU requests.
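To make the difference between the two bases concrete, here is a minimal sketch. Everything in it is illustrative: the `cpuResources` type, the function names, the 100m example value, and in particular the choice to round a request up to a whole CPU are assumptions, not something this thread or the chart confirms.

```go
package main

import "fmt"

// cpuResources holds a container's CPU request and limit in millicores.
// Zero means "not set". Hypothetical type for this sketch.
type cpuResources struct {
	RequestMilli int64
	LimitMilli   int64
}

// pickBasis returns the millicore value a policy would start from.
// ok is false when the chosen field is unset, in which case the Go
// runtime would simply be left at its defaults.
func pickBasis(r cpuResources, useRequests bool) (milli int64, ok bool) {
	if useRequests {
		return r.RequestMilli, r.RequestMilli > 0
	}
	return r.LimitMilli, r.LimitMilli > 0
}

// wholeCPUs rounds millicores up to whole CPUs (assumed rounding mode).
func wholeCPUs(milli int64) int64 {
	return (milli + 999) / 1000
}

func main() {
	// A container with a CPU request but no CPU limit.
	r := cpuResources{RequestMilli: 100}

	if _, ok := pickBasis(r, false); !ok {
		fmt.Println("limit-based: no limit set, runtime left at defaults")
	}
	if milli, ok := pickBasis(r, true); ok {
		fmt.Printf("request-based: GOMAXPROCS=%d\n", wholeCPUs(milli))
	}
}
```

With only a request set, the limit-based policy has nothing to work with, while the request-based policy still produces a value.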

Using automaxprocs would be even more invasive than the approach this PR proposes.

GOMEMLIMIT is very useful too, but it is relatively new, so we can't expect many projects to be using it at this point.
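For illustration, a GOMEMLIMIT value could be derived from a container memory limit roughly as follows. This is only a sketch: the helper name `gomemlimitValue` and the 90% headroom factor are assumptions made for the example, not what this PR does. GOMEMLIMIT accepts a plain number of bytes, so no unit suffix is needed.

```go
package main

import (
	"fmt"
	"strconv"
)

// gomemlimitValue returns a GOMEMLIMIT string (plain bytes) set to the
// given percentage of the container memory limit. Dividing before
// multiplying avoids overflow at the cost of rounding down slightly.
func gomemlimitValue(limitBytes int64, percent int64) string {
	return strconv.FormatInt(limitBytes/100*percent, 10)
}

func main() {
	const mib = 1 << 20
	// Hypothetical 200 MiB limit with 10% headroom left for non-heap memory.
	fmt.Println("GOMEMLIMIT=" + gomemlimitValue(200*mib, 90))
}
```

Since GOMEMLIMIT is a soft limit, leaving some headroom below the container limit gives the garbage collector room to react before the kernel OOM killer does.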

The new defaults can be opted out of completely, or tuned, by

  • adjusting CPU requests / memory limits, and/or by
  • disabling automatic tuning and configuring additional environment variables to their liking.

IMO these new defaults make metrics-server better out of the box, reducing the chance of CPUThrottlingHigh paging. The hope is that GKE-managed metrics-server will have this change propagated to it too 🤞🏻

@stevehipwell
Contributor

@serathius is there a reason why MS couldn't use uber-go/automaxprocs?

@sslavic
Author

sslavic commented Feb 16, 2024

automaxprocs is based on CPU limits, so for metrics-server, which has no limits by default, it wouldn't change anything. That is good from a backward-compatibility perspective, but not for the goal, which is to reduce the chance of the high CPU throttling issue by default, out of the box. Btw, there are articles devoted to this issue, see https://github.com/robusta-dev/alert-explanations/wiki/CPUThrottlingHigh-on-metrics-server-(Prometheus-alert) by @aantn. IMO it's not good that e.g. on GKE the only option is to silence the alert and let metrics-server misbehave, living with its Go runtime not being configured.
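To show why a limit-based approach has nothing to act on when no CPU limit is set, here is a sketch that parses the contents of a cgroup v2 `cpu.max` file, which reads `max <period>` when there is no quota. This is an illustration only and is not automaxprocs's actual code; the function name `parseCPUMax` is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMax parses cgroup v2 cpu.max content of the form
// "<quota> <period>" or "max <period>". It returns the quota in CPUs,
// with ok=false when no limit is configured.
func parseCPUMax(s string) (cpus float64, ok bool, err error) {
	fields := strings.Fields(s)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max content: %q", s)
	}
	if fields[0] == "max" {
		return 0, false, nil // no CPU limit: nothing to derive GOMAXPROCS from
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, false, err
	}
	if period <= 0 {
		return 0, false, fmt.Errorf("invalid period in cpu.max: %q", s)
	}
	return quota / period, true, nil
}

func main() {
	for _, content := range []string{"max 100000", "200000 100000"} {
		cpus, ok, err := parseCPUMax(content)
		fmt.Printf("%q -> cpus=%v limited=%v err=%v\n", content, cpus, ok, err)
	}
}
```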

Using automaxprocs has another downside compared to the solution proposed in this PR: it can't be opted out of as easily. We'd need at least extra env vars support for that, and even then it wouldn't be as effective, e.g. when it comes to the ease of propagating the high CPU throttling fix by default even to managed metrics-server services like the one on GKE.

@sslavic
Author

sslavic commented Feb 16, 2024

Uber may open-source an automaxprocs equivalent for GOMEMLIMIT: uber-go/automaxprocs#56 (comment)

I still think env vars calculated in the infra code from the assigned resources are more transparent, lighter weight, less invasive and more flexible than using the libraries.

@serathius
Contributor

@serathius is there a reason why MS couldn't use uber-go/automaxprocs?

No, someone just needs to test it, compile the results, show an improvement, and send a PR.

@serathius can you please expand on this?

Just that the proposed solution is not a complete fix, and without tests showing an improvement we should not enable it by default.

My suggestion would be to keep MS components and Helm releases consistent. If we want to add envs in Helm, I would recommend not making the Go envs the default, but waiting for the binary to test and adopt https://github.com/uber-go/automaxprocs

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

@jnoordsij

I ran into this while proposing similar changes in other Helm charts, inspired by traefik/traefik-helm-chart#1029. I'm definitely not the expert on whether or not these changes are actually beneficial, but on that PR there are a series of references that look quite promising to me. Maybe they can help in understanding the possible benefits of merging this?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.