
Memory leak in the controller #2639

Closed
dkoshkin opened this issue Sep 7, 2022 · 23 comments
Labels
  • kind/bug Categorizes issue or PR as related to a bug.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@dkoshkin

dkoshkin commented Sep 7, 2022

/kind bug


What steps did you take and what happened:
Deploy CAPZ v1.3.2 and let it run for some time. The controller leaks memory even when there are no Azure/AKS clusters.

Here is a graph with just a single non-Azure/AKS cluster:
[screenshot: memory usage graph]

A steeper graph when scaling to ~100 clusters (still 0 Azure/AKS clusters):
[screenshot: memory usage graph]

What did you expect to happen:
The controller's memory usage should remain stable when there are no Azure/AKS clusters.

Anything else you would like to add:
I enabled pprof to capture some profiles; here is what it looked like shortly after the controller started up:
[screenshot: pprof output at startup]

And then again after a couple of hours:
[screenshot: pprof output after a couple of hours]
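For context, profiles like the ones above can be pulled once a pprof endpoint is exposed in the controller binary. A minimal sketch of one way to do that in Go, assuming the endpoint is not already wired up through a manager flag; the port and the use of the default mux are assumptions for illustration, not the controller's actual wiring:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof handlers on a side port so heap profiles can be pulled
	// from the running process, e.g. with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the controller manager as usual ...
}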

The profiles pointed us to a potential culprit, a growing slice:

s.vals = append(s.vals, keysAndValues...)

I added some debug logging for the length and capacity of that slice, and after a few hours it held more than 100,000 KV pairs.
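As an illustration of the suspected pattern (the type name and shape here are hypothetical, not the actual CAPZ or telemetry types): if a single long-lived log sink appends key/value pairs to its own slice on every call, and that one instance is shared across reconciles, the slice only ever grows.

package main

import "fmt"

// spanLogSink is a hypothetical stand-in for a logr sink that accumulates
// key/value pairs on the receiver; it only illustrates the append pattern.
type spanLogSink struct {
	vals []interface{}
}

// WithValues appends onto the receiver's slice in place, so a sink that is
// created once and reused keeps every KV pair it has ever been given.
func (s *spanLogSink) WithValues(keysAndValues ...interface{}) {
	s.vals = append(s.vals, keysAndValues...)
}

func main() {
	shared := &spanLogSink{} // created once, e.g. at setup time
	// Simulate many reconciles all reusing the same sink.
	for i := 0; i < 100000; i++ {
		shared.WithValues("cluster", "example", "reconcileID", i)
	}
	fmt.Printf("len=%d cap=%d\n", len(shared.vals), cap(shared.vals))
}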


A different view in pprof provided some more info on where to look:
[screenshot: pprof graph view]

It doesn't give the exact function, but it hints towards

WithEventFilter(predicates.ResourceNotPausedAndHasFilterLabel(log, acr.WatchFilterValue)).

It looks like the logger is instantiated only once, when SetupWithManager is called from main.go:

func (acr *AzureClusterReconciler) SetupWithManager(ctx context.Context, mgr ctrl.Manager, options Options) error {
	_, log, done := tele.StartSpanWithLogger(ctx,
		"controllers.AzureClusterReconciler.SetupWithManager",
		tele.KVP("controller", "AzureCluster"),
	)
	defer done()

This is different from how other providers set up their loggers: they get a new logger instance on every reconcile, so there is no growing slice of KV pairs.
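For comparison, a sketch of what that per-reconcile pattern can look like with controller-runtime's context logger; the stubbed reconciler type and the specific key/value pairs are assumptions for illustration, not the provider's actual code:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// AzureClusterReconciler is stubbed here only so the sketch is self-contained;
// the real reconciler has more fields.
type AzureClusterReconciler struct{}

// Reconcile derives its logger from the request context on every call, so any
// key/value pairs added here are scoped to this one reconcile and released
// with it, rather than accumulating on a logger created once at setup time.
func (r *AzureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx).WithValues(
		"namespace", req.Namespace,
		"name", req.Name,
	)
	log.Info("reconciling AzureCluster")
	// ... actual reconcile logic would go here ...
	return ctrl.Result{}, nil
}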

Environment:

  • cluster-api-provider-azure version: v1.3.2
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 7, 2022
@dkoshkin
Author

dkoshkin commented Sep 7, 2022

I thought that maybe the way the KVs are appended to the slice was causing an issue in Go, and tried a similar approach to how it's done in https://github.com/kubernetes/klog/blob/02fe3234c86b222bc85406122bd0d11acbe5363a/internal/serialize/keyvalues.go#L39-L42

But that just leads to a spiky memory leak. Notice how the spikes happen every 10 minutes, on a periodic reconcile.
[screenshot: memory usage graph with periodic spikes]

All signs point to a bug in how the logger gets reused and KVs are appended to it across multiple reconciles.
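Roughly what that klog-style variant amounts to (a sketch with hypothetical names, not the actual klog or CAPZ code): allocate a fresh slice with exact capacity and copy, instead of growing the old backing array in place. That avoids repeated over-allocation, but if the same sink instance keeps receiving new KVs on every periodic reconcile, the total still grows, which would be consistent with the spiky yet upward-trending graph above.

// withValuesCopy returns a new slice sized exactly for the combined contents
// rather than appending into the receiver's existing backing array.
func withValuesCopy(existing, keysAndValues []interface{}) []interface{} {
	merged := make([]interface{}, 0, len(existing)+len(keysAndValues))
	merged = append(merged, existing...)
	merged = append(merged, keysAndValues...)
	return merged
}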

@mboersma
Contributor

mboersma commented Sep 8, 2022

@dkoshkin this is excellent (not that there's a leak, but that you did this profiling work)!

I assume the problem exists in 1.4.2 and main as well, but when time permits I'll try pprof on that.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 7, 2022
@jackfrancis
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 6, 2023
@nawazkh
Member

nawazkh commented Apr 6, 2023

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 6, 2023
@jonathanbeber

The problem still exists in v1.7.1

[screenshot: Screenshot_2023-04-19_17-52-09]

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2023
@mboersma
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2023
@sonasingh46 sonasingh46 added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Aug 3, 2023
@dtzar
Contributor

dtzar commented Aug 3, 2023

Matt couldn't reproduce this behavior, and we are going to be doing extensive performance testing with #3547.
It has also been a long time since this issue was opened (with numerous releases since then).
For these reasons, we're marking this as needs-more-evidence.

@dtzar dtzar moved this to Wait-On-Author in CAPZ Planning Aug 3, 2023
@jonathanbeber

I'm still seeing this in v1.8.5; is it already too old to be considered?

Otherwise I'm happy to provide more logs/details.

[screenshot: Screenshot_2023-08-03_17-03-53]

@dtzar
Contributor

dtzar commented Aug 9, 2023

I'd like to see if it still exists in the latest 1.10.x.
I also need to ask: is the load on CAPZ increasing over time, by chance? E.g. are you adding more clusters to manage over the course of a day?

@jonathanbeber

jonathanbeber commented Aug 10, 2023

@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 25, 2024
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2024
@nawazkh
Member

nawazkh commented Feb 26, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 26, 2024
@nawazkh
Member

nawazkh commented Feb 26, 2024

@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.

Hey @jonathanbeber, did you get a chance to probe further? Or do you have any recommended steps for probing this issue so that we can take it over and check it ourselves?

@jonathanbeber

I'm sorry, @nawazkh, I don't have any contact with CAPZ controllers anymore. You should be able to see the issue by simply running the controller and having some metrics collector in place (e.g. Prometheus + Grafana).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 25, 2024
@github-project-automation github-project-automation bot moved this from Wait-On-Author to Done in CAPZ Planning Jul 25, 2024