
Memory leak in the controller #2639

Closed
dkoshkin opened this issue Sep 7, 2022 · 23 comments
Labels
  • kind/bug Categorizes issue or PR as related to a bug.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@dkoshkin

dkoshkin commented Sep 7, 2022

/kind bug


What steps did you take and what happened:
Deploy CAPZ v1.3.2 and let it run for some time. The controller leaks memory even when there are no Azure/AKS clusters.

Here is a graph with just a single non-Azure/AKS cluster:
[screenshot: memory usage graph]

A steeper graph when scaling to ~100 clusters (still 0 Azure/AKS clusters):
[screenshot: memory usage graph]

What did you expect to happen:
The controller's memory usage should remain stable when there are no Azure/AKS clusters.

Anything else you would like to add:
I enabled pprof to capture some profiles; here is what it looked like shortly after the controller started up:
[screenshot: pprof output at startup]

And then again after a couple of hours:
[screenshot: pprof output after a couple of hours]
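For context, profiles like the ones above can be pulled once a pprof endpoint is exposed in the controller binary. A minimal sketch of one way to do that in Go, assuming the endpoint is not already wired up through a manager flag; the port and the use of the default mux are assumptions for illustration, not the controller's actual wiring:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof handlers on a side port so heap profiles can be pulled
	// from the running process, e.g. with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the controller manager as usual ...
}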

The profiles pointed us to a potential culprit, a growing slice:

s.vals = append(s.vals, keysAndValues...)

I added some debug logging for the length and capacity of that slice, and after a few hours it held more than 100,000 KV pairs.
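As an illustration of the suspected pattern (the type name and shape here are hypothetical, not the actual CAPZ or telemetry types): if a single long-lived log sink appends key/value pairs to its own slice on every call, and that one instance is shared across reconciles, the slice only ever grows.

package main

import "fmt"

// spanLogSink is a hypothetical stand-in for a logr sink that accumulates
// key/value pairs on the receiver; it only illustrates the append pattern.
type spanLogSink struct {
	vals []interface{}
}

// WithValues appends onto the receiver's slice in place, so a sink that is
// created once and reused keeps every KV pair it has ever been given.
func (s *spanLogSink) WithValues(keysAndValues ...interface{}) {
	s.vals = append(s.vals, keysAndValues...)
}

func main() {
	shared := &spanLogSink{} // created once, e.g. at setup time
	// Simulate many reconciles all reusing the same sink.
	for i := 0; i < 100000; i++ {
		shared.WithValues("cluster", "example", "reconcileID", i)
	}
	fmt.Printf("len=%d cap=%d\n", len(shared.vals), cap(shared.vals))
}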


A different view in pprof provided some more info on where to look:
[screenshot: pprof graph view]

It doesn't give the exact function, but it hints towards

WithEventFilter(predicates.ResourceNotPausedAndHasFilterLabel(log, acr.WatchFilterValue)).

It looks like the logger is instantiated only once, when SetupWithManager is called from main.go:

func (acr *AzureClusterReconciler) SetupWithManager(ctx context.Context, mgr ctrl.Manager, options Options) error {
	_, log, done := tele.StartSpanWithLogger(ctx,
		"controllers.AzureClusterReconciler.SetupWithManager",
		tele.KVP("controller", "AzureCluster"),
	)
	defer done()

This is different from how other providers set up their loggers: they get a new logger instance on every reconcile, so there is no growing slice of KV pairs.
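For comparison, a sketch of what that per-reconcile pattern can look like with controller-runtime's context logger; the stubbed reconciler type and the specific key/value pairs are assumptions for illustration, not the provider's actual code:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// AzureClusterReconciler is stubbed here only so the sketch is self-contained;
// the real reconciler has more fields.
type AzureClusterReconciler struct{}

// Reconcile derives its logger from the request context on every call, so any
// key/value pairs added here are scoped to this one reconcile and released
// with it, rather than accumulating on a logger created once at setup time.
func (r *AzureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := ctrl.LoggerFrom(ctx).WithValues(
		"namespace", req.Namespace,
		"name", req.Name,
	)
	log.Info("reconciling AzureCluster")
	// ... actual reconcile logic would go here ...
	return ctrl.Result{}, nil
}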

Environment:

  • cluster-api-provider-azure version: v1.3.2
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 7, 2022
@dkoshkin
Author

dkoshkin commented Sep 7, 2022

I thought that maybe the way the KVs are appended to the slice was causing an issue in Go, and tried a similar approach to how it's done in https://github.com/kubernetes/klog/blob/02fe3234c86b222bc85406122bd0d11acbe5363a/internal/serialize/keyvalues.go#L39-L42

But that just leads to a spiky memory leak. Notice how the spikes happen every 10 minutes, on a periodic reconcile.
[screenshot: memory usage graph with periodic spikes]

All signs point to a bug in how the logger gets reused and KVs are appended to it across multiple reconciles.
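Roughly what that klog-style variant amounts to (a sketch with hypothetical names, not the actual klog or CAPZ code): allocate a fresh slice with exact capacity and copy, instead of growing the old backing array in place. That avoids repeated over-allocation, but if the same sink instance keeps receiving new KVs on every periodic reconcile, the total still grows, which would be consistent with the spiky yet upward-trending graph above.

// withValuesCopy returns a new slice sized exactly for the combined contents
// rather than appending into the receiver's existing backing array.
func withValuesCopy(existing, keysAndValues []interface{}) []interface{} {
	merged := make([]interface{}, 0, len(existing)+len(keysAndValues))
	merged = append(merged, existing...)
	merged = append(merged, keysAndValues...)
	return merged
}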

@mboersma
Contributor

mboersma commented Sep 8, 2022

@dkoshkin this is excellent (not that there's a leak, but that you did this profiling work)!

I assume the problem exists in 1.4.2 and main as well, but when time permits I'll try pprof on that.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 7, 2022
@jackfrancis
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 7, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 6, 2023
@nawazkh
Member

nawazkh commented Apr 6, 2023

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 6, 2023
@jonathanbeber

The problem still exists in v1.7.1

[screenshot: Screenshot_2023-04-19_17-52-09]

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2023
@mboersma
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 20, 2023
@sonasingh46 sonasingh46 added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Aug 3, 2023
@dtzar
Contributor

dtzar commented Aug 3, 2023

Matt couldn't reproduce this behavior, and we are going to be doing extensive performance testing with #3547.
It has also been a long time since this issue was opened (with numerous releases since then).
For these reasons, we're marking this as needs-more-evidence.

@dtzar dtzar moved this to Wait-On-Author in CAPZ Planning Aug 3, 2023
@jonathanbeber

I'm still seeing this in v1.8.5; is it already too old to be considered?

Otherwise I'm happy to provide more logs/details.

[screenshot: Screenshot_2023-08-03_17-03-53]

@dtzar
Contributor

dtzar commented Aug 9, 2023

I'd like to see if it still exists in the latest 1.10.x.
I also need to ask: is the load on CAPZ increasing over time, by chance? E.g. are you adding more clusters to manage over the course of a day?

@jonathanbeber

jonathanbeber commented Aug 10, 2023

@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 25, 2024
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 25, 2024
@nawazkh
Member

nawazkh commented Feb 26, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 26, 2024
@nawazkh
Member

nawazkh commented Feb 26, 2024

@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.

Hey @jonathanbeber, did you get a chance to probe further? Or do you have any recommended steps for probing this issue so that we can take it over and check it ourselves?

@jonathanbeber

I'm sorry, @nawazkh, I don't have any contact with CAPZ controllers anymore. You should be able to see the issue by simply running the controller and having some metrics collector in place (e.g. Prometheus + Grafana).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 25, 2024
@github-project-automation github-project-automation bot moved this from Wait-On-Author to Done in CAPZ Planning Jul 25, 2024