Memory leak in the controller #2639
Comments
I thought that maybe the way the KVs are appended to the slice may be causing an issue in Go, and tried a similar approach to how it's done in https://github.com/kubernetes/klog/blob/02fe3234c86b222bc85406122bd0d11acbe5363a/internal/serialize/keyvalues.go#L39-L42. But that just leads to a spiky memory leak. Notice how the spikes happen every 10 minutes, on a periodic reconcile. All signs point to a bug in how the …
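For reference, the klog lines linked above build the merged key/value list in a freshly allocated slice instead of appending to the existing one. A rough Go sketch of that copy-on-append pattern (the names here are illustrative, not the actual klog or CAPZ code):

```go
package main

import "fmt"

// mergeKVs is an illustrative sketch of the copy-on-append idea: size a new
// slice for both inputs and copy into it, so a long-lived logger never keeps
// growing a shared backing array.
func mergeKVs(existing, extra []interface{}) []interface{} {
	merged := make([]interface{}, 0, len(existing)+len(extra))
	merged = append(merged, existing...)
	merged = append(merged, extra...)
	return merged
}

func main() {
	base := []interface{}{"controller", "azurecluster"}
	perCall := mergeKVs(base, []interface{}{"namespace", "default"})
	// base keeps its original length and capacity; only perCall holds the merged pairs.
	fmt.Println(len(base), cap(base), len(perCall), cap(perCall))
}
```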
@dkoshkin this is excellent (not that there's a leak, but that you did this profiling work)! I assume the problem exists in 1.4.2 and main as well, but when time permits I'll try.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
/lifecycle stale
/remove-lifecycle stale
Matt couldn't reproduce this behavior, and we are going to be doing extensive performance testing with #3547.
I'd like to see if it still exists in the latest 1.10.x.
@dtzar this deployment is not managing any CAPZ resources. I will test 1.10.x in a few days/weeks and report back.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
/lifecycle rotten
/remove-lifecycle rotten
Hey @jonathanbeber, did you get a chance to probe further? Or do you have any recommended steps for probing this issue so that we can take it over and check it ourselves?
I'm sorry, @nawazkh, I don't have any contact with CAPZ controllers anymore. You will be able to see the leak by simply running the controller and having some metrics collector in place (e.g. Prometheus + Grafana).
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/kind bug
What steps did you take and what happened:
Deploy CAPZ v1.3.2 and let it run for some time. The controller leaks memory even when there are no Azure/AKS clusters.
Here is a graph with just a single non-Azure/AKS cluster:
A steeper graph when scaling to ~100 clusters (still 0 Azure/AKS clusters)
What did you expect to happen:
The controller should be stable when there are no Azure/AKS clusters.
Anything else you would like to add:
I enabled pprof to capture some info; here is what it looked like when first starting up:
And then again after a couple of hours:
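(For anyone who wants to reproduce this, here is a minimal sketch of exposing pprof in a Go binary; the port and wiring are illustrative and not the exact flags or code CAPZ uses.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on a local port; a heap profile can then be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the controller manager as usual ...
	select {}
}
```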
It pointed us to a potential culprit, a growing slice:
cluster-api-provider-azure/util/tele/span_logger.go, line 92 (at 392137a)
I added some debug logging on the length and capacity of that slice, and after a few hours it's at >100,000 KV pairs.
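To make the failure mode concrete, here is a minimal sketch, assuming a logger whose key/value slice lives as long as the logger itself (hypothetical types, not the real span_logger.go code):

```go
package main

import "fmt"

// spanLogger is a hypothetical stand-in for a long-lived logger: it holds a
// key/value slice and appends to it on every call.
type spanLogger struct {
	vals []interface{}
}

// WithValues appends in place. If the same *spanLogger survives across
// reconciles, vals never resets and its length and capacity climb on every pass.
func (l *spanLogger) WithValues(kvs ...interface{}) *spanLogger {
	l.vals = append(l.vals, kvs...)
	return l
}

func main() {
	l := &spanLogger{}
	for i := 0; i < 3; i++ { // imagine one iteration per periodic reconcile
		l.WithValues("cluster", "example", "reconcileID", i)
		fmt.Printf("len=%d cap=%d\n", len(l.vals), cap(l.vals))
	}
}
```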
A different view in pprof provided some more info on where to look:
It doesn't give the exact function, but it hints towards
cluster-api-provider-azure/controllers/azurecluster_controller.go, line 88 (at 392137a)
Looks like the log is instantiated once when SetupWithManager is called in main.go:
cluster-api-provider-azure/controllers/azurecluster_controller.go, lines 73 to 78 (at 392137a)
This is different from how other providers set up their loggers: they get a new logger instance on every reconcile and so do not have a growing slice of KV pairs.
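For contrast, a hedged sketch of the per-reconcile pattern mentioned above, using controller-runtime's context logger (the reconciler type and key/values are illustrative, not the actual CAPZ code or fix):

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// ExampleReconciler is illustrative only. The problematic pattern keeps a
// logger built once in SetupWithManager as a struct field and layers more
// key/values onto it every reconcile, so its KV slice only ever grows.
type ExampleReconciler struct{}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Take a fresh logger from the request context on each reconcile, so any
	// key/values added here are scoped to this call and released when it returns.
	log := ctrl.LoggerFrom(ctx).WithValues("namespace", req.Namespace, "name", req.Name)
	log.Info("reconciling")
	return ctrl.Result{}, nil
}
```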
Environment:
- cluster-api-provider-azure version: v1.3.2
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):