Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Goroutine memory leak #397

Merged
merged 4 commits into from
Sep 19, 2023
Merged

fix: Goroutine memory leak #397

merged 4 commits into from
Sep 19, 2023

Conversation

Jooho
Copy link
Contributor

@Jooho Jooho commented Jun 13, 2023

Motivation

The goroutines watching ETCD data are never released from memory until the process is down because the logic to stop the goroutines does not exist. It only deletes Service

Moreover, it tries to get data for a namespace even if the namespace does not exist. It causes a lot of garbage logs like the following

{"level":"error","ts":"2023-06-13T19:37:13Z","msg":"Reconciler error","controller":"ServiceReconciler","controllerGroup":"","controllerKind":"Namespace","Namespace":{"name":"openshift-build-test-28111211-5dttn"},"namespace":"","name":"openshift-build-test-28111211-5dttn","reconcileID":"4daa4200-cc10-4365-81e6-54840dffa332","error":"Namespace \"openshift-build-test-28111211-5dttn\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235"}

ISSUE: #385

Modifications

  • Goruntines will be stopped when the label modelmesh-enabled is removed from a namespace or the namespace is deleted.
  • If the request namespace does not exist, it will not retry to get the information again.

Result

When the lable modelmesh-enabled is added to a namespace, 7 Goroutines are created but if the label is removed or the namespace is removed, 7 Goroutines will be stopped. The Goroutines count can be checked with pprof.
Unfortunately, Modelmesh does not have pprof pkg so it is not possible to check it directly.

In order to test this, pprof pkg needs to be added to main. This PR has the code example. If Modelmesh needs this part, I will contribute this with another PR.

@Jooho Jooho self-assigned this Jul 11, 2023
@Jooho Jooho added the bug Something isn't working label Jul 11, 2023
Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Jooho -- this is my third attempt to fully grasp the problem and your code changes. Without adding 1000 namespaces and a memory profiler, do you have a reasonably easy way to test the code before and after. Even a manual setup with 2 or 3 namespaces and a few kubectl commands to verify the state of the cluster/MM deployment resources and/or log messages.

@@ -24,6 +24,10 @@ import (

func modelMeshEnabled(n *corev1.Namespace, controllerNamespace string) bool {
if v, ok := n.Labels["modelmesh-enabled"]; ok {
// Returns false if the namespace state is terminating even though the namespace have the 'modelmesh-enabled=true' label.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is helpful. Could you add comments to each of the changed code blocks above to describe what the code does and why (what problem it is solving).

@@ -119,24 +119,34 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
namespace = req.Name
n := &corev1.Namespace{}
if err := r.Client.Get(ctx, req.NamespacedName, n); err != nil {
if k8serr.IsNotFound(err) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, controller keep checking namespaces even though the namespaces do not exist anymore.
As a result, a lot of misleading error messages showed up in the log. (I don't find the snippet of the message now)

Copy link
Contributor

@tjohnson31415 tjohnson31415 Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I think the ServingRuntime controller also has this problem (erroring when reconciling after a namespace is deleted). I was getting similar error logs from it. But that can be resolved in a separate PR.

I noticed that there is a difference in logic in this PR whether the namespace is terminating (cleanup resources created for the namespace) or if the namespace doesn't exist (skip reconciliation). I think there is a chance that a reconciliation might not see that a namespace is in the terminating state by the time it reconciles. If the namespace has been deleted, we'd still want to ensure disconnect/cleanup of the MMService and run ModelEventStream.RemoveWatchedService. If the cleanup is moved to the namespace-deleted case then we would not even need the check to see if the namespace is terminating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjohnson31415 Thanks for the review. From my 2 cents, the operator keeps watching the namespace so I don't think there is a possibility of missing the terminated status of the Namespace object. When any namespace changes its status, this operator detects it. WDYT?

Regarding the IsNotFound part, do you want me to update this pr without it? and send another pr to update it when this original PR is merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more comment about the namespace object status. The main issue was the operator does not have logic when the namespace is under Terminating status(source). From my understanding, Terminating is the only problematic status in this procedure

  • Active + modelmesh-enabled" label ==> modelMeshEnabled return true
  • Terminating + modelmesh-enabled" label ==> modelMeshEnabled return true
  • Active + no label ==> modelMeshEnabled return false
    The other status does not impact this logic.

The most important point here is, that when any namespace's status changes, this reconclie will be called and check the above logic. When any namespace gets terminate signal, the status will be changed to Terminating.

please correct me if something is wrong here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the "happy path" case, I think that reconciliation should always be able to observe the terminating state. In error/edge cases (eg. many namespaces deleted in bulk, error encountered during reconciliation and a requeue happening), I'm not so sure since "Terminating" is a temporary status.

Before this PR, the reconciliation handled neither "Terminating" nor "Not Found". This PR implements clean-up if the namespace is seen as Terminating, but, If reconciliation happens for a namespace that no longer exists, should the code:
(a) Assume that in-memory resources for that namespace do not exists and just skip reconciliation (i.e. assume that if modelmesh was enabled for that namespace that it previously reconciled while it was Terminating)
(b) Check for in-memory resources that exist and clean them up if they do

With the "level-based" reconciliation that Operators are meant to implement, I think (b) makes more sense as it reconciles based on the current state, without assumption of past reconciliation.

Comment on lines +129 to +135
if err := r.List(ctx, sl, client.HasLabels{"modelmesh-service"}, client.InNamespace(namespace)); err != nil {
return ctrl.Result{}, err
} else {
for i := range sl.Items {
s := &sl.Items[i]
if err2 := r.Delete(ctx, s); err2 != nil && err == nil {
err = err2
if err := r.Delete(ctx, s); err != nil {
return ctrl.Result{}, err
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the same logic but I tried to separate error handling.
The logic is

  • If the namespace is not for modelmesh anymore, it will delete modelmesh Service when it exists.
  • If the namespace is terminated, it does not need to delete the modelmesh Service because it will be gone with the namespace

Comment on lines 143 to 146
//requeue is never expected here
if _, err, _ := r.reconcileService(ctx, mms, namespace, owner); err != nil {
return ctrl.Result{}, err
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the namespace is not for modelmesh anymore, it should trigger reconcileService for MMService list that manages the goroutines.

Comment on lines 239 to 250

n := &corev1.Namespace{}
if err := r.Client.Get(ctx, types.NamespacedName{Name: namespace}, n); err != nil {
return nil, err, false
}

if !modelMeshEnabled(n, r.ControllerDeployment.Namespace) {
r.ModelEventStream.RemoveWatchedService(serviceName, namespace)
r.Log.Info("Deleted Watched Service", "name", serviceName, "namespace", namespace)
return nil, nil, false
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will remove the goroutine when modelmesh is not enabled for a namespace.

  • when the namespace does not have the annotation modelmesh-enabled
  • when the namespace is under a Terminating state.

@rafvasq rafvasq changed the title fix gorutine memory leak fix: Goroutine memory leak Jul 31, 2023
@ckadner ckadner linked an issue Aug 15, 2023 that may be closed by this pull request
@Jooho
Copy link
Contributor Author

Jooho commented Aug 29, 2023

@ckadner added comments to the source.

Copy link
Contributor

@tjohnson31415 tjohnson31415 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to verify that these changes prevent a growth in the number of go routines when creating and deleting namespaces. To see the count of go routines, I used the pprof patch linked above.

The recent change to add more comments has caused the build to fail because the make fmt check doesn't pass.

Also, I'm not sure that the reconciler will always see the namespace as terminating. The cleanup that currently happens if the namespace is terminating could be moved to happen when the namespace has been deleted. Then there wouldn't need to be an additional check for the terminating state.

@@ -119,24 +119,34 @@ func (r *ServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
namespace = req.Name
n := &corev1.Namespace{}
if err := r.Client.Get(ctx, req.NamespacedName, n); err != nil {
if k8serr.IsNotFound(err) {
Copy link
Contributor

@tjohnson31415 tjohnson31415 Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I think the ServingRuntime controller also has this problem (erroring when reconciling after a namespace is deleted). I was getting similar error logs from it. But that can be resolved in a separate PR.

I noticed that there is a difference in logic in this PR whether the namespace is terminating (cleanup resources created for the namespace) or if the namespace doesn't exist (skip reconciliation). I think there is a chance that a reconciliation might not see that a namespace is in the terminating state by the time it reconciles. If the namespace has been deleted, we'd still want to ensure disconnect/cleanup of the MMService and run ModelEventStream.RemoveWatchedService. If the cleanup is moved to the namespace-deleted case then we would not even need the check to see if the namespace is terminating

Signed-off-by: jooho <[email protected]>
@ckadner ckadner removed the bug Something isn't working label Sep 12, 2023
Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Jooho

/lgtm

@kserve-oss-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ckadner, Jooho
To complete the pull request process, please assign tjohnson31415 after the PR has been reviewed.
You can assign the PR to them by writing /assign @tjohnson31415 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory leak in go routine unmarshalling secrets
4 participants