
Support APIReader/NoCache option for ControllerUtil.CreateOr(Update|Patch) functions to avoid OOM errors #1222

Closed
akutz opened this issue Oct 20, 2020 · 11 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@akutz
Contributor

akutz commented Oct 20, 2020

Use Case(s)

As a Developer,
I would like to use controllerutil.CreateOrUpdate and/or controllerutil.CreateOrPatch without the Get calls in both functions resulting in the creation of an informer/watch for the specified object,
Because the implicit watch behavior creates scaling issues and OOM errors on pods that may not ever explicitly watch a ConfigMap, but do create one, on a cluster with thousands of ConfigMap resources.

Details

By default, upon creation, the controller-runtime Manager initializes a DelegatingClient, which reads from an internal cache and writes directly to the API server. This client uses a backing cache implementation that automatically sets up informers (aka watches) on resource types during Get and List calls.

This is by design, and is normally fine, but contributes to issues at scale. Imagine a controller that creates ConfigMap resources, but never explicitly watches them. That controller may be running in a pod with a 30MiB memory limit. The pod will be terminated with an out-of-memory (OOM) error on a cluster with 100-120 ConfigMap resources at ~50 KiB each -- resources that have nothing at all to do with this controller.

The Manager interface does support a GetAPIReader() client.Reader function that returns a client that bypasses the informer cache. The problem lies in the functions controllerutil.CreateOrUpdate and controllerutil.CreateOrPatch. These functions share similar signatures and logic:

func CreateOrXYZ(ctx context.Context, c client.Client, obj runtime.Object, mutateFn func() error) (OperationResult, error) {
	key, err := client.ObjectKeyFromObject(obj)
	if err != nil {
		return OperationResultNone, err
	}

	if err := c.Get(ctx, key, obj); err != nil {
		if !errors.IsNotFound(err) {
			return OperationResultNone, err
		}
		if err := mutate(mutateFn, key, obj); err != nil {
			return OperationResultNone, err
		}
		if err := c.Create(ctx, obj); err != nil {
			return OperationResultNone, err
		}
		return OperationResultCreated, nil
	}

	// ...

	if err := c.XYZ(ctx, obj); err != nil {
		return OperationResultNone, err
	}

	return OperationResultXYZ, nil
}

Both functions must first Get the object before attempting a Create or Update/Patch call. This means that currently these functions cannot take only an APIReader. The argument given to the client.Client parameter must be a client that can both read and write.

The bottom line is that these helper functions, which are prevalent throughout much of our code (and no doubt many others'), contribute to scaling problems because of the by-design side effect of an implicit watch on objects via Get calls.
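To make the side effect concrete, here is a minimal stdlib-only model of the behavior described above. All names are illustrative stand-ins, not the real controller-runtime API; the point is that a cache-backed client lazily starts an informer for a resource kind on its first Get, after which every object of that kind stays in memory:

```go
package main

import "fmt"

// cacheClient models a cache-backed client that lazily starts an
// informer (watch) for a resource kind on the first Get for that kind.
type cacheClient struct {
	informers map[string]bool   // kind -> informer started
	store     map[string]string // "kind/key" -> object
}

func (c *cacheClient) Get(kind, key string) string {
	if !c.informers[kind] {
		// The side effect at issue: the first Get for a kind starts a
		// watch, caching ALL objects of that kind cluster-wide.
		c.informers[kind] = true
		fmt.Printf("started informer for %s\n", kind)
	}
	return c.store[kind+"/"+key]
}

func main() {
	c := &cacheClient{informers: map[string]bool{}, store: map[string]string{}}
	c.Get("ConfigMap", "default/foo") // first Get: informer starts
	c.Get("ConfigMap", "default/bar") // informer already running
	fmt.Println(len(c.informers))     // 1
}
```

A controller that only ever creates ConfigMaps still pays the memory cost of caching every ConfigMap in the cluster, because CreateOrUpdate performs a Get first.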

@vincepri
Member

Would it make sense to have an option for the client to ignore cached reads?

@dvob
Contributor

dvob commented Oct 20, 2020

I'm surprised by how the default client.Client behaves. I had assumed that resources only get cached (watched) when we explicitly call controller.Watch, Builder.Watches, Builder.For, Builder.Owns, or similar.
To me, the expected behavior for the default client.Client would be to serve a resource on Get or List from the cache if it is there, and otherwise fetch it from the cluster without implicitly starting to cache new resource types.

@akutz
Contributor Author

akutz commented Oct 21, 2020

Hi @dsbrng25b ,

For me the expected behavior for the default client.Client would be to serve a resource on Get or List from the cache if it is in the cache and otherwise get it from the cluster but not start to cache new resources implicitly.

I thought the same as well, but then I realized it doesn't make much sense to cache objects for which there are no informers; otherwise you may be getting stale data. And if you cached on a Get call, wouldn't a subsequent Get just cache it again? Anyway, I get (pun intended) why they added implicit watches, but at the same time, I agree this type of behavior shouldn't be a side effect, and should possibly even be opt-in.

@alvaroaleman
Member

The behavior of the cache is indeed a debatable thing (in both directions). So far, no one bothered to start that discussion 🙃

As for the problem at hand, I don't think it's possible to change anything about this in CreateOr{Patch,Update}. They get a client, and they use it.

You might be able to solve your problem by overriding the NewClient func on the Manager. The default already redirects access to *unstructured.Unstructured to the APIReader; you could add something similar for the types you do not want to cache.

Another approach I've used in the past is to just build a client out of the APIReader and the default cache-backed client:
https://github.com/kubermatic/kubermatic/blob/1a7b80bb028319bae71e6cd1d3b72d44c91b896a/pkg/controller/user-cluster-controller-manager/resources/controller.go#L286-L296
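The split-client pattern linked above can be sketched in stdlib-only Go. Reader and Writer here are simplified stand-ins for controller-runtime's client.Reader and client.Writer interfaces: reads are delegated to a direct API reader that never starts informers, while writes go through the normal client.

```go
package main

import "fmt"

// Simplified stand-ins for controller-runtime's client.Reader / client.Writer.
type Reader interface{ Get(key string) string }
type Writer interface{ Create(key, val string) }

// splitClient satisfies both interfaces by embedding two different backends:
// Get is promoted from the uncached reader, Create from the regular client.
type splitClient struct {
	Reader // uncached, direct API reads (no informers started)
	Writer // regular client for writes
}

type directReader struct{ api map[string]string }

func (r directReader) Get(key string) string { return r.api[key] }

type normalWriter struct{ api map[string]string }

func (w normalWriter) Create(key, val string) { w.api[key] = val }

func main() {
	api := map[string]string{"default/foo": "v1"} // stands in for the API server
	c := splitClient{Reader: directReader{api}, Writer: normalWriter{api}}
	fmt.Println(c.Get("default/foo")) // read bypasses any cache
	c.Create("default/bar", "v2")     // write through the normal path
	fmt.Println(c.Get("default/bar")) // v2
}
```

Because splitClient satisfies both interfaces, it can be handed to helpers like CreateOrUpdate that need a combined read/write client, while its reads never trigger informers.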

@akutz
Contributor Author

akutz commented Oct 21, 2020

Another approach I've used in the past is to just build a client out of the APIReader and the default cache-backed client

Sort of like...

[screenshot: code sketch of a combined reader/writer client]

What do they say about crazy minds thinking alike? :)

@akutz
Contributor Author

akutz commented Oct 21, 2020

At minimum @alvaroaleman, I think we should do a better job of highlighting this behavior in documentation. It was completely unexpected, and while I understand the reasons, we should make sure others are aware of the design and how to mitigate or work around situations similar to what I've outlined in this issue.

@timebertt
Contributor

We hit the same issue in our clusters as well.

You might be able to solve your problem by overriding the NewClient func on the Manager. The default already redirects access to *unstructured.Unstructured to the APIReader; you could add something similar for the types you do not want to cache.

To eliminate unnecessary network traffic and decrease the memory footprint of our controllers in the short-term, we basically implemented exactly what @alvaroaleman suggested (see gardener/gardener#2940).

The usage basically looks something like this:

mgr, err := manager.New(restConfig, manager.Options{
	NewClient: utils.NewClientFuncWithDisabledCacheFor(
		&corev1.Secret{},
		&corev1.ConfigMap{},
	),
})

(e.g. ref gardener/gardener-extension-networking-calico#55)

The client implementation can either cache only objects of the specified GKs, or disable the cache for the specified GKs.
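The idea can be sketched in stdlib-only Go; this is a rough model, not gardener's actual helper (which is more involved and operates on GroupKinds). A get function is chosen per kind: kinds in the disabled set read straight from the API server, everything else goes through the cache.

```go
package main

import "fmt"

// getFunc stands in for a client's read path.
type getFunc func(kind, key string) string

// newGetWithDisabledCacheFor returns a read path that bypasses the cache
// for the listed kinds and uses the cached path for everything else.
func newGetWithDisabledCacheFor(cached, direct getFunc, kinds ...string) getFunc {
	disabled := map[string]bool{}
	for _, k := range kinds {
		disabled[k] = true
	}
	return func(kind, key string) string {
		if disabled[kind] {
			return direct(kind, key) // no informer is ever started
		}
		return cached(kind, key) // normal cached read
	}
}

func main() {
	cached := func(kind, key string) string { return "cached:" + kind + "/" + key }
	direct := func(kind, key string) string { return "direct:" + kind + "/" + key }
	get := newGetWithDisabledCacheFor(cached, direct, "Secret", "ConfigMap")
	fmt.Println(get("ConfigMap", "default/foo")) // direct:ConfigMap/default/foo
	fmt.Println(get("Pod", "default/bar"))       // cached:Pod/default/bar
}
```

This mirrors the usage snippet above: Secrets and ConfigMaps never end up in the informer cache, while all other kinds keep the default cached behavior.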

The naming might be horrible, but I'm definitely open to contributing this back to c-r if there is interest in it :)

@vincepri
Member

vincepri commented Nov 5, 2020

A proposed solution that uses the delegating client and offers a builder is in #1249, feel free to leave feedback and comments.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2021
@vincepri
Member

vincepri commented Feb 3, 2021

Fixed in #1249

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Fixed in #1249

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
