Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default operator (controller) memory resources are too low #1622

Open
bukovjanmic opened this issue Jul 29, 2024 · 14 comments
Open

Default operator (controller) memory resources are too low #1622

bukovjanmic opened this issue Jul 29, 2024 · 14 comments
Labels
bug Something isn't working triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@bukovjanmic
Copy link

Operator (as distributed via OLM) has default memory limit set to 550MiB and memory request to 20MiB.

Our monitoring shows that the memory usage spikes a bit higher, on our relatively small cluster (4 worker nodes, tens of Grafana CRs) the memory usage oscilates between 540Mi and 683Mi, without any intervention the operator gets continuously OOMKilled.

On our largest cluster, the memory consumption oscilates around 2-2.1 GB of memory.

While we know how to adjust the values for individual operator instances, the defaults should probably be changed to something reasonable, I suggest 600MiB as memory.request and 2.5GiB for memory.limit.

The default for memory request (20 MiB) is way off, it seems, the operator has never such consumption.

@bukovjanmic bukovjanmic added bug Something isn't working needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 29, 2024
@theSuess
Copy link
Member

@NissesSenap I know you're running a larger deployment of the operator - can you share your resource consumption to see if we should change the defaults here?

@NissesSenap
Copy link
Collaborator

I'm currently on PTO and don't have access to our metrics, but I can take a look at the end of the week when I'm back home.

@theSuess theSuess added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 5, 2024
@weisdd
Copy link
Collaborator

weisdd commented Aug 7, 2024

@bukovjanmic we agree that the default limits/requests are unlikely to be optimal, and we can certainly adjust the values. At the same time, the production memory consumption that you highlighted seems to be quite high, so we'd like to treat it as an opportunity to assess if 2Gi is reasonable or we can optimize it. @theSuess has added a pprof endpoint as part of the latest release, so the best way forward would be to both get insights on your environment (how many of each CRa you run, their resync period, etc) and pprof traces. The second best option if you can share the above-mentioned insights, so at least we can try to reproduce them. We understand that any of those might not be part of what you're willing to share publicly, so we can get in touch on Slack. Please, let us know how you'd like to continue with this :)

@bukovjanmic
Copy link
Author

Hi, no problem to share stats (green is memory request, yellow memory limit, blue real consumption).

Here is our test cluster consumption over 2 days, (operator version 5.11.0):

image

This cluster-wide instance is reconciling 5 Grafana CR, 48 GrafanaDashboards, 20 GrafanaDatasources, 1 GrafanaFolder

Here is our one of our production clusters over 2 days (operator version 5.9.2):

image

This cluster-wide instance is reconciling 33 Grafana CR, 187 GrafanaDashboards, 240 GrafanaDatasources, 2 GrafanaFolders.

All operators are installed via OLM with defaults (except for resources), the resnyc period is up to operator.

@bukovjanmic
Copy link
Author

Also, I put version 5.0.12 on our test cluster and see the pprof endpoint. Any pointers what to do with this?

@weisdd
Copy link
Collaborator

weisdd commented Aug 9, 2024

@bukovjanmic

  1. Install golang 1.22+ if needed;
  2. Run kubectl port-forward pods/<pod-name> 8888:8888
  3. go tool pprof -pdf http://localhost:8888/debug/pprof/heap > heap.pdf
  4. go tool pprof -pdf http://localhost:8888/debug/pprof/goroutine > goroutine.pdf
  5. Ideally, a graph with memory consumption where we can see both overall and residential usage (RSS)
  6. Also, if you collect metrics from the operator itself, a graph with go_goroutines would be handy
  7. Insights on number and type of CRs that were deployed at the time when you captured the data. Extra information like whether it's in-cluster or external grafana instances, source for dashboards (e.g. whether they're fetched from URLs), etc is appreciated.

Thanks!

@bukovjanmic
Copy link
Author

Here are pprof dumps from our test cluster. This instance of Grafana operator has 5 Grafana CR, 48 GrafanaDashboards, 20 GrafanaDatasources, 1 GrafanaFolder.
grafana-operator-pprof.zip

@weisdd
Copy link
Collaborator

weisdd commented Aug 12, 2024

@bukovjanmic in this profiling set, seems like most of the memory is occupied for caching configmaps / secrets. I would be interesting to see how memory is used in production.

@bukovjanmic
Copy link
Author

Here are pprof files for our production cluster - 33 Grafana CR, 187 GrafanaDashboards, 240 GrafanaDatasources, 2 GrafanaFolders.

grafana-operator-pprof-prod.zip

@theSuess
Copy link
Member

theSuess commented Sep 3, 2024

Looks like the RabbitMQ operator had a similar issue in the past. I'll try to adapt their solution to our operator

rabbitmq/cluster-operator#1549

@weisdd
Copy link
Collaborator

weisdd commented Sep 3, 2024

The good thing is that it's not a memory leak, but just part of the "normal" operations. Would be nice to tune the caching for sure.

Copy link

github-actions bot commented Oct 4, 2024

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

@github-actions github-actions bot added the stale label Oct 4, 2024
@theSuess theSuess removed the stale label Oct 4, 2024
Copy link

github-actions bot commented Nov 4, 2024

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

@github-actions github-actions bot added the stale label Nov 4, 2024
@theSuess theSuess added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed stale labels Nov 4, 2024
@theSuess theSuess removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Nov 11, 2024
@Rohlik
Copy link

Rohlik commented Dec 13, 2024

We have EKS cluster with just 35 GrafanaDashboards and 1 local Grafana configured as external within grafana-operator. No other CR are in use. It is worth mentioning that the operator is set to look for CR across all namespaces (116).
The memory consumption, especially at the beginning of the pod lifecycle, is pretty high - 1.68 GiB 🧐. That's quite a lot just for syncing 35 objects 🙄.
Attaching pprof output pprof-prod.zip.

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

5 participants