ROX-21046: Alerts for tenant nearing OOM #172

Merged
merged 3 commits into from
Dec 7, 2023

@ludydoo ludydoo commented Dec 1, 2023

Add alerts for tenant containers that are about to run out of memory (OOM).

@ludydoo ludydoo requested a review from a team as a code owner December 1, 2023 16:03
@ludydoo ludydoo requested review from stehessel and removed request for a team December 1, 2023 16:03
@stehessel
Contributor

Have you checked how close we are to hitting these alerts on prod right now? I've observed high memory usage on scanner pods in particular. See also https://redhat-internal.slack.com/archives/C0313JYKH8W/p1700683356311839?thread_ts=1700679380.899679&cid=C0313JYKH8W. I'm afraid this will be a very busy alert as is.

- name: tenant-resources
  rules:
    - expr: |
        sum(container_memory_max_usage_bytes{namespace=~"rhacs-.{20}",container!="POD",container!=""}) by (namespace, container, pod)
Contributor

Suggested change
- sum(container_memory_max_usage_bytes{namespace=~"rhacs-.{20}",container!="POD",container!=""}) by (namespace, container, pod)
+ sum(container_memory_working_set_bytes{namespace=~"rhacs-.{20}",container!="POD",container!=""}) by (namespace, container, pod)
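
For context on the suggestion: as far as I understand the cAdvisor metrics, `container_memory_max_usage_bytes` is a high-water mark that does not drop when usage falls, so an alert built on it would tend to stay firing, whereas `container_memory_working_set_bytes` tracks current usage. A rough sketch of how the working-set expression could feed a "nearing OOM" alert is below. The alert name, the 0.9 threshold, the `for` duration, the `severity` label, and the use of `container_spec_memory_limit_bytes` as the denominator are all assumptions for illustration, not the rule that was merged in this PR.

```yaml
groups:
  - name: tenant-resources
    rules:
      # Hypothetical alert: name, threshold, and duration are illustrative only.
      - alert: TenantContainerNearMemoryLimit
        expr: |
          sum(container_memory_working_set_bytes{namespace=~"rhacs-.{20}",container!="POD",container!=""}) by (namespace, container, pod)
            /
          # "> 0" drops containers without a memory limit, avoiding division by zero.
          sum(container_spec_memory_limit_bytes{namespace=~"rhacs-.{20}",container!="POD",container!=""} > 0) by (namespace, container, pod)
            > 0.9
        # Require the condition to hold for a while so short spikes do not page.
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} ({{ $labels.namespace }}) is above 90% of its memory limit."
```

Running the `expr` without the `> 0.9` comparison in the Prometheus UI is one way to see how much headroom each tenant container currently has before enabling the alert.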

Contributor Author

Fixed. I've checked whether we are currently hitting the alerts, and we are for only one Central instance.

Contributor

But that means the alert will fire immediately? I think we need to make sure this alert is not noisy, so we need a memory buffer on all Centrals.

Contributor Author

I added resources to the one Central that would have triggered the alert.

Contributor

@stehessel stehessel left a comment

Please keep an eye on this alert and tighten it if it gets noisy.

@ludydoo ludydoo merged commit 4f1d0c9 into master Dec 7, 2023
1 check passed
@ludydoo ludydoo deleted the ROX-21046-alert-for-tenant-oom branch December 7, 2023 12:50