Simple strategy with OOMKill data #257
Conversation
Thanks, nice work with finding the right metric. Two major comments so far:
I will complete a more detailed review once we fix those two items.
I'm still getting the above error, even on latest. Let me know when it is fixed and I will test again.
Extremely nice work. Left minor comments.
I think it's fine. We want to draw attention to it.
On Mon, Apr 29, 2024, 11:04 Pavel Zhukov commented on this pull request:
In robusta_krr/strategies/simple.py
<#257 (comment)>:
> @@ -101,6 +136,17 @@ def __calculate_memory_proposal(
self, history_data: MetricsPodData, object_data: K8sObjectData
) -> ResourceRecommendation:
data = history_data["MaxMemoryLoader"]
+ info: list[str] = []
+
+ if self.settings.use_oomkill_data:
+ max_oomkill_data = history_data["MaxOOMKilledMemoryLoader"]
+ max_oomkill_value = (
+ np.max([values[0, 1] for values in max_oomkill_data.values()]) if len(max_oomkill_data) > 0 else 0
+ )
+ if max_oomkill_value != 0:
+ info.append("OOMKill detected")
Should it be? It will look like an error, and it really isn't.
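The reduction in the diff above can be sketched in isolation. This is a minimal, self-contained example, assuming (as the expression `values[0, 1]` implies) that each entry in `max_oomkill_data` is a NumPy array of `(timestamp, value)` rows; the pod names and byte values here are made up for illustration:

```python
import numpy as np

# Hypothetical sample data: pod name -> array of (timestamp, value) rows,
# mirroring the shape assumed by values[0, 1] in the diff above.
max_oomkill_data = {
    "pod-a": np.array([[1714000000.0, 512 * 2**20]]),  # 512 MiB
    "pod-b": np.array([[1714000060.0, 256 * 2**20]]),  # 256 MiB
}

# Take the value column of the first row per pod; fall back to 0
# when no OOMKills were recorded at all.
max_oomkill_value = (
    np.max([values[0, 1] for values in max_oomkill_data.values()])
    if len(max_oomkill_data) > 0
    else 0
)

info: list[str] = []
if max_oomkill_value != 0:
    info.append("OOMKill detected")

print(max_oomkill_value)  # largest OOMKilled memory value across pods
```

A nonzero result both raises the memory recommendation floor and attaches the "OOMKill detected" note to the recommendation.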
From the testing I did, I decided to create a separate strategy that might replace the default one in the future. Motivation:
kube_pod_container_status_terminated_reason should give the point in time when the container was terminated, but because an OOMKilled event is normally very fast, it does not end up in this metric. So we need to use kube_pod_container_status_last_terminated_reason, and with that a new edge case might appear: a user might change the memory limit after an OOMKill (note: changing limits should also trigger a new termination, but we need to test whether there is something we are missing). One possibility is to try some Prometheus windows, but I feel that would just add more bugs without necessarily fixing anything (there are also performance considerations).
From my current tests, the current OOMKill metric is doing its job. Still, I propose to first give it some time and ask people to test how it works.
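To make the metric distinction concrete, here is a sketch of how a PromQL selector for the chosen series might be built. The label names follow kube-state-metrics conventions, but the namespace, pod regex, and 7d window are illustrative assumptions, not values from this PR:

```python
# Sketch: building a PromQL query for the kube-state-metrics series
# discussed above. Selector values here are hypothetical.

def oomkill_query(metric: str, namespace: str, pod_regex: str) -> str:
    """Return a PromQL expression flagging OOMKilled terminations of matching pods."""
    return (
        f'max_over_time({metric}'
        f'{{reason="OOMKilled", namespace="{namespace}", pod=~"{pod_regex}"}}[7d])'
    )

# kube_pod_container_status_terminated_reason often misses fast OOMKills,
# so the strategy reads the *last* terminated reason instead:
query = oomkill_query(
    "kube_pod_container_status_last_terminated_reason", "default", "my-app-.*"
)
print(query)
```

Using max_over_time over a lookback window catches an OOMKill that happened at any point in the range, even if the gauge has since changed.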