manager pod has multiple restarts with oomkilled reason #1416

vbedida79 · 2023-05-12T19:01:35Z

Summary

The operator's controller-manager pod has restarted multiple times with termination reason OOMKilled.

Details

On OCP 4.12, the controller-manager pod for version 0.26.1 restarts repeatedly with reason OOMKilled. As a workaround, we increased the pod's memory limit from 50 MB to 100 MB, which stopped the restarts.
After the increase, we observed memory usage rising slowly over the course of 3 days, from 70 MB initially to 108 MB currently. Could there be an internal memory leak?
The pod logs do not show any errors:

I0512 17:44:29.821411 1 reconciler.go:233] "intel-device-plugins-manager: " controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" SgxDevicePlugin="sgxdeviceplugin-sample" namespace="" name="sgxdeviceplugin-sample" reconcileID=f73e9186-de1a-4b4c-a7c1-50e73a749a63 ="(MISSING)"
I0512 17:44:29.821583 1 reconciler.go:233] "intel-device-plugins-manager: " controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" GpuDevicePlugin="gpudeviceplugin-sample" namespace="" name="gpudeviceplugin-sample" reconcileID=0c1fac7a-7709-4dcd-b123-54f8a453ff4f ="(MISSING)"
I0512 17:44:29.821603 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=84c9d221-e5f4-4348-b7da-a2acae338b90 ="(MISSING)"
I0512 17:44:29.828747 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=c89e9d61-74b2-40f6-aa6c-3bda3437ce9e ="(MISSING)"

Possible solutions

Is increasing the memory limit an adequate fix? The root cause could be an internal memory leak in the application.
A temporary workaround could be to increase the number of replicas so that pod restarts do not overlap, although that would not address a memory leak.
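
Editorial note: one way to confirm that the restarts really are OOM kills is to look at each container's last termination state in the Pod object (`status.containerStatuses[].lastState.terminated.reason`). The Go sketch below parses the JSON form of a Pod, such as the output of `kubectl get pod <name> -o json`, and lists the containers whose last termination reason was `OOMKilled`. The type and function names and the sample JSON are made up for illustration; nothing here is taken from the operator's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pod models only the Pod fields this sketch needs (hypothetical type name).
type pod struct {
	Status struct {
		ContainerStatuses []struct {
			Name         string `json:"name"`
			RestartCount int    `json:"restartCount"`
			LastState    struct {
				// Terminated is nil when the container has no previous termination.
				Terminated *struct {
					Reason string `json:"reason"`
				} `json:"terminated"`
			} `json:"lastState"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// oomKilledContainers returns "name (restarts=N)" for every container whose
// last termination reason was OOMKilled.
func oomKilledContainers(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, err
	}
	var out []string
	for _, cs := range p.Status.ContainerStatuses {
		if t := cs.LastState.Terminated; t != nil && t.Reason == "OOMKilled" {
			out = append(out, fmt.Sprintf("%s (restarts=%d)", cs.Name, cs.RestartCount))
		}
	}
	return out, nil
}

// samplePod is made-up data, not output captured from the cluster in this issue.
const samplePod = `{"status":{"containerStatuses":[
  {"name":"manager","restartCount":7,"lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}},
  {"name":"sidecar","restartCount":0,"lastState":{}}
]}}`

func main() {
	hits, err := oomKilledContainers([]byte(samplePod))
	if err != nil {
		panic(err)
	}
	fmt.Println(hits)
}
```

A container that was killed for another reason, or that never restarted, is skipped, so an empty result suggests the restarts have a different cause.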

mythi · 2023-05-15T05:31:34Z

@vbedida79 is this a regression from v0.26.0? Were you running any test cases during those days?

tkatila · 2023-05-15T17:24:56Z

How is the operator deployed? I had an operator online for 11 hours with a script adding and removing a GPU CR. The memory footprint fluctuated between 45 and 47 MB, but it didn't seem to increase steadily.

vbedida79 · 2023-05-15T17:28:56Z

> @vbedida79 is this a regression from v0.26.0? Were you running any test cases during those days?

We observed restarts with 0.26.0 as well, but wondered whether they were due to TLS handshake errors, since we had not yet increased the memory limit for that pod. After increasing the memory limit there were no restarts, but we could see the steady increase.
No specific test cases, apart from SGX (SGX SDK demo) and GPU (clinfo) job workloads with 0.26.1.

vbedida79 · 2023-05-15T17:30:35Z

> How is the operator deployed? I had an operator online for 11 hours with a script adding and removing a GPU CR. The memory footprint fluctuated between 45 and 47 MB, but it didn't seem to increase steadily.

Deployed the operator with operator-sdk on OCP 4.12. Changed the pod's memory limit to 100 MB 4 days ago. Usage started at around 50 MB and has increased over time; current usage is around 105 MB.
Would deleting CRs cause steady fluctuations? Could there be another root cause?

mythi · 2023-05-15T17:49:33Z

> No specific test cases, apart from SGX (SGX SDK demo) and GPU (clinfo) job workloads with 0.26.1.

Do you deploy/undeploy them in a loop?

vbedida79 · 2023-05-15T17:56:58Z

No loop. Since increasing the memory limit, we have deployed these jobs once.

vbedida79 · 2023-05-17T16:50:19Z

No workloads are currently running. Since the operator was deployed, the pod's memory usage has increased from 50 MB to 100 MB, and it has since stayed at a consistent average of ~107 MB. Is this normal and expected?

vbedida79 · 2023-05-19T17:44:22Z

Update: I checked, and the manager container's memory spiked from 50 MB to 80 MB when the pod was created. After that, it has held a constant average of 80-90 MB.
The solution was to increase the limit from 50 MB to 100 MB to avoid restarts. Is the usage the same in your environment?

mythi · 2023-05-19T18:14:25Z

> The solution was to increase the limit from 50 MB to 100 MB to avoid restarts. Is the usage the same in your environment?

Yes, pretty much, and we are going to update the limit as well. Thanks for checking!

vbedida79 · 2023-05-19T18:31:06Z

Got it, thanks. Will this change be included in a specific release, or will it be available immediately? We currently use 0.26.1 for OCP 4.12.

mythi · 2023-05-22T04:23:35Z

Only in main, so it will be in 0.27.

@tkatila can you create the PR? 100M request, 120M limit?

tkatila · 2023-05-22T05:13:06Z

> @tkatila can you create the PR? 100M request, 120M limit?

Sure
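
Editorial note: the request above is written with the decimal suffix `M`, while the resulting commit title uses the binary suffix `Mi` ("100/120Mi"). Kubernetes resource quantities treat these differently: `M` means 10^6 bytes and `Mi` means 2^20 bytes. The Go sketch below shows the size of the gap; `toBytes` is a hypothetical helper that handles only these two suffixes and is not the Kubernetes quantity parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toBytes converts a quantity such as "100M" or "120Mi" to bytes.
// Hypothetical helper for illustration: only the M and Mi suffixes are supported.
func toBytes(q string) (int64, error) {
	switch {
	case strings.HasSuffix(q, "Mi"):
		n, err := strconv.ParseInt(strings.TrimSuffix(q, "Mi"), 10, 64)
		return n * 1024 * 1024, err // binary mebibytes
	case strings.HasSuffix(q, "M"):
		n, err := strconv.ParseInt(strings.TrimSuffix(q, "M"), 10, 64)
		return n * 1000 * 1000, err // decimal megabytes
	}
	return 0, fmt.Errorf("unsupported quantity %q", q)
}

func main() {
	for _, q := range []string{"100M", "100Mi", "120M", "120Mi"} {
		b, err := toBytes(q)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-6s = %d bytes\n", q, b)
	}
}
```

The binary values are about 4.9% larger (100Mi is 104857600 bytes, 120Mi is 125829120 bytes), which matters little here but is worth keeping in mind when comparing a configured limit against usage figures reported in MB.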

Fixes: intel#1416 Signed-off-by: Tuomas Katila <[email protected]>

vbedida79 · 2023-05-22T15:41:24Z

Thanks @mythi @tkatila.
We use the 0.26.1 release to publish the certified operator on OpenShift 4.12, and we observe frequent restarts with it.
Can the memory increase be included in the 0.26.1 release, or in a branch based on it?

mythi · 2023-05-22T15:53:07Z

Is it not possible to modify the bundle before you release?

vbedida79 · 2023-05-22T16:06:00Z

@uMartinXu any thoughts?

uMartinXu · 2023-05-22T17:18:49Z

I think if there are no other new problems found in 0.26.1, we can add this small change to our bundle and release it.
@chaitanya1731 what do you think of this? :-)

uMartinXu · 2023-05-24T15:40:04Z

@mythi @tkatila Have you had a chance to test how much memory the operator actually consumes on vanilla K8s? I want to check whether it consumes more memory on OCP than on vanilla K8s. Thanks!

tkatila · 2023-05-25T07:15:56Z

@uMartinXu I did run two basic scenarios with the 0.26.1 operator:

1. Applied and deleted device plugin CRs in a loop
   - Memory consumption fluctuated between 38-48MB over ~10 hour duration. No OOM kills.
2. Applied a set of CRs and let the operator just idle
   - Memory consumption started from 43MB and decreased down to 38MB. This was with (I think) three days of idling. No OOM kills.

The operator has actually now been online for 10 days without restarts.

I applied the operator with the YAML deployment files, whereas you mentioned installing it via the operator SDK. Maybe that's the difference? I'm not familiar with the SDK.

uMartinXu mentioned this issue May 12, 2023

Upstream & OCP Tasks Tracking List intel/intel-technology-enabling-for-openshift#28

Closed

14 tasks

tkatila added a commit to tkatila/intel-device-plugins-for-kubernetes that referenced this issue May 22, 2023

operator: increase memory resources to 100/120Mi

13097ac

Fixes: intel#1416 Signed-off-by: Tuomas Katila <[email protected]>

tkatila mentioned this issue May 22, 2023

operator: increase memory resources to 100/120Mi #1429

Merged

mythi closed this as completed in #1429 May 22, 2023

tkatila added a commit to tkatila/intel-device-plugins-for-kubernetes that referenced this issue Aug 8, 2023

operator: increase memory resources to 100/120Mi

32ed51c

Fixes: intel#1416 Signed-off-by: Tuomas Katila <[email protected]>
