manager pod has multiple restarts with oomkilled reason #1416
Comments
@vbedida79 is this a regression from v0.26.0? Did you have a test case running during those days?
How is the operator deployed? I had an operator online for 11 hours with a script adding and removing a GPU CR, and the memory footprint fluctuated between 45 and 47 MB, but it didn't seem to increase constantly.
We have observed restarts with 0.26.0, but wonder if they could be due to TLS handshake errors, as we hadn't increased the memory limit for that pod yet. After increasing the memory limit there were no restarts, but we could see a steady increase in usage.
Deployed the operator with operator-sdk on OCP 4.12. Changed the pod's memory limit to 100 MB 4 days ago. It began by using around 50 MB. Over time, usage has increased and is currently around 105 MB.
Do you deploy/undeploy them in a loop?
No loop. Since increasing the memory limit, we have deployed these jobs once.
No workloads are running currently. The pod's memory usage has increased from 50 MB at the time the operator was deployed to 100 MB, and it has stayed in a consistent average range of ~107 MB since then. Is this normal and expected?
Update: I checked, and the manager container's memory spiked from 50 to 80 MB when the pod was created. Since then it has held a constant average of 80-90 MB.
Yes, pretty much, and we are going to update the limit as well. Thanks for checking!
Got it, thanks. Will this change be included in a specific release, or is it available immediately? Currently we use 0.26.1 for OCP 4.12.
Only in main, which will be in 0.27. @tkatila can you create the PR? 100M request, 120M limit?
Sure
Fixes: intel#1416 Signed-off-by: Tuomas Katila <[email protected]>
Is it not possible to modify the bundle before you release?
@uMartinXu any thoughts?
I think if there are no other new problems found in 0.26.1, we can add this small change to our bundle and release it.
@uMartinXu I did run two basic scenarios with the 0.26.1 operator:
The operator has actually now been online for 10 days without restarts. I applied the operator with the YAML deployment files; you mentioned installing it via operator-sdk. Maybe that's the difference? I'm not familiar with the SDK.
Summary
The operator's controller-manager pod restarts repeatedly with the termination reason OOMKilled.
Details
The controller manager pod for version 0.26.1 restarts repeatedly with reason OOMKilled on OCP 4.12. We have increased the pod's memory limit from 50 MB to 100 MB, which stopped the restarts.
After the memory increase, we observed memory usage rising slowly over the course of 3 days, from 70 MB initially to 108 MB currently. Could there be an internal memory leak?
The pod logs do not show any errors:
I0512 17:44:29.821411 1 reconciler.go:233] "intel-device-plugins-manager: " controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" SgxDevicePlugin="sgxdeviceplugin-sample" namespace="" name="sgxdeviceplugin-sample" reconcileID=f73e9186-de1a-4b4c-a7c1-50e73a749a63 ="(MISSING)"
I0512 17:44:29.821583 1 reconciler.go:233] "intel-device-plugins-manager: " controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" GpuDevicePlugin="gpudeviceplugin-sample" namespace="" name="gpudeviceplugin-sample" reconcileID=0c1fac7a-7709-4dcd-b123-54f8a453ff4f ="(MISSING)"
I0512 17:44:29.821603 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=84c9d221-e5f4-4348-b7da-a2acae338b90 ="(MISSING)"
I0512 17:44:29.828747 1 reconciler.go:233] "intel-device-plugins-manager: " controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" QatDevicePlugin="qatdeviceplugin-sample" namespace="" name="qatdeviceplugin-sample" reconcileID=c89e9d61-74b2-40f6-aa6c-3bda3437ce9e ="(MISSING)"
Possible solutions