[SPIKE] Research why Epiphany nodes hang when memory is overcommitted #1908
Comments
I was basically unable to reproduce the issue in a generic way. I tried Ubuntu and JVM versions exactly as on the faulty environment:
Epiphany's Kubernetes behaved well in every attempt to crash the node VMs. I think the next step would be to work closely with the team that experienced the issue and carefully observe what their software does. 🤔 @to-bar thank you for the very nice memory stress testing toolset! 👍😍
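(For anyone retrying the reproduction later, a minimal sketch of a generic memory-stress pod, assuming an image with stress-ng is available. This is an illustration, not the actual toolset mentioned above; the image name and sizes are placeholders.)

```yaml
# Hypothetical reproduction pod: a low request with no memory limit lets the
# workers burst well past what the scheduler accounted for on the node.
apiVersion: v1
kind: Pod
metadata:
  name: memory-stress
spec:
  restartPolicy: Never
  containers:
    - name: stress
      image: example/stress-ng:latest  # placeholder; any image shipping stress-ng works
      command: ["stress-ng"]
      # 4 workers together touching ~75% of the node's memory for 10 minutes
      args: ["--vm", "4", "--vm-bytes", "75%", "--timeout", "600s"]
      resources:
        requests:
          memory: "256Mi"  # deliberately low request, no limit -> overcommit
```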
@sk4zuzu what is the next step in this task? Why is it blocked?
I got VPN access and I'm looking at the cluster where it all happened, because the "generic" tests showed no issues. I think I can move it to "in progress" now and continue examining the cluster.
Tested
Is your feature request related to a problem? Please describe.
It seems that when k8s resource limits are incorrectly defined and memory is overcommitted by applications, the user loses all SSH access to Epiphany VMs. This indicates that the operating system is not protected from memory and CPU overuse. Usually it's not possible to debug such machines; a restart is required and the cluster becomes unstable. Since the OS is not protected, there is also a security risk of possible denial-of-service attacks. 😱
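(For context, a minimal sketch of how such overcommit arises: the scheduler places pods based on requests only, so a container whose memory limit far exceeds its request, or that has no limit at all, can burst well beyond what the scheduler reserved for it. Names and values below are illustrative.)

```yaml
# Illustrative only: the scheduler reserves 256Mi for this pod (the request),
# but the container is allowed to grow to 4Gi (the limit) -> memory overcommit.
apiVersion: v1
kind: Pod
metadata:
  name: overcommit-demo
spec:
  containers:
    - name: app
      image: example/app:latest  # placeholder
      resources:
        requests:
          memory: "256Mi"
        limits:
          memory: "4Gi"
```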
I suppose it's likely that systemd slices are not correctly configured on Epiphany VMs in the 0.8 release (at least).
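(One standard mitigation, sketched here with assumed values rather than Epiphany's actual settings: have kubelet reserve memory/CPU for the OS and for itself, enforce the pod allocatable boundary via cgroups, and evict pods before the node runs dry. This is the mechanism that would keep sshd reachable even under pod memory pressure.)

```yaml
# /var/lib/kubelet/config.yaml — example KubeletConfiguration; values are assumptions
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Keep headroom for OS daemons (sshd, systemd, ...) and for kubelet and the
# container runtime, so pods can never consume the whole node.
systemReserved:
  cpu: "500m"
  memory: "512Mi"
kubeReserved:
  cpu: "500m"
  memory: "512Mi"
# Enforce the pods cgroup boundary (backed by systemd slices under the systemd cgroup driver).
enforceNodeAllocatable: ["pods"]
# Start evicting pods before the node itself runs out of memory.
evictionHard:
  memory.available: "200Mi"
```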
Describe the solution you'd like
In this spike we need to determine the culprit and research ways to prevent such situations in the future. We also need confirmation that this situation actually happens.
Describe alternatives you've considered
None.
Additional context
Logs found on affected nodes: