
[SPIKE] Research why Epiphany nodes hang when memory is overcommitted #1908

Closed
sk4zuzu opened this issue Dec 11, 2020 · 4 comments

sk4zuzu (Contributor) commented Dec 11, 2020

Is your feature request related to a problem? Please describe.
It seems that when k8s resource limits are incorrectly defined and memory is overcommitted by applications, the user loses all SSH access to the Epiphany VMs. This indicates that the operating system is not protected from memory and CPU overuse. It's usually impossible to debug such machines, a restart is required, and the cluster becomes unstable. Since the OS is not protected, there is also a security risk of possible denial-of-service attacks. 😱

I suspect that systemd slices are not correctly configured on Epiphany VMs, at least in the 0.8 release.
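The missing protection would typically be expressed as kubelet resource reservations, which are enforced through cgroups (systemd slices when the systemd driver is in use). A minimal sketch, assuming the standard KubeletConfiguration format; the concrete values are illustrative assumptions, not Epiphany's actual settings:

  # /var/lib/kubelet/config.yaml (fragment)
  kind: KubeletConfiguration
  apiVersion: kubelet.config.k8s.io/v1beta1
  # Reserve CPU/memory for the OS and for Kubernetes daemons so that
  # pods can never consume the whole node.
  systemReserved:
    cpu: 500m
    memory: 512Mi
  kubeReserved:
    cpu: 500m
    memory: 512Mi
  # Enforce pod limits via the node's cgroup hierarchy.
  enforceNodeAllocatable:
    - pods
  # Evict pods before the node itself runs out of memory.
  evictionHard:
    memory.available: 200Mi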

Describe the solution you'd like
In this spike we need to determine the culprit and research ways to prevent such situations in the future. We also need to confirm that this situation actually happens.

Describe alternatives you've considered
None.

Additional context
Logs found on affected nodes:

Killed process 13428 (java) total-vm:14218144kB, anon-rss:2096560kB, file-rss:3684kB, shmem-rss:0kB, UID:0 pgtables:25228kB oom_score_adj:998
kernel: [200403.323264] memory: usage 3145696kB, limit 3145728kB, failcnt 17191
dockerd[14005]: time="2020-11-25T21:40:40.609327299Z" level=warning msg="Your kernel does not support swap limit capabilities, or the cgroup is not mounted. Memory limited without swap."
kernel: [ 0.289359] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
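For context, the second kernel line shows a container cgroup being killed right at its 3145728 kB (3 GiB) limit. Overcommit arises because scheduling uses requests while the kernel enforces limits; a hypothetical manifest (not from the affected cluster) illustrates the pattern:

  # Many such pods fit on one node by their 512Mi requests, but their
  # combined 3Gi limits can exceed the node's physical memory.
  apiVersion: v1
  kind: Pod
  metadata:
    name: overcommit-example
  spec:
    containers:
      - name: app
        image: openjdk:11   # illustrative image
        resources:
          requests:
            memory: 512Mi
          limits:
            memory: 3Gi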

sk4zuzu (Contributor, Author) commented Dec 23, 2020

I was basically unable to reproduce the issue in a generic way. I tried the exact Ubuntu and JVM versions from the faulty environment:

  storage_image_reference:
    publisher: Canonical
    offer: UbuntuServer
    sku: 18.04-LTS
    version: 18.04.202006101

openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment (build 11.0.5+10-alpine-r0)
OpenJDK 64-Bit Server VM (build 11.0.5+10-alpine-r0, mixed mode)

Epiphany's Kubernetes behaved well in every attempt to crash node VMs. I think the next step would be to work closely with the team that experienced the issue and carefully observe what their software does. 🤔

@to-bar thank you for very nice memory stress testing toolset! 👍😍
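The toolset itself isn't attached to this thread; as a rough stand-in, a memory stress of this kind could be driven with stress-ng (an assumption, not necessarily what was actually used):

  # Spawn 2 workers that each allocate and hold 2 GiB for 5 minutes,
  # enough to provoke the OOM killer or kubelet eviction on a small node.
  stress-ng --vm 2 --vm-bytes 2G --vm-keep --timeout 300s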

mkyc modified the milestones: S20201231, S20210114 (Jan 4, 2021)

mkyc (Contributor) commented Jan 14, 2021

@sk4zuzu what is the next step here in this task? Why is it blocked?

mkyc modified the milestones: S20210114, S20210128 (Jan 14, 2021)

sk4zuzu (Contributor, Author) commented Jan 14, 2021

I got VPN access and I'm looking at the cluster where it all happened, since the "generic" tests showed no issues. I think I can move this to "in progress" now and continue examining the cluster.

przemyslavic self-assigned this (Jan 27, 2021)
mkyc modified the milestones: S20210128, S20210211 (Jan 28, 2021)

przemyslavic (Collaborator) commented Jan 29, 2021

Tested:
- epicli apply
- epicli upgrade from 0.6, 0.7 HA, and 0.8 to develop

After patching the cgroup driver (switching from cgroupfs to systemd), it looks to be working properly.
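For reference, a minimal sketch of that switch on a Docker-based node; the exact files Epiphany patches are an assumption here:

  # /etc/docker/daemon.json -- make Docker use the systemd cgroup driver
  {
    "exec-opts": ["native.cgroupdriver=systemd"]
  }

  # /var/lib/kubelet/config.yaml (fragment) -- kubelet must use the same driver
  cgroupDriver: systemd

  # restart both services to apply
  systemctl daemon-reload
  systemctl restart docker kubelet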
