
[SPIKE] Research why Epiphany nodes hang when memory is overcommitted #1908

Closed
sk4zuzu opened this issue Dec 11, 2020 · 4 comments

sk4zuzu (Contributor) commented Dec 11, 2020

Is your feature request related to a problem? Please describe.
It seems that when k8s resource limits are incorrectly defined and memory is overcommitted by applications, the user loses all SSH access to the Epiphany VMs. This indicates that the operating system is not protected from memory and CPU overuse. It's usually impossible to debug such machines, a restart is required, and the cluster becomes unstable. Since the OS is not protected, there is also a security risk of possible denial-of-service attacks. 😱

I suspect that systemd slices are not correctly configured on Epiphany VMs, at least in the 0.8 release.
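The missing protection would typically be expressed as kubelet resource reservations, which are enforced through cgroups (systemd slices when the systemd driver is in use). A minimal sketch, assuming the standard KubeletConfiguration format; the concrete values are illustrative assumptions, not Epiphany's actual settings:

  # /var/lib/kubelet/config.yaml (fragment)
  kind: KubeletConfiguration
  apiVersion: kubelet.config.k8s.io/v1beta1
  # Reserve CPU/memory for the OS and for Kubernetes daemons so that
  # pods can never consume the whole node.
  systemReserved:
    cpu: 500m
    memory: 512Mi
  kubeReserved:
    cpu: 500m
    memory: 512Mi
  # Enforce pod limits via the node's cgroup hierarchy.
  enforceNodeAllocatable:
    - pods
  # Evict pods before the node itself runs out of memory.
  evictionHard:
    memory.available: 200Mi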

Describe the solution you'd like
In this spike we need to determine the culprit and research ways to prevent such situations in the future. We also need to confirm that this situation actually happens.

Describe alternatives you've considered
None.

Additional context
Logs found on affected nodes:

Killed process 13428 (java) total-vm:14218144kB, anon-rss:2096560kB, file-rss:3684kB, shmem-rss:0kB, UID:0 pgtables:25228kB oom_score_adj:998
kernel: [200403.323264] memory: usage 3145696kB, limit 3145728kB, failcnt 17191
dockerd[14005]: time="2020-11-25T21:40:40.609327299Z" level=warning msg="Your kernel does not support swap limit capabilities, or the cgroup is not mounted. Memory limited without swap."
kernel: [ 0.289359] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
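For context, the second kernel line shows a container cgroup being killed right at its 3145728 kB (3 GiB) limit. Overcommit arises because scheduling uses requests while the kernel enforces limits; a hypothetical manifest (not from the affected cluster) illustrates the pattern:

  # Many such pods fit on one node by their 512Mi requests, but their
  # combined 3Gi limits can exceed the node's physical memory.
  apiVersion: v1
  kind: Pod
  metadata:
    name: overcommit-example
  spec:
    containers:
      - name: app
        image: openjdk:11   # illustrative image
        resources:
          requests:
            memory: 512Mi
          limits:
            memory: 3Gi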

sk4zuzu (Contributor, Author) commented Dec 23, 2020

I was basically unable to reproduce the issue in a generic way. I tried the exact Ubuntu and JVM versions from the faulty environment:

  storage_image_reference:
    publisher: Canonical
    offer: UbuntuServer
    sku: 18.04-LTS
    version: 18.04.202006101

openjdk version "11.0.5" 2019-10-15
OpenJDK Runtime Environment (build 11.0.5+10-alpine-r0)
OpenJDK 64-Bit Server VM (build 11.0.5+10-alpine-r0, mixed mode)

Epiphany's Kubernetes behaved well in every attempt to crash node VMs. I think the next step would be to work closely with the team that experienced the issue and carefully observe what their software does. 🤔

@to-bar thank you for very nice memory stress testing toolset! 👍😍
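The toolset itself isn't attached to this thread; as a rough stand-in, a memory stress of this kind could be driven with stress-ng (an assumption, not necessarily what was actually used):

  # Spawn 2 workers that each allocate and hold 2 GiB for 5 minutes,
  # enough to provoke the OOM killer or kubelet eviction on a small node.
  stress-ng --vm 2 --vm-bytes 2G --vm-keep --timeout 300s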

mkyc modified the milestones: S20201231, S20210114 (Jan 4, 2021)

mkyc (Contributor) commented Jan 14, 2021

@sk4zuzu what is the next step here in this task? Why is it blocked?

mkyc modified the milestones: S20210114, S20210128 (Jan 14, 2021)

sk4zuzu (Contributor, Author) commented Jan 14, 2021

I got VPN access and I'm looking at the cluster where it all happened, since the "generic" tests showed no issues. I think I can move this to "in progress" now and continue examining the cluster.

przemyslavic self-assigned this (Jan 27, 2021)
mkyc modified the milestones: S20210128, S20210211 (Jan 28, 2021)

przemyslavic (Collaborator) commented Jan 29, 2021

Tested:
- epicli apply
- epicli upgrade from 0.6, 0.7 HA, and 0.8 to develop

After patching the cgroup driver (switching from cgroupfs to systemd), it looks to be working properly.
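For reference, a minimal sketch of that switch on a Docker-based node; the exact files Epiphany patches are an assumption here:

  # /etc/docker/daemon.json -- make Docker use the systemd cgroup driver
  {
    "exec-opts": ["native.cgroupdriver=systemd"]
  }

  # /var/lib/kubelet/config.yaml (fragment) -- kubelet must use the same driver
  cgroupDriver: systemd

  # restart both services to apply
  systemctl daemon-reload
  systemctl restart docker kubelet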
