Hi. This is my first post, so I'm not sure I'm doing this right; sorry if not.
I am using a fork of tack in my production environments and would like to share some important points I found.
1 - Disable auto updates (CoreOS).
Today CoreOS and Kubernetes updates are not synchronized, which means nodes get rebooted and you see errors (until the health checks fail and the pods are rescheduled, including in ingress) or downtime. coreos/bugs#1274
units:
  - name: update-engine.service
    mask: true
  - name: locksmithd.service
    mask: true
Or try https://github.com/coreos/container-linux-update-operator (referenced in the bug above); I still need to test it.
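For reference, in a full CoreOS cloud-config the units above sit under the coreos: key. A minimal sketch (assuming the cloud-config format rather than Ignition; adapt it to wherever tack renders its user-data):

#cloud-config
coreos:
  units:
    # Mask update-engine so Container Linux never downloads a new image on its own.
    - name: update-engine.service
      mask: true
    # Mask locksmithd so it never reboots the node to apply an update.
    - name: locksmithd.service
      mask: true

You can check that both services are masked on a node with systemctl status update-engine.service locksmithd.service.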
2 - Very important, this caused a lot of issues for us.
Coredumps are normal and happen a lot. The problem is that systemd's default action is to capture the dump and compress it; while that runs, the machine's CPU is fully consumed, degrading everything else running on the same node. This was the main cause of our first downtime on k8s: some bad threads generated coredumps, the CPU hit 100%, health checks failed for all the other pods, the pods were killed, and so on.
So basically we disable saving coredumps to disk (they are only logged):
/etc/systemd/coredump.conf

[Coredump]
Storage=none
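If you want to bake this in at provisioning time instead of editing the file by hand, one option (a sketch, assuming the same cloud-config mechanism as in point 1; the path and permissions are just the stock systemd defaults) is a write_files entry:

#cloud-config
write_files:
  # Tell systemd-coredump to log that a crash happened but not to store
  # (and compress) the dump itself, which is what was eating the CPU.
  - path: /etc/systemd/coredump.conf
    permissions: "0644"
    content: |
      [Coredump]
      Storage=none

Depending on the systemd version, ProcessSizeMax=0 in the same file is another knob to stop large dumps from being processed at all.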
We are running:
hyperkube-tag = "v1.5.4_coreos.0"