One of the etcd containers got deleted automatically, two of the etcd containers' status went to Completed state #15
I am able to reproduce this issue. This is the second time I have hit it; the steps I tried are below.
The setup went to the state below.
|
With the latest repo, I created one PVC and then rebooted one of the GD2 pods. Two of the etcd containers' status went to Completed state. I then logged in to a GD2 pod and tried to run glustercli commands; the GD2 pods went into a hung state.
|
I faced the same issue. My pods went into CrashLoopBackOff mode.
|
Logs from the etcd container:
|
As the etcd containers are going down, we are facing issues in the glusterd2 containers.
|
@kshlm I think we have some issue with the etcd cluster deployed by the operator. Any thoughts on this? |
Etcd pods' status went to Completed state after a reboot/delete of the GD2 pods or after a reboot of the kube nodes. kubectl -n gcs get all |
The error message says it is trying to start the embedded store. Is there a chance the glusterd2 pod is losing the configuration set via env options (like
|
This is surprising. I need to look into this, and the etcd-operator issues. |
Two PVCs were mounted to 40 app pods, and I started running I/Os on one of the mount points. While the I/Os were running, I tried in parallel to create 30 more app pods with the same two PVC mount points. The etcd pods' and glusterd2 pods' status went to the state below.
|
@rmadaka can you provide some logs from the etcd pods as well? That would be helpful. |
Create one PVC and mount it to an app pod. Log in to the GD2 pod and kill one brick, then start running I/Os on the mount point. The etcd pods' status went to the state below.
|
These are the logs for the etcd pod which is in Error state:
|
These are the logs for the etcd pod which is in Completed state: 2018-10-24 10:18:56.510531 I | raft: 4c6267e9f13aa2f7 became follower at term 103 |
@atinmu Yes Atin, I tried with etcd 3.2.24 as well and am able to hit the same issue. |
@Madhu-1 Can we please talk to some of the etcd folks (through an issue) to understand what's going wrong here? |
@atinmu created ETCD issue for this coreos/etcd-operator#2004 |
@rmadaka are we hitting this issue coreos/etcd-operator#1130? |
Some more logs after debugging:
|
@Madhu-1 did you get a chance to test this out by increasing the overall space for etcd? Did that make any difference? |
Is this a lack of sufficient CPU resources in this environment (pod or VM)? |
From the infra side we have plenty of resources available. |
After extending the root space of the host, I am now able to create more PVCs.
|
@atinmu With the latest build I am able to reproduce this issue easily. I will give my setup to Madhu. |
Adding my observations here. After 2 hours of idle time, I observed that the etcd pods went to Completed state.
I looked at the etcd logs as well; the error messages below are the ones that keep coming up repeatedly.
I looked at the Kubernetes events as well; the events below were observed continuously.
I also observed the CPU and memory consumption of one of the master nodes.
|
@rmadaka memory consumption needs to be checked on the etcd pods. |
I see the virtual memory size (VSZ) clocking in at 558%. This is on an idle setup. I need to check how VSZ works. |
Created a GCS setup with 16 vCPUs and 32 GB RAM for each kube node, then tried to create 1000 PVCs using a script, each PVC 1 GB in size. Observations:
-> Of the 1000 PVCs I tried to create, 615 PVCs were bound; the remaining PVCs are in Pending state.
-> After 12 hours of idle time, the GD2 pods had rebooted more than 100 times.
-> After 12 hours of idle time, one of the etcd pods is in NotReady state.
-> I also observed the etcd pods' memory and CPU utilization. The numbers below were taken after the 1000 PVC creation attempt, with the systems idle for more than 12 hours, by logging into each etcd pod and running top. Note: only two etcd pods' memory consumption is shown, because I was not able to log in to the third etcd pod since it is not in Ready state.
|
@rmadaka I believe this was due to the resource constraint; with 32 GB RAM and 8 CPUs it's not visible any further, correct? |
We did benchmarking around this bug with different resources (vCPUs and RAM) => 2 vCPUs & 2 GB. Also, I believe @rmadaka has seen one of the etcd pods in Completed state with 8 CPUs and 32 GB RAM in one scenario. Please confirm, Rajesh. |
One possible issue is the etcd space filling up beyond the limit documented at https://coreos.com/etcd/docs/latest/dev-guide/limit.html
We need a way to configure a larger size via etcd-operator or an env variable. |
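For illustration only, a minimal sketch of how a larger backend size might be set via an env variable, assuming it is added under the EtcdCluster's spec.pod.etcdEnv field (used later in this thread); the 8 GiB value is just an example, not a figure from this thread:
etcdEnv:
  # ETCD_QUOTA_BACKEND_BYTES corresponds to etcd's --quota-backend-bytes flag;
  # the default quota is about 2 GB. 8589934592 bytes = 8 GiB (example value only).
  - name: ETCD_QUOTA_BACKEND_BYTES
    value: "8589934592"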
Fast-growing etcd size is discussed in etcd-io/etcd#8009, and the recommended solution there is to use the latest version of etcd. |
Another thing to consider is the snapshot interval and the compaction frequency. It looks like the snapshot interval affects memory usage, and storage space will grow indefinitely if compaction isn't enabled, yet for some reason it defaults to off. I've been playing around with these features a bit and I don't have anything concrete at this point, but @rmadaka could probably quickly get a comparison. For reference, here's an updated EtcdCluster manifest:
---
kind: EtcdCluster
apiVersion: etcd.database.coreos.com/v1beta2
metadata:
  name: etcd
  namespace: {{ gcs_namespace }}
  labels:
    app.kubernetes.io/part-of: gcs
    app.kubernetes.io/component: etcd
    app.kubernetes.io/name: etcd-cluster
spec:
  pod:
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: etcd_cluster
                operator: In
                values:
                - etcd
            topologyKey: kubernetes.io/hostname
    etcdEnv:
      # https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/maintenance.md#auto-compaction
      - name: ETCD_AUTO_COMPACTION_MODE
        value: "periodic"
      - name: ETCD_AUTO_COMPACTION_RETENTION
        value: "5m"
      - name: ETCD_SNAPSHOT_COUNT
        value: "10000"
  size: 3
  version: 3.3.8
  # TODO: Setup backup
|
From gcs-etcd-cluster I can see the version is 3.3.8 already? |
Yes. I think we need to try the compaction config suggested by @JohnStrunk and increase the etcd storage limit by configuring |
@rmadaka also add the following to your etcd yml file
|
@aravindavk under etcdEnv? The updated YAML file looks like the one below. Please correct me if anything needs to change.
|
LGTM. @Madhu-1 please check |
@atinmu @JohnStrunk @Madhu-1 @aravindavk I have tried with the above etcd config file, and I also incorporated what Aravinda mentioned about size into the etcd YAML file. Then I deployed the GCS setup.
I tried non-sequential creation: my script creates PVCs continuously, one by one, without waiting for the previous PVC to get bound. Kube node config: 32 GB RAM, 8 vCPUs.
With brick multiplexing: the script ran for 1000 PVCs, but only 402 got bound. After 2 hours of idle time the etcd pods went to Error state, and after some time one of the kube nodes went to NotReady state. I need to analyze the logs for this.
Without brick multiplexing: the script ran for 1000 PVCs, but only 569 got bound. I was not able to keep this setup idle because of a lack of server space.
I am about to try sequential PVC creation (I mean the script will wait for each PVC to get bound before creating the next one). |
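For reference, a minimal sketch of the kind of 1 GiB claim such a script would create in a loop; the claim name and the glusterfs-csi StorageClass name here are assumptions for illustration, not taken from this thread:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-pvc-1        # the script would substitute a unique name per iteration
  namespace: gcs          # namespace used elsewhere in this thread
spec:
  storageClassName: glusterfs-csi   # assumed StorageClass name
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi        # each PVC is 1 GB per the test description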
Thanks Rajesh for the update. Below are my observations with brick mux enabled.
During the creation of the PVCs, multiple GD2 pod reboots were seen; kube3 was the first of the 3 to restart. I also see the etcd operator pod getting spun up on a different node. |
@atinmu Do we need to carry out the testing with the new etcd config suggested by @JohnStrunk & @aravindavk, or shall we continue with the default config of the etcd pods? |
Fixes: #15 Signed-off-by: Atin Mukherjee <[email protected]>
It seems the tuning has helped, many thanks @JohnStrunk. In none of our test setups did we see etcd pods going into Completed state :-) |
Can we make this the default in the latest nightly? |
It seems like one of the etcd containers got deleted; I am able to see only two etcd containers, which are in "Completed" state, and one etcd operator container which is in Running state.
The GD2 containers keep restarting; I think this is because of etcd. I am able to see the errors below in glusterd2.log:
time="2018-10-05 14:03:20.249447" level=warning msg="could not read store config file, continuing with defaults" error="open /var/lib/glusterd2/store.toml: no such file or directory" source="[config.go:128:store.GetConfig]"
time="2018-10-05 14:03:25.250568" level=error msg="failed to start embedded store" error="context deadline exceeded" source="[embed.go:36:store.newEmbedStore]"
time="2018-10-05 14:03:25.250669" level=fatal msg="Failed to initialize store (etcd client)" error="context deadline exceeded" source="[main.go:101:main.main]"