etcd timeout errors on DinD setup using Prow #1922
Please see the other issue; I suspect our HD is too slow for etcd.
Note that kind's own CI runs in DinD on Prow. We run on fast GCE PD SSD nodes though (because all of CI does, for better build performance etc.). Based on related issues: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean Even if your disk is itself fast enough for etcd, it may not be fast enough if you run N binpacked kind clusters. Feel free to continue discussing, but we'll use the other issue to continue tracking "faster but less safe etcd for CI".
Hi Ben, the disks on our CI nodes are healthy, though they are not SSDs.
This is great! We are definitely going to try that.
Yep, we don't nest kind clusters; everything runs on top of an OpenShift cluster using Prow.
Is there a job in the kind e2e suite that runs with etcd in memory?
Does it mean that even if we use SSDs on our CI nodes we could still encounter those errors?
Healthy or not, they may not have enough IOPS / throughput for N clusters / builds / ... at once. This is a fairly common problem with Prow, versus a build environment where you have one job to a machine (e.g. one Jenkins runner per VM): when you bin-pack jobs/pods you can allocate for CPU, RAM, and disk space, but not for I/O.
You should be able to curl the metrics endpoint I think.
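For example, a minimal sketch of pulling etcd's disk metrics out of a kind node; it assumes the default control-plane container name, that curl exists in the node image, and kubeadm's default of serving etcd metrics on 127.0.0.1:2381:

```bash
# Sketch: read etcd's WAL fsync latency metric from inside the kind
# control-plane node. High fsync latency usually means the disk is too
# slow (or too contended) for etcd.
docker exec kind-control-plane \
  curl -s http://127.0.0.1:2381/metrics | grep etcd_disk_wal_fsync_duration
```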
That, or manage a tmpfs in your script (in place of the emptyDir mount). There should be samples discussed in the issue.
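For illustration, a minimal sketch of the script-managed-tmpfs approach; the mount point is hypothetical and the real samples live in the issue mentioned above:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: back etcd's data with a tmpfs that the CI script
# creates and tears down itself, instead of an emptyDir mount.
set -euo pipefail

ETCD_TMPFS_DIR="/mnt/etcd-tmpfs"   # hypothetical path, not from this thread

mkdir -p "${ETCD_TMPFS_DIR}"
mount -t tmpfs -o size=2g tmpfs "${ETCD_TMPFS_DIR}"

# ... bind the directory into the kind node (e.g. an extraMounts entry in
# the kind config) and point etcd's dataDir at it, then run the tests ...

umount "${ETCD_TMPFS_DIR}"   # clean up when the job is done
```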
You're nesting kind within OpenShift (i.e. Kubernetes) though, which has the problems in #303.
No, not currently. A lot of kind jobs run one to a machine though, because they request > N/2 CPU (for Kubernetes build purposes, not needed by kind itself).
Well, even on the fastest disk in the world, if you try to run ♾️ etcd clusters on one disk you're going to run out of I/O eventually. DinD isn't related; it's just the N-clusters-to-one-disk issue.
Should add: if in-memory etcd is the successful route for you, please let us know. We're considering what a native feature for this and other "no persistence but faster" storage hacks might look like.
Currently we encounter bad performance of etcd on the sriov provider cluster in a DinD setup. We get 'etcdserver: timeout' errors that cause jobs to fail often. In such cases it is recommended [1] to use in-memory etcd. Running etcd in memory should improve performance and will make the sriov provider more stable. [1] kubernetes-sigs/kind#1922 Signed-off-by: Or Mergi <[email protected]>
Currently we encounter bad performance of etcd on the sriov provider cluster in a DinD setup. We get 'etcdserver: timeout' errors that cause jobs to fail often. In cases where etcd has bad performance and the data shouldn't be persistent (e.g. on CI and dev environments) it is recommended [1] to use in-memory etcd. To do that, this commit:
- Adds a kubeadm ClusterConfiguration to the kind config, setting the etcd data directory to '/tmp/...' inside the kind cluster nodes; the '/tmp/' directory is already mounted in RAM as tmpfs.
- Adds 'KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY' (expected values: "true", "false"), which controls running etcd in memory on kind providers.
- Adds 'ETCD_DATA_DIR', which expects a directory path that is mounted in RAM and controls the path of the etcd data directory inside the kind cluster nodes.
Running etcd in memory should improve performance and will stabilize the sriov provider and lanes.
[1] kubernetes-sigs/kind#1922
Signed-off-by: Or Mergi <[email protected]>
* kind common, untangle kind config yaml preparation
On kind cluster creation we pass a yaml file that represents the cluster we want to create. The file is located at kind/manifests/kind.yaml; we make a copy, apply the changes, and pass it to the kind binary. In the current state it is not simple to add more changes to the kind config yaml file, and it's hard to predict how the file will look in the end. This commit's intent is to untangle the way we change kind-config.yaml by doing all the changes from one place, making it easier to maintain.
- '_prepare_kind_config()' handles the logic for preparing the kind config yaml file and prints the final config for better visibility.
- '_add_workers()' iterates '$KUBEVIRT_NUM_NODES-1' times, appending all necessary configuration for each worker.
- '_add_worker_kubeadm_config_patch' appends a kubeadmConfigPatch for a worker node.
Signed-off-by: Or Mergi <[email protected]>
* sriov provider.sh, move kind config changes to kind/common.sh
Make use of 'kind_up'.
Signed-off-by: Or Mergi <[email protected]>
* kind infra, run etcd in memory
Currently we encounter bad performance of etcd on the sriov provider cluster in a DinD setup. We get 'etcdserver: timeout' errors that cause jobs to fail often. In cases where etcd has bad performance and the data shouldn't be persistent (e.g. on CI and dev environments) it is recommended [1] to use in-memory etcd. To do that, this commit:
- Adds a kubeadm ClusterConfiguration to the kind config, setting the etcd data directory to '/tmp/...' inside the kind cluster nodes; the '/tmp/' directory is already mounted in RAM as tmpfs.
- Adds 'KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY' (expected values: "true", "false"), which controls running etcd in memory on kind providers.
- Adds 'ETCD_DATA_DIR', which expects a directory path that is mounted in RAM and controls the path of the etcd data directory inside the kind cluster nodes.
Running etcd in memory should improve performance and will stabilize the sriov provider and lanes.
[1] kubernetes-sigs/kind#1922
Signed-off-by: Or Mergi <[email protected]>
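For reference, a minimal sketch of what such a kind config could look like; 'ETCD_DATA_DIR' is the variable described in the commit message above, the default value here is illustrative, and this is not the exact kubevirtci script:

```bash
#!/usr/bin/env bash
# Sketch: create a kind cluster whose etcd data directory lives under the
# node container's tmpfs-backed /tmp, per the commit message above.
set -euo pipefail

# Illustrative default; the real variable is described in the commit message.
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/tmp/kind-cluster-etcd}"

cat > /tmp/kind-config.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: ${ETCD_DATA_DIR}
EOF

kind create cluster --config /tmp/kind-config.yaml
```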
Hi @BenTheElder, I would like to thank you; I appreciate the support and quick response!!! 😁 🚀
What happened:
We use kind in a DinD setup with Prow on our CI environment for end-to-end tests with KubeVirt, and occasionally encounter
etcdserver: request timed out
errors when we try to create a cluster object (e.g. a CSR, a Secret, a KubeVirt VM).
We tried what was suggested in issue #717, which also recommends increasing fs.inotify.max_user_watches, but it didn't work for us and we still see those errors. I understand this is probably not a very actionable bug report; we are trying to understand the root cause of this.
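For reference, raising that limit looks like the following; the value shown is illustrative, not the one from #717:

```bash
# Illustrative value only: raise the host's inotify watch limit.
sudo sysctl -w fs.inotify.max_user_watches=524288
```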
What you expected to happen:
No etcd timeout errors.
How to reproduce it (as minimally and precisely as possible):
It's pretty hard to reproduce manually, but we do see it happen in about 50% of the Prow jobs.
Anything else we need to know?:
We consistently see '...took too long to execute' warnings all over the logs.
Environment:
kind version (use 'kind version'): kind v0.7.0 go1.13.6 linux/amd64
Kubernetes version (use 'kubectl version'):
Docker version (use 'docker info'): We use a DinD setup.
Docker version on the host:
Docker version on the Prow job pod:
OS (e.g. from /etc/os-release): Host: CentOS 7, 3.10.0-1127.19.1.el7.x86_64