etcd timeout errors on DinD setup using Prow #1922

Closed
ormergi opened this issue Nov 11, 2020 · 7 comments
@ormergi

ormergi commented Nov 11, 2020

What happened:
We use KinD in a DinD setup with Prow in our CI environment for end-to-end tests with KubeVirt,
and we occasionally encounter 'etcdserver: request timed out' errors when we try to create
a cluster object (e.g. a CSR, a Secret, or a KubeVirt VM).

We tried what is suggested in issue #717, which also recommends increasing fs.inotify.max_user_watches, but it did not work for us and we still see those errors.

I understand this is probably not a very actionable bug report;
we are trying to understand the root cause of this.

What you expected to happen:
No etcd timeout errors.

How to reproduce it (as minimally and precisely as possible):
It is pretty hard to reproduce manually, but we see it happen in about 50% of the Prow jobs.

Anything else we need to know?:

Environment:

  • kind version: (use kind version): kind v0.7.0 go1.13.6 linux/amd64
  • Kubernetes version: (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2020-01-14T00:09:19Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
    We use DinD setup:
    Docker version on the host:
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 66aedde759f33c190954815fb765eedc1d782dd9 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  WARNING: You're not using the default seccomp profile
  Profile: /etc/docker/seccomp.json
 selinux
Kernel Version: 3.10.0-1127.19.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 16
Total Memory: 110 GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Registries: docker.io (secure)

Docker version on prow job pod:

Server Version: 18.09.6
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-1127.19.1.el7.x86_64
Operating System: Debian GNU/Linux 9 (stretch) (containerized)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 110GiB
Name: e902832b-23f7-11eb-863d-0a580a830d37
ID: OKCK:VVFR:JZS4:3X2Q:56C7:Z76F:23LS:2P64:R4E2:YAHD:NO5F:J5BQ
Docker Root Dir: /docker-graph
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 docker-mirror.kubevirt-prow.svc:5000
 127.0.0.0/8
Registry Mirrors:
 http://docker-mirror.kubevirt-prow.svc:5000/
Live Restore Enabled: false
Product License: Community Engine

WARNING: bridge-nf-call-iptables is disabled
  • OS (e.g. from /etc/os-release):
    Host: CentOS 7, kernel 3.10.0-1127.19.1.el7.x86_64
@ormergi ormergi added the kind/bug Categorizes issue or PR as related to a bug. label Nov 11, 2020
@oshoval

oshoval commented Nov 11, 2020

Please see
kubernetes/kubernetes#70082 (comment)
and
kubevirt/kubevirt#4519 (comment),
which is the corresponding issue in our repo.

I suspect our disk is too slow for etcd;
some of the comments in #717 also suggest checking the etcd metrics.
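
One common way to sanity-check whether a disk is fast enough for etcd is to measure fdatasync latency with fio; the usual guidance is that the 99th percentile of WAL fsync latency should stay below roughly 10ms. A minimal sketch, assuming fio is installed and the target directory sits on the filesystem that backs etcd's data (the path here is just an example):

mkdir -p /var/lib/etcd-disk-check
# write small blocks with an fdatasync after each write, mimicking etcd's WAL pattern
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-disk-check --size=22m --bs=2300 \
    --name=etcd-disk-check

The fsync/fdatasync percentiles in fio's output are the numbers to compare against etcd's own disk-latency metrics.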

@BenTheElder BenTheElder added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Nov 12, 2020
@BenTheElder
Member

you can try the hacks outlined in #845

you should also check #303 as a general rule when trying to do kind in kubernetes, but disk speed is probably just your host. etcd is pretty I/O bound.

@BenTheElder BenTheElder changed the title etcd timeout errors on DinD setups using Prow etcd timeout errors on DinD setup using Prow Nov 13, 2020
@BenTheElder
Member

Note that kind's own CI runs in DIND on Prow. We run on fast GCE PD SSD nodes though (because all of CI does, for better build performance etc.).

Based on related issues: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean

Even if your disk is itself fast enough for etcd, it may not be fast enough if you run N binpacked kind clusters.
There's not a lot actionable for us here, we have an existing issue about providing in-memory as an option (with associated tradeoffs, see previous comment links).

Feel free to continue discussing, but we'll use the other issue to continue tracking "faster but less safe etcd for CI".

@BenTheElder BenTheElder self-assigned this Nov 13, 2020
@ormergi
Author

ormergi commented Nov 15, 2020

Hi Ben,
Thanks so much for responding :)

The disks on our CI nodes are healthy, though they are not SSDs.
Is it possible to fetch etcd metrics directly from the etcd pods so we can get more details?

you can try the hacks outlined in #845

This is great! We are definitely going to try that.
So if I understand correctly, we need to patch kind-config.yaml and mount an in-memory emptyDir in the Prow job pod YAML?
Is there anything we need to configure on the host side?

you should also check #303 as a general rule when trying to do kind in kubernetes, but disk speed is probably just your host. etcd is pretty I/O bound.

Yep, we don't nest kind clusters; everything runs on top of an OpenShift cluster using Prow.

Note that kind's own CI runs in DIND on Prow. We run on fast GCE PD SSD nodes though (because all of CI does, for better build performance etc.).

Is there a kind e2e job that runs with etcd in memory?

Based on related issues: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean

Even if your disk is itself fast enough for etcd, it may not be fast enough if you run N binpacked kind clusters.

Does it mean that even if we use SSDs on our CI nodes we could still encounter those errors,
because we run the DinD pod inside a Kubernetes cluster (OpenShift in our case)?

There's not a lot actionable for us here, we have an existing issue about providing in-memory as an option (with associated tradeoffs, see previous comment links).

Feel free to continue discussing, but we'll use the other issue to continue tracking "faster but less safe etcd for CI".

@BenTheElder
Member

The disks on our CI nodes are healthy, though they are not SSDs.

Healthy or not, they may not have enough IOPS / throughput for N clusters / builds / ... at once. This is a fairly common problem with Prow compared to a build environment where you have one job per machine (e.g. one Jenkins runner per VM): when you bin-pack jobs/pods you can allocate for CPU, RAM, and disk space, but not for I/O.

Is it possible to fetch etcd metrics directly from the etcd pods so we can get more details?

You should be able to curl the metrics endpoint I think.
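
A minimal sketch of what that could look like (the node name assumes the default cluster name "kind", the cert paths assume kubeadm's standard etcd layout inside the node, and curl being present in the node image is an assumption):

# scrape etcd's metrics endpoint from inside the control-plane node container
docker exec kind-control-plane sh -c \
  'curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/server.crt \
        --key /etc/kubernetes/pki/etcd/server.key \
        https://127.0.0.1:2379/metrics' \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'

etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds are the histograms that usually reveal whether the disk is keeping up.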

So if I understand correctly, we need to patch kind-config.yaml and mount an in-memory emptyDir in the Prow job pod YAML?

That, or manage a tmpfs in your script (in place of the emptyDir mount). There should be samples discussed in the issue.
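
For reference, a minimal sketch of one variant of that hack (essentially what the kubevirtci change referenced later in this issue does): a kubeadm ClusterConfiguration patch in the kind config that points etcd's data directory at /tmp inside the node, which is tmpfs-backed. The file name and dataDir path below are illustrative, and the v1alpha4 config assumes kind v0.7.0 or newer:

cat <<EOF > kind-etcd-in-memory.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      # /tmp inside the kind node is a tmpfs, so etcd's WAL and backend live in RAM
      dataDir: /tmp/kind-cluster-etcd
EOF
kind create cluster --config kind-etcd-in-memory.yaml

The obvious tradeoff is that etcd data does not survive a node restart, which is usually acceptable for CI.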

Yep, we don't nest kind clusters; everything runs on top of an OpenShift cluster using Prow.

You're nesting kind within OpenShift (which is Kubernetes) though, which has the problems described in #303.

Is there a kind e2e job that runs with etcd in memory?

No, not currently. A lot of kind jobs effectively run one to a machine though, because they request > N/2 CPUs (for Kubernetes build purposes, not needed by kind itself).

Does it mean that even if we use SSDs on our CI nodes we could still encounter those errors,
because we run the DinD pod inside a Kubernetes cluster (OpenShift in our case)?

Well, even on the fastest disk in the world if you try to run ♾️ etcd clusters on one disk you're going to run out of I/O eventually.
Moving to tmpfs/memory shifts the problem around but bandwidth / I/O isn't unlimited there either.

DinD isn't related; it's just the N-clusters-per-disk issue.

@BenTheElder
Member

Should add: if in-memory etcd is the successful route for you, please let us know. We're considering what a native feature for this and other "no persistence but faster" storage hacks might look like.

ormergi added a commit to ormergi/kubevirtci that referenced this issue Nov 17, 2020
Currently we encounter bad performance of etcd on the
sriov provider cluster in the DinD setup.
We get 'etcdserver: request timed out' errors that cause jobs to
fail often.

In such cases it is recommended [1] to use in-memory etcd.
Running etcd in memory should improve performance and
make the sriov provider more stable.

[1] kubernetes-sigs/kind#1922

Signed-off-by: Or Mergi <[email protected]>
ormergi added a commit to ormergi/kubevirtci that referenced this issue Nov 23, 2020
Currently we encounter bad performance of etcd on the
sriov provider cluster in the DinD setup.
We get 'etcdserver: request timed out' errors that cause jobs to
fail often.
In cases where etcd has bad performance and the data does not need
to be persistent (e.g. in CI and dev environments) it is recommended [1]
to use in-memory etcd.

To do that this commit:
- Adds a kubeadm ClusterConfiguration to the kind config, setting the
  etcd data directory to '/tmp/...' inside the kind cluster nodes.
  The '/tmp/' directory is already mounted in RAM as tmpfs.

- 'KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY', expected values: "true", "false";
  controls running etcd in memory on kind providers.

- 'ETD_DATA_DIR', expects a directory path that is mounted in RAM;
  controls the path of the etcd data directory inside the kind cluster nodes.

Running etcd in memory should improve performance and
will stabilize the sriov provider and lanes.

[1] kubernetes-sigs/kind#1922

Signed-off-by: Or Mergi <[email protected]>
kubevirt-bot pushed a commit to kubevirt/kubevirtci that referenced this issue Nov 23, 2020
* kind common, untangle kind config yaml preparation

On kind cluster creation we pass a yaml file that represents
the cluster we want to create.
The file is located at kind/manifests/kind.yaml; we
make a copy, apply the changes, and pass it to the kind binary.
In the current state it is not simple to add more
changes to the kind config yaml file, and it is hard to predict
how the file will look at the end.

This commit's intent is to untangle the way we change kind-config.yaml
by doing all the changes from one place, making it easier to maintain.

- '_prepare_kind_config()'
  Handles the logic for preparing the kind config yaml file
  and prints the final config for better visibility.
- '_add_workers()'
  Iterates '$KUBEVIRT_NUM_NODES-1' times and on each iteration
  appends all necessary configuration for a worker.
- '_add_worker_kubeadm_config_patch'
  Appends a kubeadmConfigPatch for a worker node.

Signed-off-by: Or Mergi <[email protected]>

* sriov provider.sh, move kind config changes to kind/common.sh

Make use of 'kind_up'

Signed-off-by: Or Mergi <[email protected]>

* kind infra, run etcd in memory

Currently we encounter bad performance of etcd on the
sriov provider cluster in the DinD setup.
We get 'etcdserver: request timed out' errors that cause jobs to
fail often.
In cases where etcd has bad performance and the data does not need
to be persistent (e.g. in CI and dev environments) it is recommended [1]
to use in-memory etcd.

To do that this commit:
- Adds a kubeadm ClusterConfiguration to the kind config, setting the
  etcd data directory to '/tmp/...' inside the kind cluster nodes.
  The '/tmp/' directory is already mounted in RAM as tmpfs.

- 'KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY', expected values: "true", "false";
  controls running etcd in memory on kind providers.

- 'ETD_DATA_DIR', expects a directory path that is mounted in RAM;
  controls the path of the etcd data directory inside the kind cluster nodes.

Running etcd in memory should improve performance and
will stabilize the sriov provider and lanes.

[1] kubernetes-sigs/kind#1922

Signed-off-by: Or Mergi <[email protected]>
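
Based on the variables described in that commit, enabling the in-memory etcd mode from a kubevirtci kind provider would look roughly like this (the provider name and the make target are assumptions about the kubevirtci workflow, not taken from this thread; only the two etcd-related variables come from the commit message above):

export KUBEVIRT_PROVIDER=kind-k8s-sriov-1.17.0    # assumed provider name
export KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY=true
export ETD_DATA_DIR=/tmp/kind-cluster-etcd        # must be a tmpfs-backed path inside the nodes
make cluster-up                                   # assumed kubevirtci entry point
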
@ormergi
Author

ormergi commented Dec 14, 2020

Hi @BenTheElder
After running our kind setup on CI with etcd in memory, I can confirm that there is a significant improvement in overall cluster performance and in the time it takes us to create the cluster; etcd is healthy and we no longer see the '...operation took too long' warnings in the logs.
Please let me know if I can help somehow with the native feature for this that you mentioned.

I would like to thank you; I appreciate the support and the quick response! 😁 🚀
