etcd timeout errors on DinD setup using Prow #1922

Closed
ormergi opened this issue Nov 11, 2020 · 7 comments
@ormergi

ormergi commented Nov 11, 2020

What happened:
We use KinD in a DinD setup with Prow in our CI environment for end-to-end tests with KubeVirt,
and we occasionally encounter 'etcdserver: request timed out' errors when we try to create
a cluster object (e.g. a CSR, a Secret, or a KubeVirt VM).

We tried what is suggested in issue #717, which also recommends increasing fs.inotify.max_user_watches, but it did not work for us and we still see those errors.

I understand this is probably not a very actionable bug report;
we are trying to understand the root cause of this.

What you expected to happen:
No etcd timeout errors.

How to reproduce it (as minimally and precisely as possible):
It is pretty hard to reproduce manually, but we see it happen in about 50% of the Prow jobs.

Anything else we need to know?:

Environment:

  • kind version: (use kind version): kind v0.7.0 go1.13.6 linux/amd64
  • Kubernetes version: (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2020-01-14T00:09:19Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
    We use DinD setup:
    Docker version on the host:
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 66aedde759f33c190954815fb765eedc1d782dd9 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  WARNING: You're not using the default seccomp profile
  Profile: /etc/docker/seccomp.json
 selinux
Kernel Version: 3.10.0-1127.19.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 16
Total Memory: 110 GiB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-ip6tables is disabled
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Registries: docker.io (secure)

Docker version on prow job pod:

Server Version: 18.09.6
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-1127.19.1.el7.x86_64
Operating System: Debian GNU/Linux 9 (stretch) (containerized)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 110GiB
Name: e902832b-23f7-11eb-863d-0a580a830d37
ID: OKCK:VVFR:JZS4:3X2Q:56C7:Z76F:23LS:2P64:R4E2:YAHD:NO5F:J5BQ
Docker Root Dir: /docker-graph
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 docker-mirror.kubevirt-prow.svc:5000
 127.0.0.0/8
Registry Mirrors:
 http://docker-mirror.kubevirt-prow.svc:5000/
Live Restore Enabled: false
Product License: Community Engine

WARNING: bridge-nf-call-iptables is disabled
  • OS (e.g. from /etc/os-release):
    Host: CentOS 7, kernel 3.10.0-1127.19.1.el7.x86_64
@ormergi ormergi added the kind/bug Categorizes issue or PR as related to a bug. label Nov 11, 2020
@oshoval

oshoval commented Nov 11, 2020

Please see
kubernetes/kubernetes#70082 (comment)
and
kubevirt/kubevirt#4519 (comment),
which is the corresponding issue in our repo.

I suspect our disk is too slow for etcd;
some of the comments in #717 also suggest checking the etcd metrics.
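
One common way to sanity-check whether a disk is fast enough for etcd is to measure fdatasync latency with fio; the usual guidance is that the 99th percentile of WAL fsync latency should stay below roughly 10ms. A minimal sketch, assuming fio is installed and the target directory sits on the filesystem that backs etcd's data (the path here is just an example):

mkdir -p /var/lib/etcd-disk-check
# write small blocks with an fdatasync after each write, mimicking etcd's WAL pattern
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-disk-check --size=22m --bs=2300 \
    --name=etcd-disk-check

The fsync/fdatasync percentiles in fio's output are the numbers to compare against etcd's own disk-latency metrics.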

@BenTheElder BenTheElder added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Nov 12, 2020
@BenTheElder
Member

you can try the hacks outlined in #845

you should also check #303 as a general rule when trying to do kind in kubernetes, but disk speed is probably just your host. etcd is pretty I/O bound.

@BenTheElder BenTheElder changed the title etcd timeout errors on DinD setups using Prow etcd timeout errors on DinD setup using Prow Nov 13, 2020
@BenTheElder
Member

Note that kind's own CI runs in DIND on Prow. We run on fast GCE PD SSD nodes though (because all of CI does, for better build performance etc.).

Based on related issues: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean

Even if your disk is itself fast enough for etcd, it may not be fast enough if you run N binpacked kind clusters.
There's not a lot actionable for us here, we have an existing issue about providing in-memory as an option (with associated tradeoffs, see previous comment links).

Feel free to continue discussing, but we'll use the other issue to continue tracking "faster but less safe etcd for CI".

@BenTheElder BenTheElder self-assigned this Nov 13, 2020
@ormergi
Author

ormergi commented Nov 15, 2020

Hi Ben,
Thanks so much for responding :)

The disks on our CI nodes are healthy, though they are not SSDs.
Is it possible to fetch etcd metrics directly from the etcd pods so we can get more details?

you can try the hacks outlined in #845

This is great! We are definitely going to try that.
So if I understand correctly, we need to patch kind-config.yaml and mount an in-memory emptyDir in the Prow job pod YAML?
Is there anything we need to configure on the host side?

you should also check #303 as a general rule when trying to do kind in kubernetes, but disk speed is probably just your host. etcd is pretty I/O bound.

Yep, we don't nest kind clusters; everything runs on top of an OpenShift cluster using Prow.

Note that kind's own CI runs in DIND on Prow. We run on fast GCE PD SSD nodes though (because all of CI does, for better build performance etc.).

Is there a kind e2e job that runs with etcd in memory?

Based on related issues: https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean

Even if your disk is itself fast enough for etcd, it may not be fast enough if you run N binpacked kind clusters.

Does it mean that even if we use SSDs on our CI nodes we could still encounter those errors,
because we run the DinD pod inside a Kubernetes cluster (OpenShift in our case)?

There's not a lot actionable for us here, we have an existing issue about providing in-memory as an option (with associated tradeoffs, see previous comment links).

Feel free to continue discussing, but we'll use the other issue to continue tracking "faster but less safe etcd for CI".

@BenTheElder
Member

The disks on our CI nodes are healthy, though they are not SSDs.

Healthy or not, they may not have enough IOPS / throughput for N clusters / builds / ... at once. This is a fairly common problem with Prow compared to a build environment where you have one job per machine (e.g. one Jenkins runner per VM): when you bin-pack jobs/pods you can allocate for CPU, RAM, and disk space, but not for I/O.

Is it possible to fetch etcd metrics directly from the etcd pods so we can get more details?

You should be able to curl the metrics endpoint I think.
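
A minimal sketch of what that could look like (the node name assumes the default cluster name "kind", the cert paths assume kubeadm's standard etcd layout inside the node, and curl being present in the node image is an assumption):

# scrape etcd's metrics endpoint from inside the control-plane node container
docker exec kind-control-plane sh -c \
  'curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --cert /etc/kubernetes/pki/etcd/server.crt \
        --key /etc/kubernetes/pki/etcd/server.key \
        https://127.0.0.1:2379/metrics' \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'

etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds are the histograms that usually reveal whether the disk is keeping up.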

So if I understand correctly, we need to patch kind-config.yaml and mount an in-memory emptyDir in the Prow job pod YAML?

That, or manage a tmpfs in your script (in place of the emptyDir mount). There should be samples discussed in the issue.
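
For reference, a minimal sketch of one variant of that hack (essentially what the kubevirtci change referenced later in this issue does): a kubeadm ClusterConfiguration patch in the kind config that points etcd's data directory at /tmp inside the node, which is tmpfs-backed. The file name and dataDir path below are illustrative, and the v1alpha4 config assumes kind v0.7.0 or newer:

cat <<EOF > kind-etcd-in-memory.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      # /tmp inside the kind node is a tmpfs, so etcd's WAL and backend live in RAM
      dataDir: /tmp/kind-cluster-etcd
EOF
kind create cluster --config kind-etcd-in-memory.yaml

The obvious tradeoff is that etcd data does not survive a node restart, which is usually acceptable for CI.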

Yep, we don't nest kind clusters; everything runs on top of an OpenShift cluster using Prow.

You're nesting kind within OpenShift (which is Kubernetes) though, which has the problems described in #303.

Is there a kind e2e job that runs with etcd in memory?

No, not currently. A lot of kind jobs effectively run one to a machine though, because they request > N/2 CPUs (for Kubernetes build purposes, not needed by kind itself).

Does it mean that even if we use SSDs on our CI nodes we could still encounter those errors,
because we run the DinD pod inside a Kubernetes cluster (OpenShift in our case)?

Well, even on the fastest disk in the world if you try to run ♾️ etcd clusters on one disk you're going to run out of I/O eventually.
Moving to tmpfs/memory shifts the problem around but bandwidth / I/O isn't unlimited there either.

DinD isn't related; it's just the N-clusters-per-disk issue.

@BenTheElder
Member

Should add: if in-memory etcd is the successful route for you, please let us know. We're considering what a native feature for this and other "no persistence but faster" storage hacks might look like.

ormergi added a commit to ormergi/kubevirtci that referenced this issue Nov 17, 2020
Currently we encounter bad performance of etcd on the
sriov provider cluster in the DinD setup.
We get 'etcdserver: request timed out' errors that cause jobs to
fail often.

In such cases it is recommended [1] to use in-memory etcd.
Running etcd in memory should improve performance and
make the sriov provider more stable.

[1] kubernetes-sigs/kind#1922

Signed-off-by: Or Mergi <[email protected]>
ormergi added a commit to ormergi/kubevirtci that referenced this issue Nov 23, 2020
Currently we encounter bad performance of etcd on the
sriov provider cluster in the DinD setup.
We get 'etcdserver: request timed out' errors that cause jobs to
fail often.
In cases where etcd has bad performance and the data does not need
to be persistent (e.g. in CI and dev environments) it is recommended [1]
to use in-memory etcd.

To do that this commit:
- Adds a kubeadm ClusterConfiguration to the kind config, setting the
  etcd data directory to '/tmp/...' inside the kind cluster nodes.
  The '/tmp/' directory is already mounted in RAM as tmpfs.

- 'KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY', expected values: "true", "false";
  controls running etcd in memory on kind providers.

- 'ETD_DATA_DIR', expects a directory path that is mounted in RAM;
  controls the path of the etcd data directory inside the kind cluster nodes.

Running etcd in memory should improve performance and
will stabilize the sriov provider and lanes.

[1] kubernetes-sigs/kind#1922

Signed-off-by: Or Mergi <[email protected]>
kubevirt-bot pushed a commit to kubevirt/kubevirtci that referenced this issue Nov 23, 2020
* kind common, untangle kind config yaml preparation

On kind cluster creation we pass a yaml file that represents
the cluster we want to create.
The file is located at kind/manifests/kind.yaml; we
make a copy, apply the changes, and pass it to the kind binary.
In the current state it is not simple to add more
changes to the kind config yaml file, and it is hard to predict
how the file will look at the end.

This commit's intent is to untangle the way we change kind-config.yaml
by doing all the changes from one place, making it easier to maintain.

- '_prepare_kind_config()'
  Handles the logic for preparing the kind config yaml file
  and prints the final config for better visibility.
- '_add_workers()'
  Iterates '$KUBEVIRT_NUM_NODES-1' times and on each iteration
  appends all necessary configuration for a worker.
- '_add_worker_kubeadm_config_patch'
  Appends a kubeadmConfigPatch for a worker node.

Signed-off-by: Or Mergi <[email protected]>

* sriov provider.sh, move kind config changes to kind/common.sh

Make use of 'kind_up'

Signed-off-by: Or Mergi <[email protected]>

* kind infra, run etcd in memory

Currently we encounter bad performance of etcd on the
sriov provider cluster in the DinD setup.
We get 'etcdserver: request timed out' errors that cause jobs to
fail often.
In cases where etcd has bad performance and the data does not need
to be persistent (e.g. in CI and dev environments) it is recommended [1]
to use in-memory etcd.

To do that this commit:
- Adds a kubeadm ClusterConfiguration to the kind config, setting the
  etcd data directory to '/tmp/...' inside the kind cluster nodes.
  The '/tmp/' directory is already mounted in RAM as tmpfs.

- 'KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY', expected values: "true", "false";
  controls running etcd in memory on kind providers.

- 'ETD_DATA_DIR', expects a directory path that is mounted in RAM;
  controls the path of the etcd data directory inside the kind cluster nodes.

Running etcd in memory should improve performance and
will stabilize the sriov provider and lanes.

[1] kubernetes-sigs/kind#1922

Signed-off-by: Or Mergi <[email protected]>
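
Based on the variables described in that commit, enabling the in-memory etcd mode from a kubevirtci kind provider would look roughly like this (the provider name and the make target are assumptions about the kubevirtci workflow, not taken from this thread; only the two etcd-related variables come from the commit message above):

export KUBEVIRT_PROVIDER=kind-k8s-sriov-1.17.0    # assumed provider name
export KUBEVIRT_WITH_KIND_ETCD_IN_MEMORY=true
export ETD_DATA_DIR=/tmp/kind-cluster-etcd        # must be a tmpfs-backed path inside the nodes
make cluster-up                                   # assumed kubevirtci entry point
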
@ormergi
Author

ormergi commented Dec 14, 2020

Hi @BenTheElder
After running our kind setup on CI with etcd in memory, I can confirm that there is a significant improvement in overall cluster performance and in the time it takes us to create the cluster; etcd is healthy and we no longer see the '...operation took too long' warnings in the logs.
Please let me know if I can help somehow with the native feature for this that you mentioned.

I would like to thank you; I appreciate the support and the quick response! 😁 🚀
