etcdserver: request timed out, possibly due to connection lost #1059

Closed
wking opened this issue Jan 12, 2019 · 3 comments

wking (Member) commented Jan 12, 2019

Version

$ openshift-install version
openshift-install v0.9.1

Platform (aws|libvirt|openstack):

All.

What happened?

In an e2e-aws run mentioned here:

fail [k8s.io/kubernetes/test/e2e/storage/persistent_volumes-local.go:248]: Expected error:
    <*errors.errorString | 0xc4212bc710>: {
        s: "pod Create API error: etcdserver: request timed out, possibly due to connection lost",
    }
    pod Create API error: etcdserver: request timed out, possibly due to connection lost
not to have occurred

What did you expect to happen?

No errors due to etcd delays.

How to reproduce it (as minimally and precisely as possible)?

There have been a lot of these in CI recently, although I'm not sure what would have changed. AWS has had a number of performance issues for us today, though, including slow resource generation. Maybe our CI disks are just running slower than usual or something?
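
If the disks really are the bottleneck, an fdatasync latency probe on one of the masters would tell us quickly. Something like the following (the fio invocation, path, and sizes are just a sketch based on etcd's general disk guidance, not anything we currently run in CI):

# Hypothetical probe of fdatasync latency on an etcd data disk; path and sizes are illustrative.
# etcd is sensitive to fdatasync latency, and a slow p99 here would line up with the wal sync warnings below.
$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check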

Anything else we need to know?

Details of a similar issue in the etcd logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1054/pull-ci-openshift-installer-master-e2e-aws/2824/artifacts/e2e-aws/pods/kube-system_etcd-member-ip-10-0-13-114.ec2.internal_etcd-member.log.gz | gunzip | grep -B2 -A3 'etcdserver: request timed out' | head -n 9
2019-01-12 01:35:58.954371 I | raft: raft.node: 1b29101e3d7dd22a lost leader bd31e70ef4e40f8b at term 13
2019-01-12 01:35:59.812308 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2019-01-12 01:35:59.812418 W | etcdserver: read-only range request "key:\"/openshift.io/podtemplates\" range_end:\"/openshift.io/podtemplatet\" count_only:true " with result "error:etcdserver: request timed out" took too long (7.336055856s) to execute
2019-01-12 01:35:59.812518 W | etcdserver: read-only range request "key:\"/openshift.io/services/endpoints/kube-system/kube-scheduler\" " with result "error:etcdserver: request timed out" took too long (8.292539027s) to execute
2019-01-12 01:35:59.812576 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082020841s) to execute
2019-01-12 01:35:59.812635 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082939895s) to execute
2019-01-12 01:36:02.056897 I | raft: 1b29101e3d7dd22a [term: 13] ignored a MsgReadIndexResp message with lower term from bd31e70ef4e40f8b [term: 12]
2019-01-12 01:36:02.554105 W | wal: sync duration of 3.599668164s, expected less than 1s
2019-01-12 01:36:03.654282 I | raft: 1b29101e3d7dd22a is starting a new election at term 13
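
To get a rough sense of how widespread this was within the run, the two warning types can be counted in the same artifact (same URL as above, only the grep changes):

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1054/pull-ci-openshift-installer-master-e2e-aws/2824/artifacts/e2e-aws/pods/kube-system_etcd-member-ip-10-0-13-114.ec2.internal_etcd-member.log.gz | gunzip | grep -c 'etcdserver: request timed out'
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1054/pull-ci-openshift-installer-master-e2e-aws/2824/artifacts/e2e-aws/pods/kube-system_etcd-member-ip-10-0-13-114.ec2.internal_etcd-member.log.gz | gunzip | grep -c 'wal: sync duration'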

This seems similar to etcd-io/etcd#9464, which talks about election ticks and pre-voting as potential fixes, and about bumping to 3.4 to get them. Are there plans for bumping the elderly 3.1.14 we use for bootstrap health checks? Or the more respectable 3.3.10 the machine-config operator suggests for the masters? I guess we'd have to bump to 3.4 for pre-voting, since 3.3.10 already contains the backported-to-3.3.x etcd-io/etcd@3282d9070 (which landed in 3.3.3). Or maybe the problem is something else entirely :p.
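
For reference, the election-timing knobs that issue discusses map onto existing etcd member flags, so a stopgap (if we wanted one) would be loosening the timing rather than waiting on pre-vote. A rough sketch with illustrative values, not a recommendation:

# Defaults are --heartbeat-interval=100 and --election-timeout=1000 (milliseconds).
# Raising both makes the leader less likely to be deposed during a brief disk or network stall,
# at the cost of slower detection of a genuinely dead leader.
$ etcd --heartbeat-interval=500 --election-timeout=5000 <other member flags unchanged>
# Pre-vote itself is expected to land as an etcd 3.4 flag, so it isn't an option on 3.1.x/3.3.x.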

As a minor pivot, it seems safe enough for us to move up to 3.3.10 to catch up with openshift/machine-config-operator@59f809676.

/kind bug

openshift-ci-robot added the kind/bug label Jan 12, 2019
wking (Member, Author) commented Jan 12, 2019

Looks like 3.4 with pre-voting is still off in the future? So that leaves "don't let the masters flake out and trigger so many elections"?

eparis (Member) commented Feb 20, 2019

I'm going to close this issue, as it seems likely to be etcd, not the installer. If we start running into trouble with this again, let's open a BZ against the etcd component.

eparis closed this as completed Feb 20, 2019
khanthecomputerguy commented

If you are still having this issue, check out this video. I was able to resolve the issue with these instructions.
https://www.youtube.com/watch?v=EjTzIokJPcI

thunderboltsid added a commit to thunderboltsid/installer that referenced this issue Jan 14, 2022
This is a stopgap solution until openshift is able to merge the API PR openshift/api#1059.
thunderboltsid added a commit to nutanix-cloud-native/openshift-installer that referenced this issue Feb 1, 2022
This is a stopgap solution until openshift is able to merge the API PR openshift/api#1059.