This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

openshift: tracker for etcd timeout issues #2918

Closed
jim-minter opened this issue May 10, 2018 · 10 comments

Comments

@jim-minter
Member

No description provided.

@jim-minter
Member Author

@Kargakis @pweil-

@jim-minter
Member Author

https://circleci.com/gh/Azure/acs-engine/25995

TASK [openshift_hosted : Create OpenShift router] ******************************
failed: [localhost] (item={u'name': u'router', u'certificate': {u'certfile': u'/etc/origin//master/openshift-router.crt', u'keyfile': u'/etc/origin//master/openshift-router.key', u'cafile': u'/etc/origin//master/ca.crt'}, u'replicas': u'1', u'serviceaccount': u'router', u'namespace': u'default', u'stats_port': 1936, u'edits': [{u'action': u'put', u'key': u'spec.strategy.rollingParams.intervalSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.rollingParams.updatePeriodSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.activeDeadlineSeconds', u'value': 21600}], u'images': u'registry.access.redhat.com/openshift3/ose-${component}:${version}', u'selector': u'region=infra', u'ports': [u'80:80', u'443:443']}) => {\"changed\": false, \"item\": {\"certificate\": {\"cafile\": \"/etc/origin//master/ca.crt\", \"certfile\": \"/etc/origin//master/openshift-router.crt\", \"keyfile\": \"/etc/origin//master/openshift-router.key\"}, \"edits\": [{\"action\": \"put\", \"key\": \"spec.strategy.rollingParams.intervalSeconds\", \"value\": 1}, {\"action\": \"put\", \"key\": \"spec.strategy.rollingParams.updatePeriodSeconds\", \"value\": 1}, {\"action\": \"put\", \"key\": \"spec.strategy.activeDeadlineSeconds\", \"value\": 21600}], \"images\": \"registry.access.redhat.com/openshift3/ose-${component}:${version}\", \"name\": \"router\", \"namespace\": \"default\", \"ports\": [\"80:80\", \"443:443\"], \"replicas\": \"1\", \"selector\": \"region=infra\", \"serviceaccount\": \"router\", \"stats_port\": 1936}, \"msg\": {\"results\": [{\"cmd\": \"/usr/bin/oc create -f /tmp/Secrets835J8 -n default\", \"results\": {}, \"returncode\": 0}, {\"cmd\": \"/usr/bin/oc create -f /tmp/ClusterRoleBindingMGVKFR -n default\", \"results\": {}, \"returncode\": 1, \"stderr\": \"Error from server: error when creating \\\"/tmp/ClusterRoleBindingMGVKFR\\\": etcdserver: request timed out\
\", \"stdout\": \"\"}, {\"cmd\": \"/usr/bin/oc create -f /tmp/ServicebKdroV -n default\", \"results\": {}, \"returncode\": 0}, {\"cmd\": \"/usr/bin/oc create -f /tmp/DeploymentConfigWnCCff -n default\", \"results\": {}, \"returncode\": 0}], \"returncode\": 1}}

@jim-minter
Member Author

@pweil- @Kargakis I suspect we're going to have to do something about this. Do you think increasing ETCD_ELECTION_TIMEOUT (currently 2.5s) might work as a stopgap?

@pweil-
Collaborator

pweil- commented May 15, 2018

@jim-minter I'm not sure how ETCD_ELECTION_TIMEOUT would help in this situation. Isn't it for leader election issues (coupled with heartbeat interval)?

It seems like we need to wait for a better ready state or have retry mechanisms in place. Since this looks like it's in Ansible I'm thinking ready state in the extensions is maybe a better route.

Maybe we can poach os::cmd::internal::run_until_exit_code?
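
For illustration, the retry idea could look roughly like the sketch below. It is a Python approximation of a "run until exit code" helper, not a port of `os::cmd::internal::run_until_exit_code`; the function signature, the default timeout and interval, and the file name in the usage comment are all assumptions made for this example.

```python
import subprocess
import time


def run_until_exit_code(cmd, expected=0, timeout=60.0, interval=0.2,
                        run=subprocess.run, sleep=time.sleep,
                        clock=time.monotonic):
    """Re-run `cmd` until it exits with `expected` or `timeout` seconds pass.

    Returns the result of the last attempt; the caller inspects .returncode.
    `run`, `sleep` and `clock` are injectable so the loop can be tested
    without a real cluster.
    """
    deadline = clock() + timeout
    while True:
        result = run(cmd, capture_output=True, text=True)
        # Stop on success, or once the deadline has passed (last failure is returned).
        if result.returncode == expected or clock() >= deadline:
            return result
        sleep(interval)


# Hypothetical usage (the manifest path is a placeholder, not taken from the log):
#   result = run_until_exit_code(
#       ["oc", "create", "-f", "clusterrolebinding.yaml", "-n", "default"])
#   if result.returncode != 0:
#       raise RuntimeError(result.stderr)
```

Note that blindly retrying `oc create` can turn a timed-out but actually committed write into an "already exists" failure on the next attempt, so a real implementation would probably need to treat that case as success or use an idempotent command instead.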

@jim-minter
Member Author

See https://github.com/coreos/etcd/blob/master/Documentation/faq.md#why-does-etcd-lose-its-leader-from-disk-latency-spikes and https://coreos.com/etcd/docs/latest/tuning.html

I'm going to experiment with ETCD_ELECTION_TIMEOUT and see if that helps us as a starting point.
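
As a rough aid for that experiment, the sketch below checks a heartbeat/election pair against the rules of thumb in the etcd tuning guide as I understand them: the election timeout should be at least 10 times the member round-trip time, and it has an upper limit of 50000 ms. The function names and the `rtt_ms` input are assumptions for this example; `ETCD_HEARTBEAT_INTERVAL` and `ETCD_ELECTION_TIMEOUT` are read in milliseconds, falling back to etcd's defaults of 100 and 1000.

```python
import os

MAX_ELECTION_TIMEOUT_MS = 50000  # upper limit given in the etcd tuning guide


def timing_from_env(env=os.environ):
    """Read (heartbeat_ms, election_ms) from etcd-style environment variables."""
    heartbeat_ms = int(env.get("ETCD_HEARTBEAT_INTERVAL", "100"))
    election_ms = int(env.get("ETCD_ELECTION_TIMEOUT", "1000"))
    return heartbeat_ms, election_ms


def check_etcd_timing(heartbeat_ms, election_ms, rtt_ms):
    """Return a list of warnings for the given timing values (all in ms)."""
    warnings = []
    if heartbeat_ms >= election_ms:
        warnings.append("heartbeat interval is not below the election timeout")
    if election_ms < 10 * rtt_ms:
        warnings.append("election timeout is below 10x the round-trip time")
    if election_ms > MAX_ELECTION_TIMEOUT_MS:
        warnings.append("election timeout exceeds the 50000 ms upper limit")
    return warnings


# Hypothetical usage: rtt_ms would come from measuring latency between members.
#   heartbeat_ms, election_ms = timing_from_env()
#   for warning in check_etcd_timing(heartbeat_ms, election_ms, rtt_ms=10):
#       print(warning)
```

A check like this only validates the configuration; it says nothing about whether the timeouts in the log above are caused by leader elections in the first place.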

@pweil-
Collaborator

pweil- commented May 16, 2018

You suspect disk latency? Let's see what happens when you set it, but it seems we're going to want to try and figure out the actual cause.

@jim-minter
Member Author

openshift/origin#16248 is relevant but sadly not conclusive.
#2996 will record etcd logs for follow-up.

@0xmichalis
Contributor

/kind flake

@0xmichalis
Contributor

Hrm, we should probably switch to use SSD anyway

https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md#disks
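
To get a first impression of whether disk latency is plausible as a cause, a crude fsync probe along the following lines could be run on the etcd host. This is only a sketch: the function name, sample count, and payload size are assumptions, and it does not reproduce etcd's actual write pattern. If I recall the etcd documentation correctly, it suggests the 99th percentile of WAL fsync duration should stay under about 10 ms.

```python
import os
import tempfile
import time


def fsync_latencies_ms(directory, samples=20, payload=b"x" * 4096):
    """Append `payload` and fsync `samples` times; return per-call latency in ms."""
    latencies = []
    # The temporary file lives in `directory` so the probe hits the same disk.
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            f.write(payload)
            f.flush()  # hand Python's buffer to the OS before timing fsync
            start = time.perf_counter()
            os.fsync(f.fileno())
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Hypothetical usage (the data directory path is an assumption):
#   lat = sorted(fsync_latencies_ms("/var/lib/etcd"))
#   print("max fsync latency: %.2f ms" % lat[-1])
```

etcd's own `wal_fsync_duration_seconds` metric is a better signal than a synthetic probe when it is available.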


5 participants