openshift: tracker for etcd timeout issues #2918

jim-minter · 2018-05-10T20:42:43Z

No description provided.

jim-minter · 2018-05-10T20:42:50Z

jim-minter · 2018-05-10T20:44:27Z

https://circleci.com/gh/Azure/acs-engine/25995

TASK [openshift_hosted : Create OpenShift router] ******************************
failed: [localhost] (item={u'name': u'router', u'certificate': {u'certfile': u'/etc/origin//master/openshift-router.crt', u'keyfile': u'/etc/origin//master/openshift-router.key', u'cafile': u'/etc/origin//master/ca.crt'}, u'replicas': u'1', u'serviceaccount': u'router', u'namespace': u'default', u'stats_port': 1936, u'edits': [{u'action': u'put', u'key': u'spec.strategy.rollingParams.intervalSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.rollingParams.updatePeriodSeconds', u'value': 1}, {u'action': u'put', u'key': u'spec.strategy.activeDeadlineSeconds', u'value': 21600}], u'images': u'registry.access.redhat.com/openshift3/ose-${component}:${version}', u'selector': u'region=infra', u'ports': [u'80:80', u'443:443']}) => {\"changed\": false, \"item\": {\"certificate\": {\"cafile\": \"/etc/origin//master/ca.crt\", \"certfile\": \"/etc/origin//master/openshift-router.crt\", \"keyfile\": \"/etc/origin//master/openshift-router.key\"}, \"edits\": [{\"action\": \"put\", \"key\": \"spec.strategy.rollingParams.intervalSeconds\", \"value\": 1}, {\"action\": \"put\", \"key\": \"spec.strategy.rollingParams.updatePeriodSeconds\", \"value\": 1}, {\"action\": \"put\", \"key\": \"spec.strategy.activeDeadlineSeconds\", \"value\": 21600}], \"images\": \"registry.access.redhat.com/openshift3/ose-${component}:${version}\", \"name\": \"router\", \"namespace\": \"default\", \"ports\": [\"80:80\", \"443:443\"], \"replicas\": \"1\", \"selector\": \"region=infra\", \"serviceaccount\": \"router\", \"stats_port\": 1936}, \"msg\": {\"results\": [{\"cmd\": \"/usr/bin/oc create -f /tmp/Secrets835J8 -n default\", \"results\": {}, \"returncode\": 0}, {\"cmd\": \"/usr/bin/oc create -f /tmp/ClusterRoleBindingMGVKFR -n default\", \"results\": {}, \"returncode\": 1, \"stderr\": \"Error from server: error when creating \\\"/tmp/ClusterRoleBindingMGVKFR\\\": etcdserver: request timed out\
\", \"stdout\": \"\"}, {\"cmd\": \"/usr/bin/oc create -f /tmp/ServicebKdroV -n default\", \"results\": {}, \"returncode\": 0}, {\"cmd\": \"/usr/bin/oc create -f /tmp/DeploymentConfigWnCCff -n default\", \"results\": {}, \"returncode\": 0}], \"returncode\": 1}}

jim-minter · 2018-05-14T17:48:24Z

https://circleci.com/gh/Azure/acs-engine/26340

jim-minter · 2018-05-14T23:48:12Z

@pweil- @Kargakis I suspect we're going to have to do something about this. Do you think increasing ETCD_ELECTION_TIMEOUT (currently 2.5s) might work as a stopgap?

pweil- · 2018-05-15T13:17:51Z

@jim-minter I'm not sure how ETCD_ELECTION_TIMEOUT would help in this situation. Isn't it for leader election issues (coupled with heartbeat interval)?

It seems like we need to wait for a better ready state or have retry mechanisms in place. Since this looks like it's in Ansible, I'm thinking a ready-state check in the extensions is maybe the better route.

Maybe we can poach os::cmd::internal::run_until_exit_code?
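For illustration, the retry idea could be sketched roughly as follows. This is a minimal Python sketch, not the actual `os::cmd::internal::run_until_exit_code` helper from origin: the function name is borrowed from the comment above, and the parameters (`want`, `timeout_s`, `interval_s`, and the injectable `run`, `sleep`, `clock` hooks) are hypothetical choices made for this example.

```python
import subprocess
import time
from typing import Callable, Sequence


def _default_run(cmd: Sequence[str]) -> int:
    # Run the command and return its exit code.
    return subprocess.run(list(cmd)).returncode


def run_until_exit_code(
    cmd: Sequence[str],
    want: int = 0,
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    run: Callable[[Sequence[str]], int] = _default_run,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Re-run cmd until it exits with `want` or the timeout would be exceeded.

    Hypothetical helper: returns True on success, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if run(cmd) == want:
            return True
        # Give up if the next attempt would start after the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

One caveat with this approach: retrying a create-style command is only safe if the retry tolerates the object already existing, since a request that timed out on the client side may still have been applied on the server.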

jim-minter · 2018-05-15T22:38:25Z

See https://github.com/coreos/etcd/blob/master/Documentation/faq.md#why-does-etcd-lose-its-leader-from-disk-latency-spikes and https://coreos.com/etcd/docs/latest/tuning.html

I'm going to experiment with ETCD_ELECTION_TIMEOUT and see if that helps us as a starting point.
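As a rough aid for that experiment, the sketch below checks a candidate heartbeat/election pair against the rules of thumb as I understand them from the etcd tuning guide and etcd's own config validation: election timeout at least 5x the heartbeat interval, at least 10x the peer round-trip time, and no more than 50000 ms. The function name `check_etcd_timing` is hypothetical, and apart from the 2500 ms election timeout mentioned earlier in the thread, the numbers in the usage example are assumptions, not measurements from this cluster.

```python
from typing import List


def check_etcd_timing(heartbeat_ms: int, election_ms: int, rtt_ms: float) -> List[str]:
    """Return a list of rule-of-thumb violations (empty if none).

    All values are in milliseconds, matching ETCD_HEARTBEAT_INTERVAL
    and ETCD_ELECTION_TIMEOUT.
    """
    problems = []
    if election_ms < 5 * heartbeat_ms:
        problems.append("election timeout is less than 5x the heartbeat interval")
    if election_ms < 10 * rtt_ms:
        problems.append("election timeout is less than 10x the round-trip time")
    if election_ms > 50000:
        problems.append("election timeout exceeds the 50000 ms upper limit")
    return problems


# Example with assumed values: 500 ms heartbeat, 2500 ms election, 10 ms RTT.
print(check_etcd_timing(500, 2500, 10.0))
```

Note that these checks only cover network timing. If the underlying problem is disk latency, as the linked FAQ entry discusses, a larger election timeout would at best mask it.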

pweil- · 2018-05-16T12:34:42Z

You suspect disk latency? Let's see what happens when you set it, but it seems we're going to want to try and figure out the actual cause.

jim-minter · 2018-05-17T01:01:35Z

openshift/origin#16248 is relevant but sadly not conclusive.
#2996 will record etcd logs for follow-up.

0xmichalis · 2018-05-22T10:39:25Z

/kind flake

0xmichalis · 2018-06-05T09:50:53Z

Hrm, we should probably switch to using SSDs anyway

https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md#disks

CecileRobertMichon added the orchestrator/openshift label May 10, 2018

acs-bot added the kind/flake test is flaky label May 22, 2018

This was referenced Jun 1, 2018

Replacing the static pod definition for etcd ip. #3100

Merged

Simplify error handling in rollout test #3156

Merged

pweil- mentioned this issue Jun 12, 2018

add default distro for agents. #3020

Merged

CecileRobertMichon closed this as completed Nov 12, 2018
