
Node never becomes Ready #32

Closed
abhinavdahiya opened this issue Nov 12, 2018 · 9 comments

@abhinavdahiya
Contributor

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/657/pull-ci-openshift-installer-master-e2e-aws/1345?log#log
Saw this error in one of the CI runs; one of the masters failed to become Ready:

NAME                           STATUS     ROLES     AGE       VERSION
ip-10-0-1-38.ec2.internal      Ready      master    29m       v1.11.0+d4cacc0
ip-10-0-128-19.ec2.internal    Ready      worker    24m       v1.11.0+d4cacc0
ip-10-0-156-32.ec2.internal    Ready      worker    24m       v1.11.0+d4cacc0
ip-10-0-174-183.ec2.internal   Ready      worker    23m       v1.11.0+d4cacc0
ip-10-0-27-9.ec2.internal      NotReady   master    29m       v1.11.0+d4cacc0
ip-10-0-46-179.ec2.internal    Ready      master    29m       v1.11.0+d4cacc0

And looking at the OVS pod log on that node:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/657/pull-ci-openshift-installer-master-e2e-aws/1345/artifacts/e2e-aws/pods/openshift-sdn_ovs-fn8d2_openvswitch.log.gz

/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db ovsdb-tool: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
[FAILED]

/cc @squeed

@smarterclayton
Contributor

You can clear this by deleting the OVS pod and it'll recover, but then it looks like the openshift-kube-apiserver-operator gets hung and does nothing. I copied logs into team-master for them to investigate.
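The recovery step above can be sketched with `oc` (the pod name `ovs-fn8d2` is the one from the log URL above; in general the target is whichever crash-looping `ovs-*` pod is on the NotReady node):

```shell
# Find the crash-looping OVS pod on the NotReady node, then delete it;
# the DaemonSet controller recreates it on the same node.
oc -n openshift-sdn get pods -o wide
oc -n openshift-sdn delete pod ovs-fn8d2
```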

@wking
Member

wking commented Nov 13, 2018

I had two workers never become ready for a recent CI run:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/nodes.json | jq '[.items[] | {"name": .metadata.name, "machine": .metadata.annotations.machine, "ready": (.status.conditions[] | select(.type == "Ready") | .status)}]'
[
  {
    "name": "ip-10-0-13-33.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-0",
    "ready": "True"
  },
  {
    "name": "ip-10-0-130-112.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1a-5ttjv",
    "ready": "False"
  },
  {
    "name": "ip-10-0-154-238.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1b-t9gzw",
    "ready": "False"
  },
  {
    "name": "ip-10-0-168-175.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1c-fd75b",
    "ready": "True"
  },
  {
    "name": "ip-10-0-17-71.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-1",
    "ready": "True"
  },
  {
    "name": "ip-10-0-33-245.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-2",
    "ready": "True"
  }
]

And from the crash-looping container logs, the same lockfile issue @abhinavdahiya saw:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-sdn_ovs-hdjv5_openvswitch.log.gz | gunzip
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db ovsdb-tool: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
[FAILED]
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-sdn_ovs-xjfdc_openvswitch.log.gz | gunzip
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db ovsdb-tool: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
[FAILED]
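The "Resource temporarily unavailable" text is the `EWOULDBLOCK`/`EAGAIN` error that a non-blocking file lock returns when another process already holds the lock. A minimal sketch reproducing it with util-linux `flock(1)` on a throwaway file (the path is a stand-in; ovsdb-server's real lockfile sits next to `conf.db`):

```shell
# Stand-in lockfile; the real one is ovsdb-server's lockfile in /etc/openvswitch.
lockfile=/tmp/conf.db.demo.lock
exec 9>"$lockfile"            # open the lockfile on fd 9
flock -n 9                    # first holder: non-blocking lock succeeds
first_status=$?
flock -n "$lockfile" -c true  # second opener is denied while the lock is held
second_status=$?
[ "$second_status" -ne 0 ] && \
  echo "failed to lock lockfile (Resource temporarily unavailable)"
exec 9>&-                     # closing fd 9 releases the lock
```

The second `flock` fails even though the holder is still alive, which matches the symptom here: the error says nothing about *who* holds the lock, only that someone does.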

The symptoms in the e2e-aws build logs were:

error: watch closed before Until timeout
error openshift-ingress/ds/router-default did not come up
Waiting for daemon set "router-default" rollout to finish: 1 of 3 updated pods are available...
E1113 05:55:10.455454     694 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
error: watch closed before Until timeout
timeout waiting for openshift-ingress/ds/router-default to be available
2018/11/13 05:55:11 Container test in pod e2e-aws failed, exit code 1, reason Error

@wking
Member

wking commented Nov 13, 2018

This issue has come up a few times before on the OVS list. Here the guess was that there was an existing ovsdb-server running, although that wasn't what was happening there. Here it seemed to be an SELinux issue, but I don't see why that would work on some of our nodes and not others. From this comment in openshift/installer#600, it sounds like there may be SELinux context flipping for reasons that I haven't wrapped my head around yet. Maybe that's the source of the success on some nodes and failures on others. This comment floats the possibility of a CRI-O bug around runAsUser: 0.

@wking
Member

wking commented Nov 13, 2018

CRI-O issue is cri-o/cri-o#1904

@squeed
Contributor

squeed commented Nov 13, 2018

Weird, it is intermittently unable to create the database file. There's absolutely nothing fancy about this at all - just a call to create(). Digging further... (this daemonset has worked very reliably on RHEL + Docker, so I suspect the switch to RHCOS + CRI-O is exposing issues)

For those of you who are debugging, capture a journalctl and see if there are SELinux errors.

I'm looking into this now.
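One way to do the check suggested above, on the NotReady node (assumes the systemd journal is available, and auditd for the second command):

```shell
# SELinux AVC denials logged to the journal since boot
journalctl -b | grep -i 'avc.*denied'
# Same, via the audit log (if auditd is running)
ausearch -m avc -ts recent
# Check the SELinux context on the OVS database directory
ls -lZ /etc/openvswitch/
```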

@squeed
Contributor

squeed commented Nov 13, 2018

Never mind, that's the error message when the lockfile is... locked. Seems that an old process could be lying around.

Adding some logging and process-killing logic in #33.
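To check for a stale holder by hand, one could ask which processes have the lockfile open (assumption: ovsdb keeps its lockfile next to `conf.db` as `.conf.db.~lock~`, which is the usual OVS naming):

```shell
# Show PIDs that have the lockfile open, then inspect them
fuser -v /etc/openvswitch/.conf.db.~lock~
ps -fp "$(fuser /etc/openvswitch/.conf.db.~lock~ 2>/dev/null)"
```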

@abhinavdahiya
Contributor Author

For those of you who are debugging, capture a journalctl and see if there are SELinux errors.

@squeed do you have some good troubleshooting steps that might come in handy? We can add them to https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md

@squeed
Contributor

squeed commented Nov 14, 2018

There's another reliability improvement in #37. Then we should start seeing fewer issues.

@squeed
Contributor

squeed commented Nov 22, 2018

I think we can close this; the obvious SDN issues seem to have been resolved.

@squeed squeed closed this as completed Nov 22, 2018