
Node never becomes Ready #32

Closed
abhinavdahiya opened this issue Nov 12, 2018 · 9 comments

@abhinavdahiya
Contributor

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/657/pull-ci-openshift-installer-master-e2e-aws/1345?log#log
Saw this error in one of the CI runs; one of the masters failed to become Ready:

NAME                           STATUS     ROLES     AGE       VERSION
ip-10-0-1-38.ec2.internal      Ready      master    29m       v1.11.0+d4cacc0
ip-10-0-128-19.ec2.internal    Ready      worker    24m       v1.11.0+d4cacc0
ip-10-0-156-32.ec2.internal    Ready      worker    24m       v1.11.0+d4cacc0
ip-10-0-174-183.ec2.internal   Ready      worker    23m       v1.11.0+d4cacc0
ip-10-0-27-9.ec2.internal      NotReady   master    29m       v1.11.0+d4cacc0
ip-10-0-46-179.ec2.internal    Ready      master    29m       v1.11.0+d4cacc0

And looking at the OVS pod log on that node:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/657/pull-ci-openshift-installer-master-e2e-aws/1345/artifacts/e2e-aws/pods/openshift-sdn_ovs-fn8d2_openvswitch.log.gz

/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db ovsdb-tool: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
[FAILED]

/cc @squeed

@smarterclayton
Contributor

You can clear this by deleting the OVS pod and it'll recover, but then it looks like the openshift-kube-apiserver-operator gets hung and does nothing. I copied logs into team-master for them to investigate.
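The recovery step above can be sketched with `oc` (the pod name `ovs-fn8d2` is the one from the log URL above; in general the target is whichever crash-looping `ovs-*` pod is on the NotReady node):

```shell
# Find the crash-looping OVS pod on the NotReady node, then delete it;
# the DaemonSet controller recreates it on the same node.
oc -n openshift-sdn get pods -o wide
oc -n openshift-sdn delete pod ovs-fn8d2
```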

@wking
Member

wking commented Nov 13, 2018

I had two workers never become ready for a recent CI run:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/nodes.json | jq '[.items[] | {"name": .metadata.name, "machine": .metadata.annotations.machine, "ready": (.status.conditions[] | select(.type == "Ready") | .status)}]'
[
  {
    "name": "ip-10-0-13-33.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-0",
    "ready": "True"
  },
  {
    "name": "ip-10-0-130-112.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1a-5ttjv",
    "ready": "False"
  },
  {
    "name": "ip-10-0-154-238.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1b-t9gzw",
    "ready": "False"
  },
  {
    "name": "ip-10-0-168-175.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-worker-us-east-1c-fd75b",
    "ready": "True"
  },
  {
    "name": "ip-10-0-17-71.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-1",
    "ready": "True"
  },
  {
    "name": "ip-10-0-33-245.ec2.internal",
    "machine": "openshift-cluster-api/ci-op-ptkkr88f-1d3f3-master-2",
    "ready": "True"
  }
]

And from the crash-looping container logs, the same lockfile issue @abhinavdahiya saw:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-sdn_ovs-hdjv5_openvswitch.log.gz | gunzip
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db ovsdb-tool: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
[FAILED]
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/660/pull-ci-openshift-installer-master-e2e-aws/1356/artifacts/e2e-aws/pods/openshift-sdn_ovs-xjfdc_openvswitch.log.gz | gunzip
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db ovsdb-tool: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
[FAILED]
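The "Resource temporarily unavailable" text is the `EWOULDBLOCK`/`EAGAIN` error that a non-blocking file lock returns when another process already holds the lock. A minimal sketch reproducing it with util-linux `flock(1)` on a throwaway file (the path is a stand-in; ovsdb-server's real lockfile sits next to `conf.db`):

```shell
# Stand-in lockfile; the real one is ovsdb-server's lockfile in /etc/openvswitch.
lockfile=/tmp/conf.db.demo.lock
exec 9>"$lockfile"            # open the lockfile on fd 9
flock -n 9                    # first holder: non-blocking lock succeeds
first_status=$?
flock -n "$lockfile" -c true  # second opener is denied while the lock is held
second_status=$?
[ "$second_status" -ne 0 ] && \
  echo "failed to lock lockfile (Resource temporarily unavailable)"
exec 9>&-                     # closing fd 9 releases the lock
```

The second `flock` fails even though the holder is still alive, which matches the symptom here: the error says nothing about *who* holds the lock, only that someone does.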

The symptoms in the e2e-aws build logs were:

error: watch closed before Until timeout
error openshift-ingress/ds/router-default did not come up
Waiting for daemon set "router-default" rollout to finish: 1 of 3 updated pods are available...
E1113 05:55:10.455454     694 streamwatcher.go:109] Unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)
error: watch closed before Until timeout
timeout waiting for openshift-ingress/ds/router-default to be available
2018/11/13 05:55:11 Container test in pod e2e-aws failed, exit code 1, reason Error

@wking
Member

wking commented Nov 13, 2018

This issue has come up a few times before on the OVS list. Here the guess was that there was an existing ovsdb-server running, although that wasn't what was happening there. Here it seemed to be an SELinux issue, but I don't see why that would work on some of our nodes and not others. From this comment in openshift/installer#600, it sounds like there may be SELinux context flipping for reasons that I haven't wrapped my head around yet. Maybe that's the source of the success on some nodes and failures on others. This comment floats the possibility of a CRI-O bug around runAsUser: 0.

@wking
Member

wking commented Nov 13, 2018

CRI-O issue is cri-o/cri-o#1904

@squeed
Contributor

squeed commented Nov 13, 2018

Weird, it is intermittently unable to create the database file. There's absolutely nothing fancy about this at all - just a call to create(). Digging further... (this daemonset has worked very reliably on RHEL + Docker, so I suspect the switch to RHCOS + CRI-O is exposing issues)

For those of you who are debugging, capture a journalctl and see if there are SELinux errors.

I'm looking into this now.
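One way to do the check suggested above, on the NotReady node (assumes the systemd journal is available, and auditd for the second command):

```shell
# SELinux AVC denials logged to the journal since boot
journalctl -b | grep -i 'avc.*denied'
# Same, via the audit log (if auditd is running)
ausearch -m avc -ts recent
# Check the SELinux context on the OVS database directory
ls -lZ /etc/openvswitch/
```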

@squeed
Contributor

squeed commented Nov 13, 2018

Never mind, that's the error message when the lockfile is... locked. Seems that an old process could be lying around.

Adding some logging and process-killing logic in #33.
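To check for a stale holder by hand, one could ask which processes have the lockfile open (assumption: ovsdb keeps its lockfile next to `conf.db` as `.conf.db.~lock~`, which is the usual OVS naming):

```shell
# Show PIDs that have the lockfile open, then inspect them
fuser -v /etc/openvswitch/.conf.db.~lock~
ps -fp "$(fuser /etc/openvswitch/.conf.db.~lock~ 2>/dev/null)"
```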

@abhinavdahiya
Contributor Author

For those of you who are debugging, capture a journalctl and see if there are SELinux errors.

@squeed do you have some good troubleshooting steps that might come in handy? We can add them to https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md

@squeed
Contributor

squeed commented Nov 14, 2018

There's another reliability improvement in #37. Then we should start seeing fewer issues.

@squeed
Contributor

squeed commented Nov 22, 2018

I think we can close this; the obvious SDN issues seem to have been resolved.

@squeed squeed closed this as completed Nov 22, 2018