
Increase wait time for nodes going ready #61

Merged
2 commits merged on Apr 15, 2019

Conversation

enxebre
Member

@enxebre enxebre commented Apr 12, 2019

Occasionally some nodes remain unready forever, presumably due to
https://bugzilla.redhat.com/show_bug.cgi?id=1698253, which causes https://bugzilla.redhat.com/show_bug.cgi?id=1698624.

Orthogonally, some tests are timing out even though the node eventually goes ready, hence this PR increases the polling time.
See all failures:
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/
e.g:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/781/

ip-10-0-133-147.ec2.internal causes "recover from deleted worker machines" to fail:

E0412 08:06:16.949021    4971 framework.go:448] Node "ip-10-0-133-147.ec2.internal" is not ready
E0412 08:06:16.968104    4971 framework.go:448] Node "ip-10-0-133-147.ec2.internal" is not ready

while in the next test it eventually goes ready:

I0412 08:06:28.961206    4971 utils.go:233] Node "ip-10-0-133-147.ec2.internal". Ready: true. Unschedulable: false

We have only started timing out recently because the time for a node to go ready has increased slightly, though it is still within a reasonable range. It is difficult to say why yet; it might be related to CRI-O changes, to skew between the bootimage and the machine-os-content image and the resulting pivot, to CI cloud rate limits, or to similar factors.
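
For context, here is a minimal sketch of the kind of node-readiness poll whose timeout this PR lengthens. It assumes a recent client-go and apimachinery's wait package; the 15-minute value, the constant, and the function names are illustrative only, not the framework's actual code:

```go
// Hypothetical sketch only; not the framework's real implementation.
package framework

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodesReadyTimeout is the (now longer) budget for nodes to report Ready.
const waitForNodesReadyTimeout = 15 * time.Minute

// isNodeReady reports whether the node's Ready condition is True.
func isNodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// waitUntilAllNodesAreReady polls until every node reports Ready, or the
// timeout above expires. Transient list errors do not abort the poll.
func waitUntilAllNodesAreReady(client kubernetes.Interface) error {
	return wait.PollImmediate(5*time.Second, waitForNodesReadyTimeout, func() (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			return false, nil
		}
		for i := range nodes.Items {
			if !isNodeReady(&nodes.Items[i]) {
				return false, nil
			}
		}
		return true, nil
	})
}
```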

Disables machine health check validation temporarily.

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 12, 2019
pkg/e2e/framework/framework.go (review comment, resolved)
Contributor

@bison bison left a comment


/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2019
@spangenberg
Contributor

/lgtm

@frobware
Contributor

/lgtm

@frobware
Contributor

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2019
@enxebre
Member Author

enxebre commented Apr 12, 2019

Unrelated failure; the cluster is failing to bootstrap:

level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-jpbyf6xh-79b09.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator console has not yet reported success: timed out waiting for the condition"

/retest

@enxebre
Member Author

enxebre commented Apr 12, 2019

/retest

@enxebre
Member Author

enxebre commented Apr 12, 2019

level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-3v9g58zn-79b09.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator console has not yet reported success: timed out waiting for the condition"

/retest

@vikaschoudhary16
Contributor

 {
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-04-12T14:55:46Z",
                "generation": 1,
                "name": "kube-scheduler",
                "resourceVersion": "30723",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/kube-scheduler",
                "uid": "0d9d8149-5d33-11e9-bc0f-12508e9d9c5e"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2019-04-12T15:33:36Z",
                        "reason": "NodeInstallerFailingInstallerPodFailed",
                        "status": "True",
                        "type": "Failing"
                    },

@enxebre
Member Author

enxebre commented Apr 12, 2019

Tests actually ran now; one failure:

[Fail] [Feature:Machines] Managed cluster should [It] grow and decrease when scaling different machineSets simultaneously
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:340

Node "ip-10-0-148-147.ec2.internal" is not ready. Conditions are: [{MemoryPressure False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {PIDPressure False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletHasSufficientPID kubelet has sufficient PID available} {Ready False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized}]

presumably due to https://bugzilla.redhat.com/show_bug.cgi?id=1698253 which causes https://bugzilla.redhat.com/show_bug.cgi?id=1698624

@enxebre
Member Author

enxebre commented Apr 12, 2019

/retest

@enxebre
Member Author

enxebre commented Apr 12, 2019

Failed to bootstrap again
/retest

@ingvagabund
Member

/hold

until #60 gets merged first

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2019
@enxebre
Member Author

enxebre commented Apr 15, 2019

level=info msg="Waiting up to 30m0s for the bootstrap-complete event..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-vl7zg76b-79b09.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator console has not yet reported success: timed out waiting for the condition"

/retest

@openshift-ci-robot openshift-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm Indicates that a PR is ready to be merged. labels Apr 15, 2019
@ingvagabund
Member

/hold cancel
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019
@enxebre
Member Author

enxebre commented Apr 15, 2019

/test k8s-e2e

1 similar comment
@enxebre
Member Author

enxebre commented Apr 15, 2019

/test k8s-e2e

Machine health checking is tech preview; its e2e test notably increases the suite running time and has been failing recently. We need to reduce variables in order to get CI back to green. See 54633c2
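
For illustration only, a minimal sketch of how such a spec could be temporarily disabled with a Ginkgo skip. This is not the actual change in 54633c2; the package and spec names below are invented:

```go
// Hypothetical sketch only; not the real change from commit 54633c2.
package operators

import (
	. "github.com/onsi/ginkgo"
)

var _ = Describe("[Feature:MachineHealthCheck] MachineHealthCheck controller", func() {
	It("recreates machines that fail health checks", func() {
		// Tech preview feature: skipped for now to shorten the suite and
		// reduce CI flakiness while node-readiness timeouts are investigated.
		Skip("machine health check validation temporarily disabled")
	})
})
```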
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019
@ingvagabund
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019
@enxebre
Member Author

enxebre commented Apr 15, 2019


[Fail] [Feature:Operators] Machine API operator deployment should [It] reconcile controllers deployment
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/operators/machine-api-operator.go:38

/retest

@enxebre
Member Author

enxebre commented Apr 15, 2019

/retest

@enxebre
Member Author

enxebre commented Apr 15, 2019

[Fail] [Feature:Machines] Managed cluster should [It] recover from deleted worker machines
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:258

[Fail] [Feature:Machines] Managed cluster should [It] grow or decrease when scaling out or in
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:268

[Fail] [Feature:Machines] Managed cluster should [It] grow and decrease when scaling different machineSets simultaneously
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:332

Was ip-10-0-133-212.ec2.internal removed from the cluster? There is no trace of it in the artifact logs. From the e2e screen logging:

I0415 10:31:27.008003    4822 utils.go:233] Node "ip-10-0-133-212.ec2.internal". Ready: false. Unschedulable: false
I0415 10:31:27.008009    4822 utils.go:233] Node "ip-10-0-146-9.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.008013    4822 utils.go:233] Node "ip-10-0-157-82.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.008018    4822 utils.go:233] Node "ip-10-0-160-176.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.008026    4822 utils.go:233] Node "ip-10-0-162-225.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.049718    4822 utils.go:89] Cluster size is 6 nodes
I0415 10:31:27.049771    4822 utils.go:258] waiting for all nodes to be ready
E0415 10:31:27.096996    4822 framework.go:448] Node "ip-10-0-133-212.ec2.internal" is not ready
E0415 10:31:28.207743    4822 framework.go:448] Node "ip-10-0-133-212.ec2.internal" is not ready
E0415 10:31:29.208334    4822 framework.go:448] Node "ip-10-0-133-212.ec2.internal" is not ready
I0415 10:31:30.174338    4822 utils.go:263] waiting for all nodes to be schedulable
I0415 10:31:30.214886    4822 utils.go:290] Node "ip-10-0-128-8.ec2.internal" is schedulable
I0415 10:31:30.214939    4822 utils.go:290] Node "ip-10-0-146-9.ec2.internal" is schedulable
I0415 10:31:30.214944    4822 utils.go:290] Node "ip-10-0-157-82.ec2.internal" is schedulable
I0415 10:31:30.214948    4822 utils.go:290] Node "ip-10-0-160-176.ec2.internal" is schedulable
I0415 10:31:30.214951    4822 utils.go:290] Node "ip-10-0-162-225.ec2.internal" is schedulable
I0415 10:31:30.214958    4822 utils.go:268] waiting for each node to be backed by a machine
I0415 10:31:30.298112    4822 utils.go:49] Expecting the same number of machines and nodes, have 5 nodes and 6 machines
I0415 10:31:35.449887    4822 utils.go:49] Expecting the same number of machines and nodes, have 5 nodes and 6 machines

For nodes failing to go ready, search for ip-10-0-128-5.ec2.internal in the worker journal: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/61/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/171/artifacts/e2e-aws-operator/nodes/
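
As context for the "Expecting the same number of machines and nodes" messages above, a rough sketch of what such a wait loop could look like. It assumes a recent controller-runtime client and the machine.openshift.io/v1beta1 types from openshift/api; none of the names or import paths below are taken from the repository's actual utils.go:

```go
// Hypothetical sketch only; the real helper lives in the repository's utils.go.
package framework

import (
	"context"
	"fmt"
	"time"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1" // assumed import path
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	runtimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitUntilNodesMatchMachines polls until the number of nodes equals the
// number of machines, mirroring the "Expecting the same number of machines
// and nodes" message in the test output above.
func waitUntilNodesMatchMachines(c runtimeclient.Client) error {
	return wait.PollImmediate(5*time.Second, 15*time.Minute, func() (bool, error) {
		machines := &machinev1beta1.MachineList{}
		if err := c.List(context.TODO(), machines); err != nil {
			return false, nil // retry on transient API errors
		}
		nodes := &corev1.NodeList{}
		if err := c.List(context.TODO(), nodes); err != nil {
			return false, nil
		}
		if len(nodes.Items) != len(machines.Items) {
			fmt.Printf("Expecting the same number of machines and nodes, have %d nodes and %d machines\n",
				len(nodes.Items), len(machines.Items))
			return false, nil
		}
		return true, nil
	})
}
```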

@enxebre
Member Author

enxebre commented Apr 15, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit f9925b2 into openshift:master Apr 15, 2019
enxebre added a commit to enxebre/machine-api-operator that referenced this pull request Apr 15, 2019
frobware added a commit to frobware/cluster-autoscaler-operator that referenced this pull request Apr 16, 2019
frobware added a commit to frobware/cluster-api-provider-aws that referenced this pull request Apr 16, 2019
enxebre added a commit to enxebre/cluster-api-provider-aws-2 that referenced this pull request Apr 16, 2019