
Increase wait time for nodes going ready #61

Merged
2 commits merged on Apr 15, 2019

Conversation

enxebre
Member

@enxebre enxebre commented Apr 12, 2019

Occasionally some nodes remain unready forever, presumably due to
https://bugzilla.redhat.com/show_bug.cgi?id=1698253, which causes https://bugzilla.redhat.com/show_bug.cgi?id=1698624.

Orthogonally, some tests are timing out even though the node eventually goes ready, hence this PR increases the polling time.
See all failures:
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/
e.g:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/261/pull-ci-openshift-machine-api-operator-master-e2e-aws-operator/781/

ip-10-0-133-147.ec2.internal causes "recover from deleted worker machines" to fail:

E0412 08:06:16.949021    4971 framework.go:448] Node "ip-10-0-133-147.ec2.internal" is not ready
E0412 08:06:16.968104    4971 framework.go:448] Node "ip-10-0-133-147.ec2.internal" is not ready

while in the next test it eventually goes ready:

I0412 08:06:28.961206    4971 utils.go:233] Node "ip-10-0-133-147.ec2.internal". Ready: true. Unschedulable: false

We have only started timing out recently because the time for a node to go ready has increased slightly, though it is still within a reasonable range. It is difficult to say why yet; it might be related to CRI-O changes, to skew between the bootimage and the machine-os-content image and the resulting pivot, to CI cloud rate limits, or to similar factors.
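
For context, here is a minimal sketch of the kind of node-readiness poll whose timeout this PR lengthens. It assumes a recent client-go and apimachinery's wait package; the 15-minute value, the constant, and the function names are illustrative only, not the framework's actual code:

```go
// Hypothetical sketch only; not the framework's real implementation.
package framework

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodesReadyTimeout is the (now longer) budget for nodes to report Ready.
const waitForNodesReadyTimeout = 15 * time.Minute

// isNodeReady reports whether the node's Ready condition is True.
func isNodeReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// waitUntilAllNodesAreReady polls until every node reports Ready, or the
// timeout above expires. Transient list errors do not abort the poll.
func waitUntilAllNodesAreReady(client kubernetes.Interface) error {
	return wait.PollImmediate(5*time.Second, waitForNodesReadyTimeout, func() (bool, error) {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			return false, nil
		}
		for i := range nodes.Items {
			if !isNodeReady(&nodes.Items[i]) {
				return false, nil
			}
		}
		return true, nil
	})
}
```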

Disables machine health check validation temporarily.

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 12, 2019
pkg/e2e/framework/framework.go (review comment, resolved)
Contributor

@bison bison left a comment


/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2019
@spangenberg
Contributor

/lgtm

@frobware
Contributor

/lgtm

@frobware
Contributor

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2019
@enxebre
Member Author

enxebre commented Apr 12, 2019

Unrelated failure; the cluster is failing to bootstrap:

level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-jpbyf6xh-79b09.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator console has not yet reported success: timed out waiting for the condition"

/retest

@enxebre
Member Author

enxebre commented Apr 12, 2019

/retest

@enxebre
Member Author

enxebre commented Apr 12, 2019

level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-3v9g58zn-79b09.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator console has not yet reported success: timed out waiting for the condition"

/retest

@vikaschoudhary16
Contributor

 {
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-04-12T14:55:46Z",
                "generation": 1,
                "name": "kube-scheduler",
                "resourceVersion": "30723",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/kube-scheduler",
                "uid": "0d9d8149-5d33-11e9-bc0f-12508e9d9c5e"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2019-04-12T15:33:36Z",
                        "reason": "NodeInstallerFailingInstallerPodFailed",
                        "status": "True",
                        "type": "Failing"
                    },

@enxebre
Member Author

enxebre commented Apr 12, 2019

Tests actually ran now; one failure:

[Fail] [Feature:Machines] Managed cluster should [It] grow and decrease when scaling different machineSets simultaneously
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:340

Node "ip-10-0-148-147.ec2.internal" is not ready. Conditions are: [{MemoryPressure False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletHasSufficientMemory kubelet has sufficient memory available} {DiskPressure False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletHasNoDiskPressure kubelet has no disk pressure} {PIDPressure False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletHasSufficientPID kubelet has sufficient PID available} {Ready False 2019-04-12 16:23:31 +0000 UTC 2019-04-12 16:23:21 +0000 UTC KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni config uninitialized}]

presumably due to https://bugzilla.redhat.com/show_bug.cgi?id=1698253 which causes https://bugzilla.redhat.com/show_bug.cgi?id=1698624

@enxebre
Member Author

enxebre commented Apr 12, 2019

/retest

@enxebre
Member Author

enxebre commented Apr 12, 2019

Failed to bootstrap again
/retest

@ingvagabund
Member

/hold

until #60 gets merged first

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2019
@enxebre
Member Author

enxebre commented Apr 15, 2019

level=info msg="Waiting up to 30m0s for the bootstrap-complete event..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-vl7zg76b-79b09.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Cluster operator console has not yet reported success: timed out waiting for the condition"

/retest

@openshift-ci-robot openshift-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm Indicates that a PR is ready to be merged. labels Apr 15, 2019
@ingvagabund
Member

/hold cancel
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019
@enxebre
Member Author

enxebre commented Apr 15, 2019

/test k8s-e2e

1 similar comment
@enxebre
Member Author

enxebre commented Apr 15, 2019

/test k8s-e2e

Machine health checking is tech preview; its e2e test notably increases the suite running time and has been failing recently. We need to reduce variables in order to get CI back to green. See 54633c2
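
For illustration only, a minimal sketch of how such a spec could be temporarily disabled with a Ginkgo skip. This is not the actual change in 54633c2; the package and spec names below are invented:

```go
// Hypothetical sketch only; not the real change from commit 54633c2.
package operators

import (
	. "github.com/onsi/ginkgo"
)

var _ = Describe("[Feature:MachineHealthCheck] MachineHealthCheck controller", func() {
	It("recreates machines that fail health checks", func() {
		// Tech preview feature: skipped for now to shorten the suite and
		// reduce CI flakiness while node-readiness timeouts are investigated.
		Skip("machine health check validation temporarily disabled")
	})
})
```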
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019
@ingvagabund
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2019
@enxebre
Member Author

enxebre commented Apr 15, 2019


[Fail] [Feature:Operators] Machine API operator deployment should [It] reconcile controllers deployment
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/operators/machine-api-operator.go:38

/retest

@enxebre
Member Author

enxebre commented Apr 15, 2019

/retest

@enxebre
Member Author

enxebre commented Apr 15, 2019

[Fail] [Feature:Machines] Managed cluster should [It] recover from deleted worker machines
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:258

[Fail] [Feature:Machines] Managed cluster should [It] grow or decrease when scaling out or in
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:268

[Fail] [Feature:Machines] Managed cluster should [It] grow and decrease when scaling different machineSets simultaneously
/go/src/github.com/openshift/cluster-api-actuator-pkg/pkg/e2e/infra/infra.go:332

Was ip-10-0-133-212.ec2.internal removed from the cluster? There is no trace of it in the artifact logs. From the e2e screen logging:

I0415 10:31:27.008003    4822 utils.go:233] Node "ip-10-0-133-212.ec2.internal". Ready: false. Unschedulable: false
I0415 10:31:27.008009    4822 utils.go:233] Node "ip-10-0-146-9.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.008013    4822 utils.go:233] Node "ip-10-0-157-82.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.008018    4822 utils.go:233] Node "ip-10-0-160-176.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.008026    4822 utils.go:233] Node "ip-10-0-162-225.ec2.internal". Ready: true. Unschedulable: false
I0415 10:31:27.049718    4822 utils.go:89] Cluster size is 6 nodes
I0415 10:31:27.049771    4822 utils.go:258] waiting for all nodes to be ready
E0415 10:31:27.096996    4822 framework.go:448] Node "ip-10-0-133-212.ec2.internal" is not ready
E0415 10:31:28.207743    4822 framework.go:448] Node "ip-10-0-133-212.ec2.internal" is not ready
E0415 10:31:29.208334    4822 framework.go:448] Node "ip-10-0-133-212.ec2.internal" is not ready
I0415 10:31:30.174338    4822 utils.go:263] waiting for all nodes to be schedulable
I0415 10:31:30.214886    4822 utils.go:290] Node "ip-10-0-128-8.ec2.internal" is schedulable
I0415 10:31:30.214939    4822 utils.go:290] Node "ip-10-0-146-9.ec2.internal" is schedulable
I0415 10:31:30.214944    4822 utils.go:290] Node "ip-10-0-157-82.ec2.internal" is schedulable
I0415 10:31:30.214948    4822 utils.go:290] Node "ip-10-0-160-176.ec2.internal" is schedulable
I0415 10:31:30.214951    4822 utils.go:290] Node "ip-10-0-162-225.ec2.internal" is schedulable
I0415 10:31:30.214958    4822 utils.go:268] waiting for each node to be backed by a machine
I0415 10:31:30.298112    4822 utils.go:49] Expecting the same number of machines and nodes, have 5 nodes and 6 machines
I0415 10:31:35.449887    4822 utils.go:49] Expecting the same number of machines and nodes, have 5 nodes and 6 machines

For nodes failing to go ready, search for ip-10-0-128-5.ec2.internal in the worker journal: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/61/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/171/artifacts/e2e-aws-operator/nodes/
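
As context for the "Expecting the same number of machines and nodes" messages above, a rough sketch of what such a wait loop could look like. It assumes a recent controller-runtime client and the machine.openshift.io/v1beta1 types from openshift/api; none of the names or import paths below are taken from the repository's actual utils.go:

```go
// Hypothetical sketch only; the real helper lives in the repository's utils.go.
package framework

import (
	"context"
	"fmt"
	"time"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1" // assumed import path
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	runtimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitUntilNodesMatchMachines polls until the number of nodes equals the
// number of machines, mirroring the "Expecting the same number of machines
// and nodes" message in the test output above.
func waitUntilNodesMatchMachines(c runtimeclient.Client) error {
	return wait.PollImmediate(5*time.Second, 15*time.Minute, func() (bool, error) {
		machines := &machinev1beta1.MachineList{}
		if err := c.List(context.TODO(), machines); err != nil {
			return false, nil // retry on transient API errors
		}
		nodes := &corev1.NodeList{}
		if err := c.List(context.TODO(), nodes); err != nil {
			return false, nil
		}
		if len(nodes.Items) != len(machines.Items) {
			fmt.Printf("Expecting the same number of machines and nodes, have %d nodes and %d machines\n",
				len(nodes.Items), len(machines.Items))
			return false, nil
		}
		return true, nil
	})
}
```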

@enxebre
Member Author

enxebre commented Apr 15, 2019

/retest

@openshift-merge-robot openshift-merge-robot merged commit f9925b2 into openshift:master Apr 15, 2019
enxebre added a commit to enxebre/machine-api-operator that referenced this pull request Apr 15, 2019
frobware added a commit to frobware/cluster-autoscaler-operator that referenced this pull request Apr 16, 2019
frobware added a commit to frobware/cluster-api-provider-aws that referenced this pull request Apr 16, 2019
enxebre added a commit to enxebre/cluster-api-provider-aws-2 that referenced this pull request Apr 16, 2019