Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UPSTREAM: <carry>: openshift: return no nodegroup when scaling size is zero #42

Merged
merged 3 commits into from
Feb 22, 2019

Conversation

frobware
Copy link

Change nodeGroupForNode() to return nil when the underlying node group
has no scaling bounds. The underlying MachineSet in these cases is
either missing the following annotations:

annotations:
  machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
  machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"

or they are present but the result of max-size - min-size is zero.

This change helps to remove a lot of spurious error messages from the
core of the autoscaler logs at runtime. Without this change we see the
following printed every time the autoscaler scans the state of the
cluster (which defaults to 10s) and can be very misleading.

$ oc logs -f cluster-autoscaler-default-77f884d74b-527bq
E1204 08:26:01.996252       1 utils.go:467] Error while checking node group size worker-us-east-2a: group size not found
E1204 08:26:01.996545       1 utils.go:467] Error while checking node group size worker-us-east-2b: group size not found
E1204 08:26:01.996678       1 utils.go:467] Error while checking node group size worker-us-east-2c: group size not found

Those unit tests that were previously relying on nodeGroupForNode() to
always return a node group when no bounds had been specified have been
modified so that they now have a scaling bound >= 1.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1656334

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 21, 2019
@frobware
Copy link
Author

/cc @ingvagabund @enxebre @bison

@@ -415,6 +421,12 @@ func (c *machineController) nodeGroupForNode(node *apiv1.Node) (cloudprovider.No
return nil, fmt.Errorf("failed to build nodegroup for node %q: %v", node.Name, err)
}

// We don't scale from 0 so nodes must belong to a nodegroup
// that has a scale size of at least 1.
if nodegroup.MaxSize()-nodegroup.MinSize() < 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a testcase where you set both sizes to be equal and verify you get nil node group?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test cases now covered in TestControllerNodeGroupForNodeLookup() - 56bc385

@frobware
Copy link
Author

/test e2e-aws-operator

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 22, 2019
@frobware frobware force-pushed the fix-bz-1656334 branch 4 times, most recently from ce7f209 to 084f847 Compare February 22, 2019 14:04
These will be used/shared by the controller tests.
@ingvagabund
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2019
@frobware
Copy link
Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2019
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2019
@frobware
Copy link
Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2019
@frobware frobware changed the title UPSTREAM: <carry>: openshift: return no nodegroup when scaling size == 0 UPSTREAM: <carry>: openshift: return no nodegroup when scaling size is zero Feb 22, 2019
@frobware
Copy link
Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2019
@frobware
Copy link
Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 22, 2019
… == 0

Change nodeGroupForNode() to return nil when the underlying node group
has no scaling bounds. The underlying MachineSet in these cases is
either missing the following annotations:

```
annotations:
  machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
  machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"
```

or they are present but the result of `max-size - min-size` is zero.

This change helps to remove a lot of spurious error messages from the
core of the autoscaler logs at runtime. Without this change we see the
following printed every time the autoscaler scans the state of the
cluster (which defaults to 10s) and can be very misleading.

```console
$ oc logs -f cluster-autoscaler-default-77f884d74b-527bq
E1204 08:26:01.996252       1 utils.go:467] Error while checking node group size worker-us-east-2a: group size not found
E1204 08:26:01.996545       1 utils.go:467] Error while checking node group size worker-us-east-2b: group size not found
E1204 08:26:01.996678       1 utils.go:467] Error while checking node group size worker-us-east-2c: group size not found
```

Those unit tests that were previously relying on nodeGroupForNode() to
always return a node group when no bounds had been specified have been
modified so that they now have a scaling bound >= 1.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1656334
These are now covered by TestControllerNodeGroupForNodeLookup
@ingvagabund
Copy link
Member

/lgtm

Thanks for the improvements

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2019
@openshift-merge-robot openshift-merge-robot merged commit d8a4a30 into openshift:master Feb 22, 2019
@ingvagabund ingvagabund deleted the fix-bz-1656334 branch February 22, 2019 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants