BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups #152
Conversation
@JoelSpeed: This pull request references Bugzilla bug 1824215, which is invalid. Comment /bugzilla refresh to re-evaluate validity once the bug has been updated.

/bugzilla refresh
@JoelSpeed: This pull request references Bugzilla bug 1824215, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
this looks nice to me, thanks Joel!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: elmiko. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Thanks! This is purely autoscaler core, can we get a counter PR upstream?
/lgtm
/retest

Please review the full test history for this PR and help us cut down flakes.
```diff
@@ -124,8 +126,10 @@ func IsCloudProviderNodeInfoSimilar(n1, n2 *schedulernodeinfo.NodeInfo, ignoredL
 	switch kind {
 	case apiv1.ResourceMemory:
 		// For memory capacity we allow a small tolerance
-		memoryDifference := math.Abs(float64(qtyList[0].Value()) - float64(qtyList[1].Value()))
-		if memoryDifference > MaxMemoryDifferenceInKiloBytes {
+		difference := absSub(qtyList[0], qtyList[1])
```
Could just use math.Abs?
I wanted to keep all of the quantities as resource.Quantity's, so this helper allows us to do that and avoid converting to and from integers, reducing the likelihood of a mistake being made there.
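For illustration, a minimal sketch of the kind of helper being described (this is a sketch under those constraints, not necessarily the PR's exact code):

```go
import "k8s.io/apimachinery/pkg/api/resource"

// absSub returns the absolute difference between two resource.Quantity
// values, staying in Quantity arithmetic rather than converting to integers.
func absSub(a, b resource.Quantity) resource.Quantity {
	diff := a.DeepCopy() // copy so neither argument is mutated
	diff.Sub(b)
	if diff.Sign() < 0 {
		diff.Neg()
	}
	return diff
}
```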
```go
var (
	// MaxMemoryDifference describes how much memory capacity can differ but still be considered equal.
	MaxMemoryDifference = resource.MustParse("256Mi")
)
```
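Putting the pieces together, the truncated hunk above would plausibly continue along these lines (a sketch, not the PR's verbatim code):

```go
// Inside the apiv1.ResourceMemory case of IsCloudProviderNodeInfoSimilar:
difference := absSub(qtyList[0], qtyList[1])
if difference.Cmp(MaxMemoryDifference) > 0 {
	// The capacities differ by more than the tolerance, so the
	// node groups are not considered similar.
	return false
}
```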
How big is the diff we are seeing on real nodes?
Tolerating 256Mi seems like too much to consider nodeGroups equal. The original intention was to tolerate 128Ki (kubernetes@e8b3c2a).
I think this should be 256Ki.
Please see the test case I've added below, which came from a real-world test case. The values that came through the code (via much debug logging) were 16116152Ki and 15944120Ki, a difference of 168Mi, just over 1% in this case.
Please also review the attached BZ, which has more details from a customer who reported differences of a similar magnitude.
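For concreteness, the numbers quoted here can be checked with a short program (capacity values taken from the comment above; the program itself is just an illustration):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Memory capacities observed on two m5.xlarge nodes, per the comment above.
	a := resource.MustParse("16116152Ki")
	b := resource.MustParse("15944120Ki")

	diff := a.DeepCopy()
	diff.Sub(b)

	ratio := float64(diff.Value()) / float64(a.Value())
	fmt.Printf("difference: %s (%.2f%% of capacity)\n", diff.String(), ratio*100)
	// Output: difference: 168Mi (1.07% of capacity)
}
```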
Wouldn't this MaxMemoryDifference also apply to much smaller instances, to the point of making the check lose its value?
I.e. if the possible diff range increases with the instance size, should we maybe make our tolerance window a percentage of the given total size? (See the sketch below.)
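For reference, a ratio-based tolerance along the lines suggested here could look something like this sketch (the constant, its value, and the function name are illustrative, not from this PR):

```go
import (
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// maxMemoryDifferenceRatio is a hypothetical relative tolerance: memory
// capacities within 1.5% of each other are treated as equal, so the
// tolerance window scales with instance size.
const maxMemoryDifferenceRatio = 0.015

func memoryWithinTolerance(a, b resource.Quantity) bool {
	larger := math.Max(float64(a.Value()), float64(b.Value()))
	diff := math.Abs(float64(a.Value()) - float64(b.Value()))
	return diff <= larger*maxMemoryDifferenceRatio
}
```

For what it's worth, the upstream autoscaler later adopted a ratio-based check of this general shape.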
Let's keep the discussion to the upstream PR for better visibility https://github.com/kubernetes/autoscaler/pull/3124/files#r422931565
/lgtm cancel
Counter PR will be created shortly.
Force-pushed from d52511d to d26853f.
Upstream: kubernetes#3124
Allow small tolerance on memory capacity when comparing nodegroups

This allows developers to better interpret how the calculations are being done by leaving the values as "Quantities". For example, the max difference is now a string converted to a quantity, which will be easier to reason about and update if needed in the future. Also adds tests that match real values from a real set of nodes that would be expected to be the same.
Force-pushed from b19f2ee to 90751c4.
@JoelSpeed: This pull request references Bugzilla bug 1824215, which is valid. 3 validation(s) were run on this bug.
/hold cancel

This has been updated to reflect the upstream implementation, which should be merging within the next week.
@JoelSpeed: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
thanks Joel!
@JoelSpeed: All pull requests linked via external trackers have merged: openshift/kubernetes-autoscaler#152, openshift/kubernetes-autoscaler#144. Bugzilla bug 1824215 has been moved to the MODIFIED state.
/cherry-pick release-4.5
@JoelSpeed: new pull request created: #157
This allows a small tolerance in the memory capacity of nodes to allow better matching of similar node groups. The memory capacity that Kubernetes reports can vary between otherwise identical instances from a cloud provider.

Also adds tests that match real values from a real set of nodes that would be expected to be the same (the same instance type across multiple availability zones within a given region).

E.g. in testing I saw AWS m5.xlarge nodes with capacities such as 16116152Ki and 15944120Ki, not only across availability zones, but within the same availability zone after a few cycles through machines. This is a difference of 168Mi, which is much larger than the original tolerance of 128000 bytes, which was preventing BalanceSimilarNodeGroups from balancing across these availability zones.
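As a rough illustration of the kind of test described, using the absSub helper and MaxMemoryDifference variable sketched in the review thread above (hypothetical names and package layout, not the PR's actual test code):

```go
package nodegroupset

import (
	"testing"

	"k8s.io/apimachinery/pkg/api/resource"
)

// TestRealWorldMemoryCapacitiesAreSimilar checks that two real-world memory
// capacities for the same instance type fall within the 256Mi tolerance.
func TestRealWorldMemoryCapacitiesAreSimilar(t *testing.T) {
	a := resource.MustParse("16116152Ki") // m5.xlarge in one availability zone
	b := resource.MustParse("15944120Ki") // m5.xlarge in another availability zone

	diff := absSub(a, b)
	if diff.Cmp(MaxMemoryDifference) > 0 {
		t.Errorf("difference %s exceeds tolerance %s",
			diff.String(), MaxMemoryDifference.String())
	}
}
```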