
Allow small tolerance on memory capacity when comparing nodegroups #3124

Merged
merged 1 commit into kubernetes:master from the memory-tolerance-quantity branch on Jun 24, 2020

Conversation

@JoelSpeed (Contributor) commented May 11, 2020

This allows a small tolerance in the memory capacity of nodes to enable better matching of similar node groups. The memory values that Kubernetes reports can differ due to variances in the instances that a cloud provider provides.

It also adds tests that use real values from a set of nodes that would be expected to be treated as the same (the same instance type across multiple availability zones within a given region).

E.g. in testing I saw AWS m5.xlarge nodes with capacities such as 16116152Ki and 15944120Ki, not only across availability zones but also within the same availability zone after a few cycles through machines. This is a difference of 168Mi, which is much larger than the original tolerance of 128000 bytes and was preventing BalanceSimilarNodeGroups from balancing across these availability zones.
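For illustration only (this snippet is not part of the PR), the arithmetic behind the quoted capacities can be reproduced with the resource package from k8s.io/apimachinery:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Capacities observed on two m5.xlarge nodes, as quoted above.
    a := resource.MustParse("16116152Ki")
    b := resource.MustParse("15944120Ki")

    diff := a.Value() - b.Value() // difference in bytes
    ratio := float64(diff) / float64(a.Value())

    fmt.Printf("absolute difference: %d bytes (%dMi)\n", diff, diff/(1024*1024))
    fmt.Printf("relative difference: %.2f%%\n", ratio*100)
    // The ~168Mi absolute difference dwarfs the old fixed 128000-byte tolerance,
    // while the relative difference is only ~1.07%, well inside a 1.5% ratio.
}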

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 11, 2020

var (
    // MaxMemoryDifference describes how much memory capacity can differ but still be considered equal.
    MaxMemoryDifference = resource.MustParse("256Mi")
@enxebre (Member) commented May 11, 2020

Moving discussion from openshift#152 (comment).

Wouldn't this MaxMemoryDifference also apply to much smaller instances, to the point of making the check lose its value?
I.e. if the possible diff range increases with the instance size, should we maybe make our tolerance window a percentage of the given total size?

@JoelSpeed (Contributor Author)

Yes, it would. For this example the difference was just over 1%, so perhaps making it percentage-based would be better. I think we will need to investigate a few more real-world examples to come up with a sensible way to do this.
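To make the concern concrete, here is a small illustrative snippet (editorial, not part of the PR; the node sizes are arbitrary examples) showing how a fixed 256Mi tolerance scales with node size:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // The fixed tolerance under discussion, expressed against a few example node sizes.
    fixed := resource.MustParse("256Mi")
    for _, size := range []string{"8Gi", "64Gi", "256Gi"} {
        total := resource.MustParse(size)
        pct := float64(fixed.Value()) / float64(total.Value()) * 100
        fmt.Printf("256Mi is %.2f%% of a %s node\n", pct, size)
    }
    // Prints roughly 3.12%, 0.39% and 0.10%: a fixed value is loose on small
    // nodes and tight on large ones, which motivates a percentage-based window.
}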

@enxebre (Member) commented May 13, 2020

This lgtm. Can we drop the first commit and include the reasoning about choosing the 0.015 ratio in the final commit description?

@JoelSpeed JoelSpeed force-pushed the memory-tolerance-quantity branch from 2595cee to 25cf22c Compare May 13, 2020 09:56
@enxebre (Member) commented May 13, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2020
@JoelSpeed JoelSpeed changed the title Use quantities for memory capacity differences Allow small tolerance on memory capacity when comparing nodegroups May 13, 2020
@elmiko (Contributor) left a comment

this lgtm, thanks Joel!
/approve

@elmiko (Contributor) commented May 13, 2020

hehe, guess i'm not an approver on this part. so i'll just add
/lgtm

=)

@wwentland commented May 14, 2020

We encountered unbalanced scaling on AWS EKS due to memory differences in the underlying nodes.

I backported the changes in this PR to 1.16.5 which allowed CA to balance scaling events between all nodegroups with similar instance types (as expected).

@JoelSpeed (Contributor Author)

CC @Jeffwan @jaypipes, it was suggested to me that you may have opinions on this PR.

@Jeffwan (Contributor) commented May 16, 2020

I am thinking about the potential diff for large instances. Could you explain the magic number you use here? I think it's more for a specific case? Is there a way to make sure it covers most of the cases? Sorry, I don't have a lot of data points right now. @JoelSpeed

@JoelSpeed (Contributor Author)

@Jeffwan I posted some details in a comment on a related issue about some observations I had made.

In particular, the main culprits for memory differences are m5, c5 and r5 instances, based on my experimenting and conversations with others. The differences I have managed to observe are approximately 1%, independent of the size of the instances (e.g. I saw a 1.05% difference on an instance with 256Gi of memory). I chose 1.5% to allow a small tolerance over the differences I was seeing, but I didn't want to make it the 5% difference that the other values allow, as that seemed massively overkill for the real-world results.

Having spoken to an engineer from AWS, they told me that they suspect the issue comes from the fact that some 4th and 5th gen instance types are actually a mixture of hardware specs with slight differences between them. This, coupled with the Nitro hypervisor on the 5th gen instances, can accentuate the apparent differences in the hardware to the OS (it gives a more realistic picture, I think). Unfortunately, due to the way that AWS is working with these instance types, you can see this approximate 1% difference in memory across instances of the same type, both within and across availability zones.

A potential further fix would be to not compare memory/cpu etc. for nodegroups backed by the same instance type, but that feels like it would lead to other issues.
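To put the 1.5% figure above in concrete terms (computed here for illustration, not stated in the PR): on a node with 256Gi of memory, a 1.5% ratio permits a difference of about 3.84Gi, the observed ~1.05% difference corresponds to about 2.69Gi, and a 5% tolerance would permit about 12.8Gi.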

@Jeffwan (Contributor) commented May 18, 2020

@JoelSpeed I think it makes sense to me and also looks good to me on AWS. However, this logic is shared and used by other providers as well. I'm not sure whether it's acceptable to everyone else? Another option is to rewrite this as a comparator in https://github.com/kubernetes/autoscaler/blob/972e30a5d9eece175a54fa5dfc0ed902b34f02b1/cluster-autoscaler/processors/nodegroupset/aws_nodegroups.go.

@losipiuk @aleksandra-malinowska @vivekbagade opinions?

@JoelSpeed (Contributor Author)

I'm not sure whether it's acceptable to everyone else? Another option is to rewrite this as a comparator in https://github.com/kubernetes/autoscaler/blob/972e30a5d9eece175a54fa5dfc0ed902b34f02b1/cluster-autoscaler/processors/nodegroupset/aws_nodegroups.go.

That is an alternative, but I'm actually coming at this from the perspective of the clusterapi provider. We would need to maintain an independent comparator for CAPI that is basically the same as the standard one, but with this one minor difference to cater for the AWS case. If that's what we must do then I can do that, but I'd prefer not to have a separate CAPI comparator.

Also, I'm not sure how that would work, since the comparator presently just changes the labels; we would have to change a lot to allow this difference. Plumbing through tolerances for different providers may work?

@Jeffwan (Contributor) commented May 19, 2020

@JoelSpeed

Yeah, I don't see data points from other cloud providers. If there's a big concern, we might need to move to an AWS comparator. If not, let's keep the current implementation; I think it makes sense.

@JoelSpeed (Contributor Author)

I mentioned this PR on the community call earlier today; it was suggested that I ping approvers from the other providers to see if they have any strong opposition to this. Starting a lazy consensus: if there is no opposition to this by end of day on the 15th, then we should seek approval and get this merged.

Ping @ringtail @feiskyer @nilo19 @marwanad @hello2mao @andrewsykim @tghartland, do you have concerns about this?

@ringtail (Contributor)

I agree. But I think there should also be a magic number for GPU.

@ringtail (Contributor)

So should we set up a better way to solve this kind of problem? A magic number may not be the best choice.

@JoelSpeed (Contributor Author)

@ringtail Why would there be a magic number for GPU? I haven't got much experience with them personally, but I assumed the number of GPUs would always be a small discrete number. Are you suggesting there can be 1.05 GPUs attached to a system?

    }
    larger := math.Max(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
    smaller := math.Min(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
    if larger-smaller > larger*maxDifferenceRatio {
Reviewer (Contributor)

nit: return larger-smaller <= larger * maxDifferenceRatio ?

            return false
        }
    }
    return true
}

func compareResourceListWithTolerance(qtyList []resource.Quantity, maxDifferenceRatio float64) bool {
Reviewer (Contributor)

It's a minor detail, but for a function that returns a bool I'd prefer a name that makes it obvious what true/false means.

@JoelSpeed (Contributor Author)

I was trying to stay consistent with compareResourceMapsWithTolerance, though I agree it's not a great name. If I rename one, should I rename the other too? How about resourceMapsWithinTolerance and resourceListWithinTolerance?

Reviewer (Contributor)

Fair point on consistency. Renaming both sgtm.
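Applying both review suggestions (the rename from the thread above and the direct boolean return from the nit) gives a helper roughly like the sketch below. This is an editorial approximation of the agreed shape, not a verbatim copy of the merged code; the package name is assumed.

package nodegroupset

import (
    "math"

    "k8s.io/apimachinery/pkg/api/resource"
)

// resourceListWithinTolerance reports whether the two quantities in qtyList
// differ by no more than maxDifferenceRatio of the larger value.
func resourceListWithinTolerance(qtyList []resource.Quantity, maxDifferenceRatio float64) bool {
    if len(qtyList) != 2 {
        return false
    }
    larger := math.Max(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
    smaller := math.Min(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
    return larger-smaller <= larger*maxDifferenceRatio
}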

@MaciekPytel (Contributor)

As discussed at the SIG meeting, I'm fine with this change and will approve it once lazy consensus is reached.

Re: GPUs - I also thought GPU number is discrete?

Also, in general this part of the logic is designed to be easily replaceable if a provider has specific requirements. There are already implementations of NodeInfoComparator for AWS and Azure. They both call IsCloudProviderNodeInfoSimilar with different parameters, but that is not a hard requirement. I don't see a strong reason why we couldn't have more customized comparators if needed.
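To illustrate the pattern described here, below is a self-contained toy (hypothetical types and names throughout; the real autoscaler uses NodeInfoComparator and IsCloudProviderNodeInfoSimilar over scheduler NodeInfo objects, with different signatures): a provider builds a comparator closure that delegates to a shared similarity check with its own ignored labels and tolerance.

package main

import "fmt"

// nodeInfo is a stand-in for the scheduler NodeInfo used by the real comparators.
type nodeInfo struct {
    labels   map[string]string
    memoryKi int64
}

// nodeInfoComparator mirrors the shape of the autoscaler's NodeInfoComparator.
type nodeInfoComparator func(a, b nodeInfo) bool

// similarWithTolerance is a simplified similarity check: labels must match
// except for the ignored ones, and memory must be within memRatio of the larger value.
func similarWithTolerance(a, b nodeInfo, ignoredLabels map[string]bool, memRatio float64) bool {
    for k, v := range a.labels {
        if ignoredLabels[k] {
            continue
        }
        if b.labels[k] != v {
            return false
        }
    }
    larger, smaller := a.memoryKi, b.memoryKi
    if smaller > larger {
        larger, smaller = smaller, larger
    }
    return float64(larger-smaller) <= float64(larger)*memRatio
}

// newProviderComparator returns a comparator with provider-specific settings,
// following the pattern of the existing AWS and Azure comparators.
func newProviderComparator(ignoredLabels map[string]bool, memRatio float64) nodeInfoComparator {
    return func(a, b nodeInfo) bool {
        return similarWithTolerance(a, b, ignoredLabels, memRatio)
    }
}

func main() {
    cmp := newProviderComparator(map[string]bool{"topology.kubernetes.io/zone": true}, 0.015)
    a := nodeInfo{labels: map[string]string{"topology.kubernetes.io/zone": "us-east-1a"}, memoryKi: 16116152}
    b := nodeInfo{labels: map[string]string{"topology.kubernetes.io/zone": "us-east-1b"}, memoryKi: 15944120}
    fmt.Println(cmp(a, b)) // true: the zone label is ignored and memory differs by ~1.07%
}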

In testing, AWS M5 instances can on occasion display approximately a 1% difference
in memory capacity between availability zones, even when deployed with the same
launch configuration and the same AMI.
Allow a 1.5% tolerance to give some buffer over the actual memory discrepancy,
since in testing some examples were just over 1% (e.g. 1.05%, 1.1%).
Tests are included with capacity values taken from real instances to prevent future
regression.
@JoelSpeed JoelSpeed force-pushed the memory-tolerance-quantity branch from 25cf22c to be1d9cb Compare June 10, 2020 11:00
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2020
@JoelSpeed (Contributor Author)

@MaciekPytel I resolved your two comments, PTAL

@DWSR commented Jun 23, 2020

Will this be backported to 1.16, as that's the latest version supported on EKS?

@JoelSpeed (Contributor Author)

@MaciekPytel @elmiko Can we get this merged now? There have been no objections in the last two weeks.

@elmiko (Contributor) commented Jun 23, 2020

@JoelSpeed thanks for the reminder, i believe we decided at the meeting 2 weeks ago to give this 1 week lazy consensus, so yeah we should move forward.
/approve

@elmiko (Contributor) commented Jun 23, 2020

oh, oops. forgot i can't approve here.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2020
@MaciekPytel (Contributor)

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2020
@k8s-ci-robot k8s-ci-robot merged commit 67dce2e into kubernetes:master Jun 24, 2020
@JoelSpeed JoelSpeed deleted the memory-tolerance-quantity branch June 30, 2020 09:08
@JoelSpeed (Contributor Author) commented Jun 30, 2020

/cherry-pick release-1.18

This did not seem to work; I guess we have to do it manually.

wwentland pushed a commit to wwentland/autoscaler that referenced this pull request Jul 26, 2020
wwentland pushed a commit to wwentland/autoscaler that referenced this pull request Jul 26, 2020
wwentland pushed a commit to wwentland/autoscaler that referenced this pull request Jul 26, 2020
k8s-ci-robot added a commit that referenced this pull request Jul 27, 2020
[CA-1.17] Cherry-pick #3124: Allow small tolerance on memory capacity when comparing nodegroups
k8s-ci-robot added a commit that referenced this pull request Jul 27, 2020
[CA-1.18] Cherry-pick #3124: Allow small tolerance on memory capacity when comparing nodegroups
k8s-ci-robot added a commit that referenced this pull request Jul 27, 2020
Cherry-pick #3124: Allow small tolerance on memory capacity when comparing nodegroups
colin-welch pushed a commit to Paperspace/autoscaler that referenced this pull request Mar 5, 2021
@elmiko (Contributor) commented Feb 21, 2023

i think this problem might be rearing its head again, i am seeing balance failures when using the clusterapi / azure provider combination, it's rejecting due to memory differences.
