Allow small tolerance on memory capacity when comparing nodegroups #3124
Conversation
var (
	// MaxMemoryDifference describes how much memory capacity can differ but still be considered equal.
	MaxMemoryDifference = resource.MustParse("256Mi")
Moving discussion from openshift#152 (comment).
Wouldn't this MaxMemoryDifference also apply to much smaller instances, to the point of making the check lose its value?
i.e. if the possible diff range increases with the instance size, should we maybe make our tolerance window a percentage of the given total size?
Yes it would. For this example the difference was just over 1%, so perhaps making it percentage-based would be better. I think we will need to investigate a few more real-world examples to come up with a sensible way to do this.
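To illustrate the concern with a quick standalone sketch (the node sizes below are assumed examples, not values from this PR): a fixed 256Mi tolerance is a very different fraction of capacity depending on node size, whereas a ratio scales with it.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Fixed absolute tolerance from the diff above.
	fixedTolerance := resource.MustParse("256Mi")
	// Assumed example node sizes, small to large.
	for _, capacity := range []string{"2Gi", "16Gi", "256Gi"} {
		total := resource.MustParse(capacity)
		fraction := float64(fixedTolerance.Value()) / float64(total.Value())
		fmt.Printf("%s node: 256Mi is %.2f%% of capacity\n", capacity, fraction*100)
	}
	// Prints roughly 12.50%, 1.56% and 0.10% respectively, which is why a
	// percentage-based tolerance behaves more consistently across sizes.
}
```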
This lgtm. Can we drop the first commit and include the reasoning about choosing the 0.015 ratio in the final commit description?
Force-pushed from 2595cee to 25cf22c.
/lgtm
this lgtm, thanks Joel!
/approve
hehe, guess i'm not an approver on this part. so i'll just add =)
We encountered unbalanced scaling on AWS EKS due to memory differences in the underlying nodes. I backported the changes in this PR to
I am thinking about the potential diff for large instances. Could you explain the magic number you used here? I think it's more for a specific case? Is there a way to make sure it covers most cases? Sorry, I don't have a lot of data points right now. @JoelSpeed
@Jeffwan I posted some details in a comment on a related issue about some observations I had made. In particular, the main culprits for memory differences are m5, c5 and r5 instances, based on my experimenting and conversations with others. The differences I have managed to observe are approximately 1%, independent of the size of the instances (eg I saw a 1.05% difference on an instance with 256Gi of memory). I chose 1.5% to allow a small tolerance over the differences I was seeing, but I didn't want to make it the 5% difference that the other values allow, as this seemed massively overkill for the real-world results.

Having spoken to an engineer from AWS, they told me they suspect the issue comes from the fact that some 4th and 5th gen instance types are actually a mixture of hardware specs with slight differences between them. This, coupled with the Nitro hypervisor on the 5th gen instances, can accentuate the apparent differences in the hardware to the OS (it gives a more realistic picture, I think). Unfortunately, due to the way that AWS is working with these instance types, you can see this approximate 1% difference in memory across instances of the same type, both within and across availability zones.

A potential further fix would be to not compare memory/cpu etc for nodegroups backed by the same instance type, but that feels like it will lead to other issues.
@JoelSpeed I think it makes sense to me and also looks good to me on AWS. However, this logic is shared and used by others as well. Not sure if it's acceptable to everyone else? Another option is to rewrite this as a comparator in https://github.com/kubernetes/autoscaler/blob/972e30a5d9eece175a54fa5dfc0ed902b34f02b1/cluster-autoscaler/processors/nodegroupset/aws_nodegroups.go. @losipiuk @aleksandra-malinowska @vivekbagade opinions?
That is an alternative, but I'm actually coming at this from the perspective of the clusterapi provider. We would need to maintain an independent comparator for CAPI that is basically the same as the standard one, but with this one minor difference to cater for the AWS case. If that's what we must do then I can do that, but I'd prefer not to have a separate CAPI comparator.

Also, I'm not sure how that would work, since the comparator presently just changes the labels; we would have to change a lot to allow this difference. Plumbing through tolerances for different providers may work?
Yeah, I don't see data points from other cloud providers. If there's a big concern, we might need to move to an AWS comparator. If not, let's keep the current implementation; I think it makes sense.
I mentioned this PR on the community call earlier today, and it was suggested that I ping approvers from the other providers to see if they have any strong opposition to this. Starting a lazy consensus: if there is no opposition to this by end of day on the 15th, then we should seek approval and get this merged.

Ping @ringtail @feiskyer @nilo19 @marwanad @hello2mao @andrewsykim @tghartland, do you have concerns about this?
I agree. But I think there should also be a magic number for GPU.
So should we set up a better way to solve such a problem? A magic number may not be the best choice.
@ringtail Why would there be a magic number for GPU? I haven't got much experience with them personally, but I assumed the number of GPUs would always be a small discrete number. Are you suggesting there can be 1.05 GPUs attached to a system?
		}
		larger := math.Max(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
		smaller := math.Min(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
		if larger-smaller > larger*maxDifferenceRatio {
nit: `return larger-smaller <= larger*maxDifferenceRatio`?
			return false
		}
	}
	return true
}
func compareResourceListWithTolerance(qtyList []resource.Quantity, maxDifferenceRatio float64) bool {
It's a minor detail, but for a function that returns a bool I'd prefer a name that makes it obvious what true/false means.
I was trying to keep consistent with `compareResourceMapsWithTolerance`, though I agree, it's not a great name. If I rename one, should I rename the other too? How about `resourceMapsWithinTolerance` and `resourceListWithinTolerance`?
Fair point on consistency. Renaming both sgtm.
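Folding the two suggestions together (the rename and the earlier nit about returning the comparison directly), the helpers might end up looking roughly like the sketch below. The map function's signature is assumed from context; this is not necessarily the exact merged code.

```go
package nodegroupset

import (
	"math"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// resourceMapsWithinTolerance reports whether every resource in the map is
// within the allowed ratio between the two node groups being compared.
func resourceMapsWithinTolerance(resources map[apiv1.ResourceName][]resource.Quantity,
	maxDifferenceRatio float64) bool {
	for _, qtyList := range resources {
		if !resourceListWithinTolerance(qtyList, maxDifferenceRatio) {
			return false
		}
	}
	return true
}

// resourceListWithinTolerance expects exactly two quantities (one per node
// group) and checks that their relative difference does not exceed the ratio.
func resourceListWithinTolerance(qtyList []resource.Quantity, maxDifferenceRatio float64) bool {
	if len(qtyList) != 2 {
		return false
	}
	larger := math.Max(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
	smaller := math.Min(float64(qtyList[0].MilliValue()), float64(qtyList[1].MilliValue()))
	return larger-smaller <= larger*maxDifferenceRatio
}
```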
As discussed on the sig meeting, I'm fine with this change and I will approve it once lazy consensus is reached.

Re: GPUs - I also thought the GPU number is discrete?

Also, in general this part of the logic is designed to be easily replaceable if a provider has specific requirements. There are already implementations of NodeInfoComparator for AWS and Azure. They both call IsCloudProviderNodeInfoSimilar with different parameters, but that is not a hard requirement. I don't see a strong reason why we couldn't have more customized comparators if needed.
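To make the "easily replaceable" point concrete, here is a purely illustrative, self-contained sketch of the pattern. Everything in it is hypothetical (lower-cased stand-ins for NodeInfoComparator and the shared similarity check mentioned above), not the repository's actual API.

```go
package main

import "fmt"

// nodeInfo is a stand-in for the scheduler NodeInfo type that the real
// comparators operate on; it only exists to keep this sketch self-contained.
type nodeInfo struct {
	memoryBytes int64
}

// nodeInfoComparator mirrors the idea of NodeInfoComparator: it reports
// whether two node groups should be treated as similar.
type nodeInfoComparator func(n1, n2 *nodeInfo) bool

// similarWithTolerance is a hypothetical shared check, parameterized by the
// maximum allowed capacity difference ratio (label handling omitted).
func similarWithTolerance(n1, n2 *nodeInfo, maxDifferenceRatio float64) bool {
	larger, smaller := n1.memoryBytes, n2.memoryBytes
	if smaller > larger {
		larger, smaller = smaller, larger
	}
	return float64(larger-smaller) <= float64(larger)*maxDifferenceRatio
}

// newProviderComparator shows the pattern: a provider plugs in its own
// tolerance (e.g. a looser memory tolerance for AWS) without touching the
// shared default used by everyone else.
func newProviderComparator(maxDifferenceRatio float64) nodeInfoComparator {
	return func(n1, n2 *nodeInfo) bool {
		return similarWithTolerance(n1, n2, maxDifferenceRatio)
	}
}

func main() {
	awsLike := newProviderComparator(0.015)
	strict := newProviderComparator(0.0)
	a := &nodeInfo{memoryBytes: 16116152 * 1024}
	b := &nodeInfo{memoryBytes: 15944120 * 1024}
	fmt.Println("1.5% tolerance:", awsLike(a, b)) // true
	fmt.Println("zero tolerance:", strict(a, b))  // false
}
```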
In testing, AWS M5 instances can on occasion display approximately a 1% difference in memory capacity between availability zones, even when deployed with the same launch configuration and the same AMI. Allow a 1.5% tolerance to give some buffer over the actual amount of memory discrepancy, since in testing some examples were just over 1% (eg 1.05%, 1.1%). Tests are included with capacity values taken from real instances to prevent future regression.
Force-pushed from 25cf22c to be1d9cb.
@MaciekPytel I resolved your two comments, PTAL
Will this be backported to
@MaciekPytel @elmiko Can we get this merged now? There have been no objections in the last two weeks.
@JoelSpeed thanks for the reminder, i believe we decided at the meeting 2 weeks ago to give this 1 week lazy consensus, so yeah we should move forward.
oh, oops. forgot i can't approve here.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, MaciekPytel

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/cherry-pick release-1.18

This did not seem to work, guess we have to manually do it.
[CA-1.17] Cherry-pick #3124: Allow small tolerance on memory capacity when comparing nodegroups
[CA-1.18] Cherry-pick #3124: Allow small tolerance on memory capacity when comparing nodegroups
Cherry-pick #3124: Allow small tolerance on memory capacity when comparing nodegroups
i think this problem might be rearing its head again, i am seeing balance failures when using the clusterapi / azure provider combination, it's rejecting due to memory differences.
This allows a small tolerance in the memory capacity of nodes to allow better matching of similar node groups. There are differences in the memory values that Kubernetes interprets due to variances in the instances that a cloud provider provides.

Also adds tests that match real values from a real set of nodes that would be expected to be the same (the same instance type across multiple availability zones within a given region).

Eg. In testing I saw AWS m5.xlarge nodes with capacities such as `16116152Ki` and `15944120Ki`, not only across availability zones, but within the same availability zone after a few cycles through machines. This is a difference of `168Mi`, which is much larger than the original tolerance of 128000 bytes and was therefore preventing `BalanceSimilarNodeGroups` from balancing across these availability zones.
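As a quick check of those numbers, here is a standalone sketch; only the two Ki capacities above come from the PR description, everything else is illustrative.

```go
package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Real capacities observed on m5.xlarge nodes, quoted above.
	a := resource.MustParse("16116152Ki")
	b := resource.MustParse("15944120Ki")

	larger := math.Max(float64(a.Value()), float64(b.Value()))
	smaller := math.Min(float64(a.Value()), float64(b.Value()))
	diff := larger - smaller

	// Prints: difference 168Mi, ratio ~1.07%.
	fmt.Printf("difference: %.0fMi, ratio: %.2f%%\n", diff/(1024*1024), diff/larger*100)

	// The old fixed tolerance of 128000 bytes rejects this pair, while the
	// new 1.5% ratio accepts it.
	fmt.Println("within old 128000-byte tolerance:", diff <= 128000)
	fmt.Println("within new 1.5% ratio:", diff <= larger*0.015)
}
```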