fix: Skip 'hugepages' resources on ceiling #206
Conversation
@@ -51,6 +51,9 @@ func Merge(resources ...v1.ResourceList) v1.ResourceList {
 	result := make(v1.ResourceList, len(resources[0]))
 	for _, resourceList := range resources {
 		for resourceName, quantity := range resourceList {
+			if resourceName == "hugepages-2Mi" || resourceName == "hugepages-1Gi" {
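For readers skimming the hunk, this is the condition the PR adds inside Merge's inner loop. A minimal sketch of the full skip, assuming the remaining added lines (not visible above) are simply a continue and the closing brace:

```go
// Sketch: skip hugepages entries so they never contribute to the merged ceiling.
if resourceName == "hugepages-2Mi" || resourceName == "hugepages-1Gi" {
	continue // assumed body; only the condition is shown in the hunk
}
```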
We had a similar ask here, but we had to reject the idea because ignoring the resource means that we can't know the best instance type to launch your workload pod on. You can imagine that we launch an instance type which doesn't actually support the resource, or that we schedule a node that doesn't have enough of the resource to support your pod. Ideally, we would be able to return the exact resources each InstanceType has through the GetInstanceTypes API.
@jonathan-innis Thanks for the review.
Regarding -
You can imagine that we launch an instance type which doesn't actually support the resource
That's what I tried to describe above: a resource such as hugepages isn't a HW-based (instance-type-based) resource but a SW-based one, i.e. you can never know from the instance type alone whether your node supports the resource.
In any case, how can that behavior be achieved? We want to migrate to Karpenter, but we have to support the 'hugepages' resource.
Thank you.
It is a SW-based component, but I believe that it scales based on the size of the instance, which means it's difficult to achieve a fixed sizing that we could pass through. Is this something that should be consistent on your kernel across all instance types or just the few that you've associated with a given image and NodeTemplate?
The second option.
@jonathan-innis, since it's a specific request for a specific instance type (DL1) which is always needed (there is no point in running a Habana Labs SW pod on DL1 without hugepages; it will not work), what if we add this condition/ignore case only for the DL1 instance type? Would that be OK?
Is it safe to assume hugepages is some large number in the instance type provider? Is this something that we should specify in the AWSNodeTemplate or Provisioner.spec.resources?
@ellistarn Generally speaking: Provisioner.spec.resources.
Hi @DanielJuravski, we definitely see supporting extended resources as a priority prior to v1 of Karpenter (we have the issue tracked here for reference: #751); however, the Karpenter maintainer team is currently prioritizing other work before addressing the extended resource issue. We're planning to publish an RFC that addresses extended resources later down the line, but since we are unsure how this is going to look in the near future, we're not going to be able to accept this PR right now.
Closing in favor of #751
Fixes aws/karpenter-provider-aws#3315
Description
Following the issue above, this fix/feature resolves the described resource incompatibility.
Karpenter manages nodes/instance types based on the resource keys/values it gets from the cloud provider, i.e. HW resources (cpu, memory, gpu, etc.). It cannot manage SW resources such as hugepages (https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-types), because they are instance-based rather than cloud-provider-based.
Karpenter simply isn't familiar with that resource type (and fails to handle it when it appears in a workload.yaml), which is why I drop/filter this resource, as it can't be handled via Karpenter anyway.
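To illustrate, here is a minimal sketch of how the filtering sits inside Merge. The package/import lines and the accumulation body are reconstructed assumptions, not copied from the visible hunk, and the prefix check on "hugepages-" is a generalization of the two hard-coded page sizes in the PR:

```go
package resources

import (
	"strings"

	v1 "k8s.io/api/core/v1"
)

// Merge sums the quantities of each resource name across the given lists,
// skipping hugepages entries so they never influence the computed ceiling.
// Sketch only: the accumulation logic is reconstructed from the context
// lines of the diff, not copied verbatim.
func Merge(resources ...v1.ResourceList) v1.ResourceList {
	if len(resources) == 0 {
		return v1.ResourceList{}
	}
	result := make(v1.ResourceList, len(resources[0]))
	for _, resourceList := range resources {
		for resourceName, quantity := range resourceList {
			// Drop SW-based hugepages resources of any page size.
			if strings.HasPrefix(string(resourceName), "hugepages-") {
				continue
			}
			current := result[resourceName]
			current.Add(quantity)
			result[resourceName] = current
		}
	}
	return result
}
```

Checking the "hugepages-" prefix rather than listing individual sizes would also cover other page sizes, but that is only a suggestion, not what this PR currently does.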
Please tell me if this change is not in the right place or if it might break something.
How was this change tested?
I created a cluster, applied a Provisioner resource, built a karpenter-controller image using my karpenter-core change, and helm-upgraded the cluster. Afterwards, I applied the YAML presented in the issue above, and a node was allocated!
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.