fix: permit custom device requests #2161
Conversation
@ellistarn I created a suggested fix as requested - tested it in our lab, where it works and solves our use case with custom devices. Of course, this is a naive implementation, as I am new to the project.
Thanks for the PR @universam1. The main concern I'd have with this feature is that it's possible to get into scheduling loops if the node fails to populate unknown custom resources. We need to make sure that both the scheduling.Node and scheduling.InFlightNode agree about the schedulability of the pod. If for some reason the custom resources don't get advertised on the node, we either need to terminate the node (via initialization failure, https://github.com/aws/karpenter/blob/main/pkg/controllers/node/initialization.go#L72) or leave the node around (and the pods would stay stuck pending). I'm curious what @tzneal thinks about this. As mentioned on the issue, I'd really like to explore the design space of doing this properly (via a configmap or CRD). Perhaps @billrayburn can help us with the roadmap on this.
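To make the first option concrete, here is a minimal hypothetical sketch in plain Go - not the actual logic in pkg/controllers/node/initialization.go - of treating a node whose expected extended resource never appears as an initialization failure and terminating it rather than leaving pods pending:

package main

import (
	"fmt"
	"time"
)

type node struct {
	createdAt   time.Time
	allocatable map[string]int64 // resource name -> advertised quantity
}

// initializationFailed reports whether the node exceeded its startup window
// without advertising every extended resource the provisioner counted on.
// The window and the expected-resource list are assumptions for illustration.
func initializationFailed(n node, expected []string, startupWindow time.Duration) bool {
	if time.Since(n.createdAt) < startupWindow {
		return false // still give the device plugin time to register
	}
	for _, name := range expected {
		if _, ok := n.allocatable[name]; !ok {
			return true // e.g. hardware failure, or the device-plugin daemonset never ran
		}
	}
	return false
}

func main() {
	n := node{
		createdAt:   time.Now().Add(-20 * time.Minute),
		allocatable: map[string]int64{"cpu": 4000},
	}
	if initializationFailed(n, []string{"smarter-devices/fuse"}, 15*time.Minute) {
		fmt.Println("terminate node: expected extended resource never appeared")
	}
}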
Force-pushed from d03dd68 to e25be74
Karpenter is negative towards custom device requests it is unaware of, assuming those cannot be scheduled. Fixes #1900. This changes the request handling to be scoped only to resource requests that Karpenter is aware of and actively manages. The reasoning is that Karpenter cannot influence those resource requests anyway; they come into existence by means of other mechanisms, such as the device-plugin manager, which may even be late-bound, and thus they are out of scope.
Force-pushed from e25be74 to b0a4520
Thank you @ellistarn for reviewing and for the context you provided - much appreciated, also the hint about scheduling.InFlightNode. This is now improved and I added a unit test. Confirmed that it works in our lab clusters.
Valid point, however I cannot see how this differs from a whitelisted (CRD) concept for this issue. If a request is whitelisted and the resource is not populated, the same potential scheduling loop would arise, no? Isn't this already the case for GPU requests, where this can happen too?
If I may reason that properly™ might look different from the operator's perspective:
Ultimately, this missing support for custom devices is a blocking issue for us and we are happy to help get it solved, thanks!
IIUC, this is because cluster autoscaler works by discovering extended resources on node groups from preexisting nodes. If you want to scale from zero (which karpenter does), you have to annotate this using ASG tags.
Do you have a pointer to this? I'm less familiar.
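For reference, this is the ASG tag convention that, to my understanding, cluster autoscaler uses to learn about extended resources when scaling a node group from zero, shown as a small illustrative Go snippet (the fuse resource name is just the example from this thread):

package main

import "fmt"

func main() {
	// Assumed convention: cluster-autoscaler node-template tags on the ASG
	// advertise extended resources that no live node exists to report yet.
	asgTags := map[string]string{
		"k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse": "1",
	}
	for key, value := range asgTags {
		fmt.Printf("%s=%s\n", key, value)
	}
}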
		},
	}))
	// requesting a resource of quantity zero of a type unsupported by any instance is fine
	ExpectScheduled(ctx, env.Client, pod[0])
})
It("should be neutral regarding custom, unknown resource requests", func() {
It("should be neutral regarding custom, unknown resource requests", func() { | |
It("should ignore unknown resource requests", func() { |
I think this could be achieved by modifying the
As mentioned above, the other drawback I can think of is that we can't track node initialization of extended resources, so we will be unable to automatically fix (by terminating) a node that comes online but cannot register its extended resource due to a hardware failure or similar. Additionally, you'll need to ensure that your
The other drawback I just realized is that if we consider existing or in-flight capacity to have an infinite amount of a custom resource, Karpenter will fail to provision beyond the first node (unless other resources on the node are full). If we assume infinite for new nodes and zero for existing ones, then Karpenter will launch nodes as fast as it can while the pod remains pending. I don't see a maintainable path forward with the assumption as proposed, as there are enough common cases that break. I'd like to pursue an instance type override approach, but this has some caveats with multi-OS clusters. It may be worth bringing this idea to the working group to discuss further.
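To illustrate the two failure modes with a toy fit check (hypothetical names, not Karpenter's actual scheduler code):

package main

import "fmt"

type node struct {
	allocatable map[string]int64
}

// fits reports whether the requests fit on the node; unknownIsUnlimited decides
// how to treat resources the node does not (yet) advertise.
func fits(n node, requests map[string]int64, unknownIsUnlimited bool) bool {
	for name, requested := range requests {
		available, known := n.allocatable[name]
		if !known {
			if unknownIsUnlimited {
				continue // assume it will appear later (device plugin, etc.)
			}
			return false // assume zero: the pod can never fit here
		}
		if requested > available {
			return false
		}
	}
	return true
}

func main() {
	inFlight := node{allocatable: map[string]int64{"cpu": 4000}}
	requests := map[string]int64{"cpu": 100, "smarter-devices/fuse": 1}

	// Unlimited assumption: every pending replica "fits" the first node,
	// so no further capacity gets provisioned even if the device plugin caps out.
	fmt.Println(fits(inFlight, requests, true)) // true

	// Zero assumption: the pod never fits any node Karpenter already knows about,
	// so it keeps launching new nodes while the pod stays pending.
	fmt.Println(fits(inFlight, requests, false)) // false
}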
I understand that it might not be the perfect solution, but it certainly unblocks the showstopper for now. In our case we are running this fork in production clusters; otherwise we'd have to roll back from Karpenter.
What time frame are you considering, since this is a blocking issue for certain users?
Can you share some setup instructions so I can try to reproduce?
EKS cluster, containerd node runtime. Run the smarter-device-manager DaemonSet to provide the fuse device. Run an unprivileged podman pod with a device request for /dev/fuse. Build a container inside the podman pod.
Any YAML / links you can share would help me get to this faster :D
Sure @ellistarn, this would be the test setup:
spec file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: smarter-device-manager
  namespace: kube-system
data:
  conf.yaml: |
    - devicematch: ^fuse$
      nummaxdevices: 1000
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: smarter-device-manager
  namespace: kube-system
  labels:
    name: smarter-device-manager
spec:
  selector:
    matchLabels:
      name: smarter-device-manager
  template:
    metadata:
      labels:
        name: smarter-device-manager
    spec:
      hostname: smarter-device-management
      hostNetwork: true
      containers:
        - name: smarter-device-manager
          image: registry.gitlab.com/arm-research/smarter/smarter-device-manager:v1.20.10
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: dev-dir
              mountPath: /dev
            - name: sys-dir
              mountPath: /sys
            - name: config
              mountPath: /root/config
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev-dir
          hostPath:
            path: /dev
        - name: sys-dir
          hostPath:
            path: /sys
        - name: config
          configMap:
            name: smarter-device-manager
---
apiVersion: v1
kind: Pod
metadata:
  name: testpod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.sh/provisioner-name
                operator: Exists
  containers:
    - name: testpod
      image: quay.io/podman/stable:latest
      command: ["/bin/sh"]
      args:
        - -c
        - >-
          id &&
          echo 'FROM alpine' >/tmp/Dockerfile &&
          echo 'RUN id' >>/tmp/Dockerfile &&
          echo 'RUN apk add -U curl' >>/tmp/Dockerfile &&
          echo 'RUN curl --version' >>/tmp/Dockerfile &&
          podman build -f /tmp/Dockerfile .
      resources:
        limits:
          smarter-devices/fuse: 1
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        privileged: false
Just as a heads-up, I'm about to go on vacation. Trying to find an owner for this.
Hi @universam1, we are currently working on a long-term solution to this by allowing users to specify these kinds of custom device requests at the instance type level, as @ellistarn has mentioned. We will have a design doc out shortly about how we are thinking of achieving this and will share it with the community. We will update this thread when the doc is out; we would love to receive your feedback on the PR.
Thank you @jonathan-innis for the heads-up and the feedback about the roadmap. Looking forward to your solution, happy to test it out!
Closing in favor of #2390
Since the long-term solution may take some time, as discussed in #2390, is there any chance this or some variation of it can be revived as an intermediate solution?
@ellistarn @jonathan-innis Trying to get a sense of the time frame to expect for your long-term solution. We are wondering whether we should maintain a fork with an intermediate solution for the time being, or whether there are other options.
I think my biggest concern here is with the deprovisioning algorithm when assuming unlimited resources. Consolidation would assume that it could consolidate more pods down into a single node than it might actually be able to, at which point it would effectively hang, since not all pods that we anticipated could schedule to the new node would actually be able to schedule. @universam1 I'm curious what you have seen in your environment when running your forked version with consolidation. Have you seen consolidation just sit there after making a bad decision based on the infinite-resource assumption, or at least over-provision nodes for the number of pods on the cluster?

Another potential remedy here is assuming the extended resource has some fixed value by default (like 1), but again this doesn't work as a one-size-fits-all solution, as there are definitely other extended resources and workloads that will need to request more than one of a resource and would therefore never schedule.

There's of course still the more complicated "specify all extended resource type mappings to instance types" solution, but we're a little wary of introducing a more complex solution that people will most likely take a dependency on when we know it probably doesn't work as the long-term, robust solution.
Fixes kubernetes-sigs/karpenter#751
Description
Karpenter is negative towards custom device requests it is unaware of, erroneously assuming those cannot be scheduled. Instead, unmanaged resources should be ignored from Karpenter's perspective on a best-effort basis.
This changes the request handling to be scoped only to resource requests that Karpenter is actively managing.
The reasoning is that Karpenter cannot influence those resource requests anyway; they come into existence by means of other mechanisms, such as the device-plugin manager, which may even be late-bound, and are thus out of Karpenter's scope.
Hence, instead of being negative, it should be neutral regarding unknown resources.
This is related to the recent concept change of not binding pods to a node but letting the control plane perform the final scheduling, which takes runtime-available resources into account.
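As a minimal sketch of the scoping idea - hypothetical helper names, not the code in this PR - unknown extended resources would be filtered out of the requests Karpenter reasons about, and the same filter would have to be applied to both new and in-flight nodes so they agree on schedulability:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// knownResources is an assumed allow-list of resources Karpenter actively manages.
var knownResources = map[v1.ResourceName]bool{
	v1.ResourceCPU:    true,
	v1.ResourceMemory: true,
	"nvidia.com/gpu":  true,
}

// stripUnknown drops requests Karpenter cannot influence, e.g. device-plugin
// resources that only appear after the plugin registers on the node.
func stripUnknown(requests v1.ResourceList) v1.ResourceList {
	out := v1.ResourceList{}
	for name, quantity := range requests {
		if knownResources[name] {
			out[name] = quantity
		}
	}
	return out
}

func main() {
	podRequests := v1.ResourceList{
		v1.ResourceCPU:         resource.MustParse("100m"),
		"smarter-devices/fuse": resource.MustParse("1"),
	}
	// Both the new-node and in-flight-node scheduling paths would call the same
	// filter, so they cannot disagree about whether the pod fits.
	fmt.Println(stripUnknown(podRequests))
}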
How was this change tested?
Does this change impact docs?
Release Note
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.