
fix: permit custom device requests #2161

Closed
wants to merge 1 commit

Conversation

@universam1 commented Jul 21, 2022

Fixes kubernetes-sigs/karpenter#751

Description

Karpenter currently rejects custom device requests it is unaware of, erroneously assuming they cannot be scheduled. Instead, Karpenter should ignore unmanaged resources on a best-effort basis.

This change scopes request handling to only the resource requests that Karpenter actively manages.
The reasoning is that Karpenter cannot influence the other resource requests anyway; they come into existence through other mechanisms, such as the device plugin manager, which may be late-bound and is therefore outside Karpenter's scope.
Hence, instead of rejecting unknown resources, Karpenter should be neutral toward them.

This is related to the recent design change of not binding pods to a node directly, but letting the control plane perform the final scheduling, which takes the resources available at runtime into account.
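
As a rough sketch of the idea (hypothetical helper and set names, not the literal code in this PR), the pod's requests would be filtered down to the resource names Karpenter itself tracks before simulating scheduling:

// requestsManagedByKarpenter drops any resource request whose name Karpenter does
// not manage itself (e.g. device-plugin resources such as smarter-devices/fuse),
// so scheduling simulation only considers resources Karpenter can influence.
// v1 = k8s.io/api/core/v1, sets = k8s.io/apimachinery/pkg/util/sets
func requestsManagedByKarpenter(requests v1.ResourceList, managed sets.String) v1.ResourceList {
	filtered := v1.ResourceList{}
	for name, quantity := range requests {
		if managed.Has(string(name)) {
			filtered[name] = quantity
		}
	}
	return filtered
}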

How was this change tested?

  • integration tests in AWS lab account

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

fix: support custom device requests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@universam1 requested a review from a team as a code owner July 21, 2022 05:07
@universam1 requested a review from spring1843 July 21, 2022 05:07
netlify bot commented Jul 21, 2022

Deploy Preview for karpenter-docs-prod canceled.

Latest commit: b0a4520
Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62da324ba4665a0008c4af22

@universam1 (Author)

@ellistarn I created a suggested fix as requested and tested it in our lab, where it works and solves our use case with custom devices. Of course, this is a naive implementation, as I am new to the project.
Thanks for any feedback!

@ellistarn (Contributor)

Thanks for the PR @universam1. The main concern I'd have with this feature is that it's possible to get into scheduling loops if the node fails to populate unknown custom resources. We need to make sure that both the scheduling.Node and scheduling.InFlightNode agree about the schedulability of the pod.

If for some reason the custom resources don't get advertised on the node, we either need to terminate the node (via initialization failure https://github.com/aws/karpenter/blob/main/pkg/controllers/node/initialization.go#L72) or leave the node around (and the pods would stay stuck pending).

I'm curious what @tzneal thinks about this.

As mentioned on the issue, I'd really like to explore the design space of doing this properly (via a configmap or CRD). Perhaps @billrayburn can help us with the roadmap on this.

@universam1 force-pushed the custom-devices branch 2 times, most recently from d03dd68 to e25be74 on July 22, 2022 05:05
Karpenter currently rejects custom device requests it is unaware of, assuming they cannot be scheduled.

fixes #1900

This changes the request handling to be scoped to only the resource requests that Karpenter is aware of and actively manages. The reasoning is that Karpenter cannot influence the other resource requests anyway; they come into existence through other mechanisms, such as the device-plugin manager, which may even be late-bound and is therefore out of scope.
@universam1 (Author)

Thank you @ellistarn for reviewing and for the context provided, much appreciated, and also for the hint about scheduling.InFlightNode. This is now handled, and I added a unit test. I confirmed that it works in our lab clusters.

it's possible to get into scheduling loops if the node fails to populate unknown custom resources.

Valid point; however, I cannot see how a whitelist (CRD) concept would differ on this issue. If a request is whitelisted but the resource is never populated, the same potential loop would arise, no? Isn't this already the case for GPU requests, where this can happen too?

doing this properly (via a configmap or CRD)

If I may, "properly™" might look different from an operator's perspective:

  • Having an extra whitelist is clearly overhead. As an operator, I would then have to form a contract with the developers about which provisioner they are to use, which adds yet another place to keep in sync. A deployment would need to request a specific provisioner to work with Karpenter, whereas it just works out of the box with Cluster Autoscaler. This PR demonstrates that this overhead is not necessary.
  • Cluster Autoscaler handles this transparently, gracefully ignoring unknown requests without needing a specific whitelist, so Karpenter falls behind here.
  • A whitelist requires a CRD change, which might be harder to land than this approach?

Ultimately, the missing support for custom devices is a blocking issue for us, and we are happy to help get it solved, thanks!

@ellistarn (Contributor) commented Jul 22, 2022

A deployment would need to request a specific provisioner to work with Karpenter, whereas it just works out of the box with Cluster Autoscaler. This PR demonstrates that this overhead is not necessary.

IIUC, this is because cluster autoscaler works by discovering extended resources on node groups from preexisting nodes. If you want to scale from zero (which karpenter does), you have to annotate this using ASG tags.
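
For reference, the Cluster Autoscaler scale-from-zero convention advertises extended resources on the ASG via tags roughly of this form (resource name taken from this use case; exact key format per the Cluster Autoscaler AWS documentation):

k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse: "1"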

Cluster Autoscaler handles this transparently, gracefully ignoring unknown requests without needing a specific whitelist, so Karpenter falls behind here.

Do you have a pointer to this? I'm less familiar.

},
}))
// requesting a resource of quantity zero of a type unsupported by any instance is fine
ExpectScheduled(ctx, env.Client, pod[0])
})
It("should be neutral regarding custom, unknown resource requests", func() {

Suggested change:
- It("should be neutral regarding custom, unknown resource requests", func() {
+ It("should ignore unknown resource requests", func() {

@ellistarn (Contributor)

I think this could be achieved by modifying the Fits function:

// Fits returns true if the candidate set of resources is less than or equal to the total set of resources.
func Fits(candidate, resources v1.ResourceList) bool {
	for resourceName, quantity := range candidate {
                // -----------------------------------------> If unknown, don't even check and fall through to true below
		if available, ok := resources[resourceName]; ok && Cmp(quantity, available) > 0 {
			return false
		}
	}
	return true
}
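
As an illustration of that suggestion, a minimal sketch of the modification (not necessarily the exact change in this PR):

// Fits returns true if the candidate set of resources is less than or equal to the
// total set of resources. Resource names missing from the node's resource list are
// treated as unknown and skipped instead of failing the check.
func Fits(candidate, resources v1.ResourceList) bool {
	for resourceName, quantity := range candidate {
		available, ok := resources[resourceName]
		if !ok {
			// Unknown (unmanaged) resource: ignore it and keep checking the rest.
			continue
		}
		if Cmp(quantity, available) > 0 {
			return false
		}
	}
	return true
}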

@ellistarn (Contributor)

As mentioned above, the other drawback I can think of is that we can't track node initialization of extended resources, so we will be unable to automatically terminate, and thereby fix, a node that comes online but cannot register its extended resources due to a hardware failure or similar. Additionally, you'll need to ensure that your ttlSecondsAfterEmpty is set high enough that the node isn't terminated, since we'll start counting the node as initialized before it actually is.
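
For context, ttlSecondsAfterEmpty lives on the Provisioner; a minimal sketch with an illustrative value (not a recommendation):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Illustrative value: keep this high enough that a node still waiting for its
  # device plugin to register extended resources is not reaped as "empty".
  ttlSecondsAfterEmpty: 300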

@ellistarn (Contributor)

The other drawback I just realized is that if we consider existing or in-flight capacity to have an infinite amount of a custom resource, Karpenter will fail to provision beyond the first node (unless other resources on the node are full). If we assume infinite for new nodes and zero for existing ones, then Karpenter will launch nodes as fast as it can while the pod remains pending.

I don't see a maintainable path forward with the assumption as proposed, as there are enough common cases that break.

I'd like to pursue an instance type override approach, but this has some caveats with multi-OS clusters. It may be worth bringing this idea to the working group to discuss further.

@universam1 (Author)

I don't see a maintainable path forward with the assumption as proposed, as there are enough common cases that break.

I understand that it might not be the perfect solution, but it certainly unblocks the showstopper for now. In our case we are running this fork in production clusters; otherwise we would have to roll back from Karpenter.
The issues you mentioned might appear, but at least for us they do not. One could see this as an iterative solution: it does not involve any API changes, so every path forward remains possible...

I'd like to pursue an instance type override approach,

What time frame are you considering, since this is a blocking issue for certain users?

@ellistarn (Contributor)

I understand that it might not be the perfect solution, but it certainly unblocks the showstopper for now. In our case we are running this fork in production clusters; otherwise we would have to roll back from Karpenter.

Can you share some setup instructions and I can try to reproduce?

@universam1 (Author) commented Jul 29, 2022

Can you share some setup instructions and I can try to reproduce?

EKS cluster with the containerd node runtime. Run the smarter-device-manager DaemonSet to provide the fuse device. Run an unprivileged Podman pod with a device request for /dev/fuse. Build a container inside the Podman pod.
Need a spec file?

@ellistarn (Contributor)

Need a spec file?

Any yaml / links you can share would help me get to this faster :D.

@universam1 (Author)

Any yaml / links you can share would help me get to this faster :D.

Sure @ellistarn, this would be the test setup:

  • EKS
  • containerd node runtime
  • deploy the smarter-device-manager and the test pod below

spec file:
apiVersion: v1
kind: ConfigMap
metadata:
  name: smarter-device-manager
  namespace: kube-system
data:
  conf.yaml: |
    - devicematch: ^fuse$
      nummaxdevices: 1000
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: smarter-device-manager
  namespace: kube-system
  labels:
    name: smarter-device-manager
spec:
  selector:
    matchLabels:
      name: smarter-device-manager
  template:
    metadata:
      labels:
        name: smarter-device-manager
    spec:
      hostname: smarter-device-management
      hostNetwork: true
      containers:
      - name: smarter-device-manager
        image: registry.gitlab.com/arm-research/smarter/smarter-device-manager:v1.20.10
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: dev-dir
            mountPath: /dev
          - name: sys-dir
            mountPath: /sys
          - name: config
            mountPath: /root/config
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev-dir
          hostPath:
            path: /dev
        - name: sys-dir
          hostPath:
            path: /sys
        - name: config
          configMap:
            name: smarter-device-manager
---
apiVersion: v1
kind: Pod
metadata:
  name: testpod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: karpenter.sh/provisioner-name
            operator: Exists
  containers:
    - name: testpod
      image: quay.io/podman/stable:latest
      command: ["/bin/sh"]
      args:
        - -c
        - >-
            id &&
            echo 'FROM alpine' >/tmp/Dockerfile &&
            echo 'RUN id' >>/tmp/Dockerfile &&
            echo 'RUN apk add -U curl' >>/tmp/Dockerfile &&
            echo 'RUN curl --version' >>/tmp/Dockerfile &&
            podman build -f /tmp/Dockerfile .
      resources:
        limits:
          smarter-devices/fuse: 1
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        privileged: false

@ellistarn (Contributor)

Just as a heads up, I'm about to go on vacation. Trying to find an owner for this.

@jonathan-innis (Contributor)

Hi @universam1, we are currently working on a long-term solution to this by allowing users to specify these kinds of custom device requests at the instance type level, as @ellistarn mentioned. We will have a design doc out shortly describing how we are thinking about achieving this and will share it with the community. Will update this thread when we have the doc out, would love to receive your feedback on the PR

@universam1 (Author)

Will update this thread when we have the doc out, would love to receive your feedback on the PR

Thank you @jonathan-innis for the heads-up and the feedback about the roadmap. Looking forward to your solution, happy to test it out!

@ellistarn (Contributor)

Closing in favor of #2390

@ellistarn closed this Oct 10, 2022
@cvpatel commented Apr 10, 2023

As the long-term solution may take some time, as discussed in #2390, any chance this or some variation of it can be revived as an intermediate solution?

@ellistarn @universam1

@universam1 (Author)

As the long-term solution may take some time, as discussed in #2390, any chance this or some variation of it can be revived as an intermediate solution?

@ellistarn @jonathan-innis We are trying to get a feeling for the time frame to expect for your long-term solution. We are wondering whether we should maintain a fork with an intermediate solution for the time being, or whether there are other options.
As an idea, could we guard this naive implementation behind an alpha flag to unblock those users willing to take the risk for the time being? Would that be a feasible approach?

@jonathan-innis (Contributor)

unblock those users willing to take the risk for the time being

I think my biggest concern here is with the deprovisioning algorithm when assuming unlimited resources. Consolidation would assume it could consolidate more pods onto a single node than it actually can, at which point it would effectively hang, since not all of the pods we anticipated could schedule to the new node would actually be able to schedule. @universam1 I'm curious what you have seen in your environment when running your forked version with consolidation. Have you seen consolidation just sit there after making a bad decision based on the infinite-resource assumption, or at least over-provision nodes relative to the number of pods on the cluster?

Another potential remedy here is assuming the extended resource has some fixed value by default (like 1), but again this doesn't work as a one-size-fits-all solution: there are certainly other extended resources and workloads that will need to request more than one of a resource and would therefore never schedule.
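
A sketch of that remedy (hypothetical helper, illustrative default of 1):

// quantityOrDefault returns the advertised quantity of an extended resource, or an
// assumed fixed default of 1 when the resource is unknown to Karpenter.
// v1 = k8s.io/api/core/v1, resource = k8s.io/apimachinery/pkg/api/resource
func quantityOrDefault(resources v1.ResourceList, name v1.ResourceName) resource.Quantity {
	if available, ok := resources[name]; ok {
		return available
	}
	return resource.MustParse("1") // assumed default; workloads needing more than one would never schedule
}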

There is, of course, still the more complicated "specify all extended resource type mappings to instance types" solution, but we are a little wary of introducing a more complex solution that people will most likely take a dependency on when we know it probably isn't the robust long-term solution.

Successfully merging this pull request may close these issues.

Mega Issue: Karpenter doesnt support custom resources requests/limit