fix: permit custom device requests #2161
Conversation
@ellistarn I created a suggested fix as requested - tested it in our lab, where it works and solves our use case with custom devices. Of course, this is a naive implementation, as I am new to the project.
Thanks for the PR @universam1. The main concern I'd have with this feature is that it's possible to get into scheduling loops if the node fails to populate unknown custom resources. We need to make sure that both the scheduling.Node and scheduling.InFlightNode agree about the schedulability of the pod. If for some reason the custom resources don't get advertised on the node, we either need to terminate the node (via initialization failure, https://github.com/aws/karpenter/blob/main/pkg/controllers/node/initialization.go#L72) or leave the node around (and the pods would stay stuck pending). I'm curious what @tzneal thinks about this. As mentioned on the issue, I'd really like to explore the design space of doing this properly (via a configmap or CRD). Perhaps @billrayburn can help us with the roadmap on this.
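To make the first option concrete, here is a minimal hypothetical sketch in plain Go - not the actual logic in pkg/controllers/node/initialization.go - of treating a node whose expected extended resource never appears as an initialization failure and terminating it rather than leaving pods pending:

package main

import (
	"fmt"
	"time"
)

type node struct {
	createdAt   time.Time
	allocatable map[string]int64 // resource name -> advertised quantity
}

// initializationFailed reports whether the node exceeded its startup window
// without advertising every extended resource the provisioner counted on.
// The window and the expected-resource list are assumptions for illustration.
func initializationFailed(n node, expected []string, startupWindow time.Duration) bool {
	if time.Since(n.createdAt) < startupWindow {
		return false // still give the device plugin time to register
	}
	for _, name := range expected {
		if _, ok := n.allocatable[name]; !ok {
			return true // e.g. hardware failure, or the device-plugin daemonset never ran
		}
	}
	return false
}

func main() {
	n := node{
		createdAt:   time.Now().Add(-20 * time.Minute),
		allocatable: map[string]int64{"cpu": 4000},
	}
	if initializationFailed(n, []string{"smarter-devices/fuse"}, 15*time.Minute) {
		fmt.Println("terminate node: expected extended resource never appeared")
	}
}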
Force-pushed from d03dd68 to e25be74
Karpenter is negative towards custom device requests it is unaware of, assuming those cannot be scheduled. Fixes #1900. This changes the request handling to be scoped only to resource requests that Karpenter is aware of and actively manages. The reasoning is that Karpenter cannot influence those resource requests anyway; they come into existence by means of other mechanisms, such as the device-plugin manager, which may even be late-bound, and thus they are out of scope.
Force-pushed from e25be74 to b0a4520
Thank you @ellistarn for reviewing and for the context you provided - much appreciated, also the hint about scheduling.InFlightNode. This is now improved and I added a unit test. Confirmed that it works in our lab clusters.
Valid point, however I cannot see how this differs from a whitelisted (CRD) concept for this issue. If a request is whitelisted and the resource is not populated, the same potential scheduling loop would arise, no? Isn't this already the case for GPU requests, where this can happen too?
If I may reason that properly™ might look different from the operator's perspective:
Ultimately, this missing support for custom devices is a blocking issue for us and we are happy to help get it solved, thanks!
IIUC, this is because cluster autoscaler works by discovering extended resources on node groups from preexisting nodes. If you want to scale from zero (which karpenter does), you have to annotate this using ASG tags.
Do you have a pointer to this? I'm less familiar.
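For reference, this is the ASG tag convention that, to my understanding, cluster autoscaler uses to learn about extended resources when scaling a node group from zero, shown as a small illustrative Go snippet (the fuse resource name is just the example from this thread):

package main

import "fmt"

func main() {
	// Assumed convention: cluster-autoscaler node-template tags on the ASG
	// advertise extended resources that no live node exists to report yet.
	asgTags := map[string]string{
		"k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse": "1",
	}
	for key, value := range asgTags {
		fmt.Printf("%s=%s\n", key, value)
	}
}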
		},
	}))
	// requesting a resource of quantity zero of a type unsupported by any instance is fine
	ExpectScheduled(ctx, env.Client, pod[0])
})
It("should be neutral regarding custom, unknown resource requests", func() {
It("should be neutral regarding custom, unknown resource requests", func() { | |
It("should ignore unknown resource requests", func() { |
I think this could be achieved by modifying the
As mentioned above, the other drawback I can think of is that we can't track node initialization of extended resources, so we will be unable to automatically fix (by terminating) a node that comes online but cannot register its extended resource due to a hardware failure or similar. Additionally, you'll need to ensure that your
The other drawback I just realized is that if we consider existing or in-flight capacity to have an infinite amount of a custom resource, Karpenter will fail to provision beyond the first node (unless other resources on the node are full). If we assume infinite for new nodes and zero for existing ones, then Karpenter will launch nodes as fast as it can while the pod remains pending. I don't see a maintainable path forward with the assumption as proposed, as there are enough common cases that break. I'd like to pursue an instance type override approach, but this has some caveats with multi-OS clusters. It may be worth bringing this idea to the working group to discuss further.
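To illustrate the two failure modes with a toy fit check (hypothetical names, not Karpenter's actual scheduler code):

package main

import "fmt"

type node struct {
	allocatable map[string]int64
}

// fits reports whether the requests fit on the node; unknownIsUnlimited decides
// how to treat resources the node does not (yet) advertise.
func fits(n node, requests map[string]int64, unknownIsUnlimited bool) bool {
	for name, requested := range requests {
		available, known := n.allocatable[name]
		if !known {
			if unknownIsUnlimited {
				continue // assume it will appear later (device plugin, etc.)
			}
			return false // assume zero: the pod can never fit here
		}
		if requested > available {
			return false
		}
	}
	return true
}

func main() {
	inFlight := node{allocatable: map[string]int64{"cpu": 4000}}
	requests := map[string]int64{"cpu": 100, "smarter-devices/fuse": 1}

	// Unlimited assumption: every pending replica "fits" the first node,
	// so no further capacity gets provisioned even if the device plugin caps out.
	fmt.Println(fits(inFlight, requests, true)) // true

	// Zero assumption: the pod never fits any node Karpenter already knows about,
	// so it keeps launching new nodes while the pod stays pending.
	fmt.Println(fits(inFlight, requests, false)) // false
}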
I understand that it might not be the perfect solution, but it certainly unblocks the showstopper for now. In our case we are running this fork in production clusters; otherwise we'd have to roll back from Karpenter.
What time frame are you considering, since this is a blocking issue for certain users?
Can you share some setup instructions so I can try to reproduce?
EKS cluster, containerd node runtime. Run the smarter-device-manager DaemonSet to provide the fuse device. Run an unprivileged podman pod with a device request for /dev/fuse. Build a container inside the podman pod.
Any YAML / links you can share would help me get to this faster :D
Sure @ellistarn, this would be the test setup:
spec file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: smarter-device-manager
  namespace: kube-system
data:
  conf.yaml: |
    - devicematch: ^fuse$
      nummaxdevices: 1000
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: smarter-device-manager
  namespace: kube-system
  labels:
    name: smarter-device-manager
spec:
  selector:
    matchLabels:
      name: smarter-device-manager
  template:
    metadata:
      labels:
        name: smarter-device-manager
    spec:
      hostname: smarter-device-management
      hostNetwork: true
      containers:
        - name: smarter-device-manager
          image: registry.gitlab.com/arm-research/smarter/smarter-device-manager:v1.20.10
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: dev-dir
              mountPath: /dev
            - name: sys-dir
              mountPath: /sys
            - name: config
              mountPath: /root/config
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev-dir
          hostPath:
            path: /dev
        - name: sys-dir
          hostPath:
            path: /sys
        - name: config
          configMap:
            name: smarter-device-manager
---
apiVersion: v1
kind: Pod
metadata:
  name: testpod
  namespace: default
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.sh/provisioner-name
                operator: Exists
  containers:
    - name: testpod
      image: quay.io/podman/stable:latest
      command: ["/bin/sh"]
      args:
        - -c
        - >-
          id &&
          echo 'FROM alpine' >/tmp/Dockerfile &&
          echo 'RUN id' >>/tmp/Dockerfile &&
          echo 'RUN apk add -U curl' >>/tmp/Dockerfile &&
          echo 'RUN curl --version' >>/tmp/Dockerfile &&
          podman build -f /tmp/Dockerfile .
      resources:
        limits:
          smarter-devices/fuse: 1
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        privileged: false
Just as a heads-up, I'm about to go on vacation. Trying to find an owner for this.
Hi @universam1, we are currently working on a long-term solution to this by allowing users to specify these kinds of custom device requests at the instance type level, as @ellistarn has mentioned. We will have a design doc out shortly about how we are thinking of achieving this and will share it with the community. We will update this thread when the doc is out; we would love to receive your feedback on the PR.
Thank you @jonathan-innis for the heads-up and the feedback about the roadmap. Looking forward to your solution, happy to test it out!
Closing in favor of #2390
Since the long-term solution may take some time, as discussed in #2390, is there any chance this or some variation of it can be revived as an intermediate solution?
@ellistarn @jonathan-innis Trying to get a sense of the time frame to expect for your long-term solution. We are wondering whether we should maintain a fork with an intermediate solution for the time being, or whether there are other options.
I think my biggest concern here is with the deprovisioning algorithm when assuming unlimited resources. Consolidation would assume that it could consolidate more pods down into a single node than it might actually be able to, at which point it would effectively hang, since not all pods that we anticipated could schedule to the new node would actually be able to schedule. @universam1 I'm curious what you have seen in your environment when running your forked version with consolidation. Have you seen consolidation just sit there after making a bad decision based on the infinite-resource assumption, or at least over-provision nodes for the number of pods on the cluster?

Another potential remedy here is assuming the extended resource has some fixed value by default (like 1), but again this doesn't work as a one-size-fits-all solution, as there are definitely other extended resources and workloads that will need to request more than one of a resource and would therefore never schedule.

There's of course still the more complicated "specify all extended resource type mappings to instance types" solution, but we're a little wary of introducing a more complex solution that people will most likely take a dependency on when we know it probably doesn't work as the long-term, robust solution.
Fixes kubernetes-sigs/karpenter#751
Description
Karpenter is negative towards custom device requests it is unaware of, erroneously assuming those cannot be scheduled. Instead, unmanaged resources should be ignored from Karpenter's perspective on a best-effort basis.
This changes the request handling to be scoped only to resource requests that Karpenter is actively managing.
The reasoning is that Karpenter cannot influence those resource requests anyway; they come into existence by means of other mechanisms, such as the device-plugin manager, which may even be late-bound, and are thus out of Karpenter's scope.
Hence, instead of being negative, it should be neutral regarding unknown resources.
This is related to the recent concept change of not binding pods to a node but letting the control plane perform the final scheduling, which takes runtime-available resources into account.
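As a minimal sketch of the scoping idea - hypothetical helper names, not the code in this PR - unknown extended resources would be filtered out of the requests Karpenter reasons about, and the same filter would have to be applied to both new and in-flight nodes so they agree on schedulability:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// knownResources is an assumed allow-list of resources Karpenter actively manages.
var knownResources = map[v1.ResourceName]bool{
	v1.ResourceCPU:    true,
	v1.ResourceMemory: true,
	"nvidia.com/gpu":  true,
}

// stripUnknown drops requests Karpenter cannot influence, e.g. device-plugin
// resources that only appear after the plugin registers on the node.
func stripUnknown(requests v1.ResourceList) v1.ResourceList {
	out := v1.ResourceList{}
	for name, quantity := range requests {
		if knownResources[name] {
			out[name] = quantity
		}
	}
	return out
}

func main() {
	podRequests := v1.ResourceList{
		v1.ResourceCPU:         resource.MustParse("100m"),
		"smarter-devices/fuse": resource.MustParse("1"),
	}
	// Both the new-node and in-flight-node scheduling paths would call the same
	// filter, so they cannot disagree about whether the pod fits.
	fmt.Println(stripUnknown(podRequests))
}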
How was this change tested?
Does this change impact docs?
Release Note
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.