Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano #742
Hey @vara-bonthu, while I have not personally tested a custom scheduler with Karpenter, it should at least be able to launch nodes even if a custom scheduler is in use (Karpenter simply watches for pending pods and then spawns nodes accordingly). That said, as you mentioned, Karpenter is built to adhere to the scheduling decisions of kube-scheduler, so it's certainly possible you would run across cases where Karpenter makes incorrect decisions when a custom scheduler is in the mix. If you have a configuration you could share, it would be fun to do some testing with custom schedulers to see how Karpenter responds.
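To illustrate the mechanism (a minimal sketch, not from this thread; all names are hypothetical): a pod that opts into a custom scheduler via `schedulerName` still sits Pending until something places it, and an unschedulable Pending pod is exactly what Karpenter reacts to.

```yaml
# Minimal sketch: a pod handed to YuniKorn via schedulerName. While it sits
# Pending/unschedulable, Karpenter can still notice it and launch a node,
# since Karpenter watches pending pods rather than hooking into a scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: yunikorn-demo            # hypothetical name
spec:
  schedulerName: yunikorn        # bypass kube-scheduler in favor of YuniKorn
  containers:
    - name: main
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```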
I'd love to learn how custom schedulers make different decisions than the kube-scheduler. Technically, we're agnostic of the kube-scheduler, but we support the pod spec fields that impact scheduling. Do custom schedulers respect all of those fields? Could you provide a concrete example workflow of the decisions you'd like to see Karpenter make when working alongside a custom scheduler?
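To make "pod spec fields that impact scheduling" concrete, here is an illustrative (hypothetical) pod using the main fields in question; a custom scheduler would need to honor the same fields for its placements to line up with the nodes Karpenter launches.

```yaml
# Illustrative only: the pod-spec fields that drive scheduling simulations.
apiVersion: v1
kind: Pod
metadata:
  name: constrained-demo                     # hypothetical name
  labels:
    app: demo
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand    # well-known Karpenter node label
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64"]
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: demo
  tolerations:
    - key: example.com/dedicated             # hypothetical taint key
      operator: Exists
  containers:
    - name: main
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "1"
```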
I've tested Karpenter with Volcano successfully. There was one issue (PR to fix in volcano-sh/volcano#2602) that was causing Volcano to use an unconventional …
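For readers unfamiliar with Volcano, a minimal gang-scheduled Job of the kind such a test would exercise might look like the sketch below (not the poster's actual manifest; names and sizes are illustrative). `minAvailable` makes Volcano hold all replicas Pending until they can start together, and those Pending pods are what Karpenter provisions for.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo                # hypothetical name
spec:
  schedulerName: volcano
  minAvailable: 4                # gang size: all 4 pods start together or not at all
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: public.ecr.aws/docker/library/busybox:latest
              command: ["sleep", "60"]
              resources:
                requests:
                  cpu: "2"
```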
I think the initial issue that I encountered could be due to the older version of YuniKorn.
[Scheduler error summary and full YuniKorn scheduler log omitted.]
However, in my recent tests with the following setup, I can confirm that Karpenter works well with the Apache YuniKorn scheduler.
EMR on EKS Spark jobs are working as expected with Apache YuniKorn gang scheduling alongside Karpenter autoscaling. If you are interested in how to set up Karpenter with Apache YuniKorn (with gang scheduling), you can refer to the Data on EKS docs (https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/emr-eks-karpenter). [Apache YuniKorn scheduling logs and Karpenter logs for an EMR on EKS Spark job with 1 driver and 20 executors omitted.]
The Apache YuniKorn placeholder pods triggered Karpenter to launch the nodes required to run the Spark job. Happy to close this issue.
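For context (a hedged sketch, not the exact manifest from the test above): YuniKorn gang scheduling is typically declared with annotations on the Spark driver pod. YuniKorn then creates placeholder pods for the executor task group, and it is those Pending placeholders that trigger Karpenter to launch nodes ahead of the executors.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-demo                      # hypothetical name
  labels:
    applicationId: spark-demo-001              # illustrative Spark app id
  annotations:
    yunikorn.apache.org/task-group-name: spark-driver
    yunikorn.apache.org/task-groups: |
      [{
        "name": "spark-executor",
        "minMember": 20,
        "minResource": {"cpu": "1", "memory": "4Gi"}
      }]
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
spec:
  schedulerName: yunikorn
  containers:
    - name: driver
      image: apache/spark:3.5.0                # illustrative image
      resources:
        requests:
          cpu: "1"
          memory: 4Gi
```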
@tgaddair Do you mind sharing your experience using Volcano with cluster autoscalers? We are building a platform supporting multi-tenant jobs and planning to use Volcano for gang scheduling; however, there is a very limited amount of information on best practices for scaling clusters while supporting gang scheduling, even in the Volcano docs. What made you choose Karpenter over cluster-autoscaler? If you have been successfully running a Volcano + Karpenter setup, what has your experience been? This topic could be a separate post somewhere; it would be pretty valuable.
@tgaddair Please share your Volcano + Karpenter setup when you get a chance. It would be very valuable for others exploring the two and would save many people a lot of effort.
Does Karpenter have plans to support something like this (generally for AI jobs)?
cc @ellistarn
Plus one on this question. To get correct gang scheduling behavior with on-demand compute, the underlying cloud provider has to support "all-or-nothing" provisioning of sets of VMs. (You should be able to ask for N VMs in a single request and either get N or 0, but nothing in between.) GCP makes this possible with Dynamic Workload Scheduler. I'm not sure what the status is for AWS and other cloud providers.
Tell us about your request
Add Karpenter support to work with custom schedulers (e.g., Apache YuniKorn, Volcano).
As per my understanding, Karpenter works only with the default scheduler to schedule pods. However, it is common in the Data on Kubernetes community to use custom schedulers like Apache YuniKorn or Volcano for running Spark jobs on Amazon EKS.
With the requested feature, Karpenter would effectively be used as the autoscaler for spinning up new nodes while YuniKorn or Volcano handles the scheduling decisions (a sketch of this division of labor follows below).
Please correct me and provide some context if this feature is already supported.
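To make the request concrete, the intended division of labor could look roughly like the sketch below: Karpenter owns capacity while every pod carries `schedulerName: yunikorn` (or `volcano`) so the custom scheduler owns placement. This uses Karpenter's NodePool API (older Karpenter versions used a Provisioner CRD); all names and limits are hypothetical.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch                                  # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:                            # AWS-specific node class (illustrative)
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                                # cap total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
```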
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Using Apache YuniKorn or Volcano is becoming a basic requirement for running batch workloads (e.g., Spark) on Kubernetes. These schedulers are more application-aware than the default scheduler and provide a number of other useful features (e.g., resource queues, job sorting) for running multi-tenant data workloads on Kubernetes (Amazon EKS).
At the moment we can only use Cluster Autoscaler with these custom schedulers, but it would be beneficial to add Karpenter support in order to leverage Karpenter's performance advantages over Cluster Autoscaler.
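As an illustration of the resource-queue and job-sorting features mentioned above, a YuniKorn queue configuration (a fragment of the queues.yaml typically shipped via the yunikorn-configs ConfigMap) might look like this hypothetical sketch; queue names, limits, and units are illustrative.

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: spark                        # hypothetical tenant queue
            submitacl: "*"                     # who may submit to this queue
            properties:
              application.sort.policy: fifo    # order jobs within the queue
            resources:
              max:                             # illustrative queue capacity cap
                vcore: "200"
                memory: "800Gi"
```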
Are you currently working around this issue?
No. We are using Cluster Autoscaler as an alternative option to work with custom schedulers.