
Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano #742

Open
vara-bonthu opened this issue Sep 16, 2022 · 15 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@vara-bonthu

vara-bonthu commented Sep 16, 2022

Tell us about your request

  • Add Karpenter support to work with custom schedulers (e.g., Apache YuniKorn, Volcano)

  • As per my understanding, Karpenter works only with the default scheduler to schedule pods. However, it's prevalent among the Data on Kubernetes community to use custom schedulers like Apache YuniKorn or Volcano for running Spark jobs on Amazon EKS.

  • With the requested feature, Karpenter would effectively be used as the autoscaler for spinning up new nodes, while YuniKorn or Volcano handles the scheduling decisions.

Please correct me and provide some context if this feature is already supported.
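
For context, a custom scheduler is selected per pod via `spec.schedulerName`, so these pods are never bound by kube-scheduler. A minimal illustrative sketch (pod name, image, and resource values are placeholders):

```yaml
# Illustrative only: a pod handled by a custom scheduler instead of kube-scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example   # placeholder name
spec:
  schedulerName: yunikorn        # or "volcano" when using the Volcano scheduler
  containers:
    - name: spark-executor
      image: apache/spark:3.4.0  # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
  restartPolicy: Never
```

While such a pod is Pending, Karpenter would still be expected to provision capacity for it, even though the final placement decision is made by YuniKorn or Volcano.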

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Using Apache YuniKorn/Volcano is becoming a basic requirement for running batch workloads (e.g., Spark) on Kubernetes. These schedulers are more application-aware than the default scheduler and provide a number of other useful features (e.g., resource queues, job sorting) for running multi-tenant data workloads on Kubernetes (Amazon EKS).

At the moment we can only use Cluster Autoscaler with these custom schedulers, but it would be beneficial to add Karpenter support so we can leverage Karpenter's performance advantages over Cluster Autoscaler.
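
To illustrate the resource-queue point above, here is a hypothetical YuniKorn queue configuration sketch (queue names and limits are made up; check the YuniKorn docs for the exact file layout and resource unit conventions of your version):

```yaml
# Hypothetical queues.yaml fragment: per-team queues with guaranteed and
# maximum resources for multi-tenant Spark workloads. Values are illustrative.
partitions:
  - name: default
    queues:
      - name: root
        submitacl: "*"
        queues:
          - name: spark-team-a
            resources:
              guaranteed:
                memory: 100Gi
                vcore: "50"
              max:
                memory: 400Gi
                vcore: "200"
          - name: spark-team-b
            resources:
              max:
                memory: 200Gi
                vcore: "100"
```

The default scheduler has no equivalent of these per-tenant guarantees and limits, which is why teams reach for YuniKorn or Volcano for this kind of workload.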

Are you currently working around this issue?

No, we are using Cluster Autoscaler as an alternative to work with these custom schedulers.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@vara-bonthu vara-bonthu added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 16, 2022
@vara-bonthu vara-bonthu changed the title Custom k8s scheduler support for Karpenter e.g, Apache YuniKorn, Volcano Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano Sep 16, 2022
@dewjam
Contributor

dewjam commented Sep 23, 2022

Hey @vara-bonthu,
Thanks for the feature request. This feature is not currently supported and is not yet on our roadmap. It sounds like it would be a pretty significant amount of effort, given that Karpenter would have to adhere to the scheduling decisions of multiple custom schedulers.

While I have not personally tested a custom scheduler with Karpenter, it should at least be able to launch nodes even if a custom scheduler is in use (Karpenter simply watches for pending pods and then spawns nodes accordingly). That said, as you mentioned, Karpenter is built to adhere to the scheduling decisions of kube-scheduler, so it's certainly possible you would run into cases where Karpenter makes incorrect decisions when a custom scheduler is in the mix.

If you have a configuration you could share, it would be fun to do some testing with custom schedulers to see how Karpenter responds.

@ellistarn
Contributor

I'd love to learn how custom schedulers make different decisions than the kube-scheduler. Technically, we're agnostic of the kube-scheduler, but we support the pod spec fields that impact scheduling. Do custom schedulers respect all of those fields?

Could you provide a concrete example workflow of the decisions you'd like to see Karpenter make when working alongside a custom scheduler?
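
For reference, these are the kinds of pod spec fields meant above; an illustrative fragment (labels and values are placeholders) of the scheduling constraints Karpenter reads when computing what node to launch:

```yaml
# Illustrative scheduling constraints on a pod spec. Karpenter uses these to
# decide node shape and placement; the open question is whether custom
# schedulers honor the same fields. All values are placeholders.
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: spark
      operator: Exists
      effect: NoSchedule
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: spark            # placeholder label
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["r5d.4xlarge", "r5d.8xlarge"]
```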

@tgaddair

I've tested Karpenter with Volcano successfully. There was one issue (PR to fix in volcano-sh/volcano#2602) that was causing Volcano to use an unconventional Reason that prevented Karpenter from triggering scale-up, but once this PR lands things should be working again.
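
For anyone debugging a similar setup: the relevant signal is the pod's PodScheduled condition. Karpenter treats a pending pod as provisionable when the scheduler has marked it unschedulable in the conventional way, roughly the shape sketched below (status fields are illustrative); the Volcano PR above aligns Volcano's reported reason with that convention.

```yaml
# Illustrative pod status as reported by the scheduler for an unschedulable pod.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable            # the conventional reason Karpenter looks for
      message: "0/3 nodes are available: 3 Insufficient cpu."
```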

@vara-bonthu
Author

@ellistarn

I think the initial issue I encountered could be due to the older version of Apache YuniKorn (0.12.1), where I installed YuniKorn as a secondary k8s scheduler and did not enable the admission controller. The error could be caused by multiple schedulers running at the same time. Here is the log from those old tests. The good news is that it works with the latest version; please see the details below.

YuniKorn Scheduler Error Summary

ERROR  external/scheduler_cache.go:203 pod updated on a different node than previously added to
ERROR  external/scheduler_cache.go:204 scheduler cache is corrupted and can badly affect scheduling decisions

YuniKorn Scheduler full log

2022-01-31T17:12:08.558Z	INFO	cache/context.go:552	app added	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:08.559Z	INFO	cache/context.go:612	task added	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "taskID": "9d00be6e-8c41-4c35-a089-5b6932e339ac", "taskState": "New"}
2022-01-31T17:12:09.253Z	INFO	cache/application.go:436	handle app submission	{"app": "applicationID: spark-79df83d15b6843b7bb1cac31e7135e9c, queue: root.spark, partition: default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"}
2022-01-31T17:12:09.254Z	INFO	placement/tag_rule.go:114	Tag rule application placed	{"application": "spark-79df83d15b6843b7bb1cac31e7135e9c", "queue": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z	INFO	objects/queue.go:150	dynamic queue added to scheduler	{"queueName": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z	INFO	scheduler/context.go:495	Added application to partition	{"applicationID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "partitionName": "[mycluster]default", "requested queue": "root.spark", "placed queue": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z	INFO	callback/scheduler_callback.go:108	Accepting app	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:10.254Z	INFO	cache/application.go:531	Skip the reservation stage	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:11.256Z	INFO	objects/application_state.go:128	Application state transition	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "source": "New", "destination": "Accepted", "event": "runApplication"}
2022-01-31T17:12:11.256Z	INFO	objects/application.go:531	Ask added successfully to application	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "ask": "9d00be6e-8c41-4c35-a089-5b6932e339ac", "placeholder": false, "pendingDelta": "map[memory:12885 vcore:4000]"}
2022-01-31T17:12:15.360Z	INFO	cache/nodes.go:112	adding node to context	{"nodeName": "ip-10-1-10-119.eu-west-1.compute.internal", "nodeLabels": "{\"karpenter.sh/capacity-type\":\"spot\",\"karpenter.sh/provisioner-name\":\"default\",\"node.kubernetes.io/instance-type\":\"m5.4xlarge\",\"topology.kubernetes.io/zone\":\"eu-west-1a\"}", "schedulable": true}
2022-01-31T17:12:15.361Z	INFO	cache/node.go:148	node recovering	{"nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "schedulable": true}
2022-01-31T17:12:15.361Z	INFO	scheduler/partition.go:548	adding node to partition	{"partition": "[mycluster]default", "nodeID": "ip-10-1-10-119.eu-west-1.compute.internal"}
2022-01-31T17:12:15.362Z	INFO	scheduler/partition.go:613	Updated available resources from added node	{"partitionName": "[mycluster]default", "nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "partitionResource": "map[attachable-volumes-aws-ebs:50 ephemeral-storage:142784976248 hugepages-1Gi:0 hugepages-2Mi:0 memory:86159 pods:321 vcore:21850]"}
2022-01-31T17:12:15.363Z	INFO	scheduler/context.go:592	successfully added node	{"nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "partition": "[mycluster]default"}
2022-01-31T17:12:15.375Z	ERROR	external/scheduler_cache.go:203	pod updated on a different node than previously added to	{"pod": "9d00be6e-8c41-4c35-a089-5b6932e339ac"}
github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external.(*SchedulerCache).UpdatePod
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external/scheduler_cache.go:203
github.com/apache/incubator-yunikorn-k8shim/pkg/cache.(*Context).updatePodInCache
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/context.go:253
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:273
k8s.io/client-go/tools/cache.(*processorListener).run.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
2022-01-31T17:12:15.376Z	ERROR	external/scheduler_cache.go:204	scheduler cache is corrupted and can badly affect scheduling decisions
github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external.(*SchedulerCache).UpdatePod
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external/scheduler_cache.go:204
github.com/apache/incubator-yunikorn-k8shim/pkg/cache.(*Context).updatePodInCache
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/context.go:253
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:273
k8s.io/client-go/tools/cache.(*processorListener).run.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
2022-01-31T17:12:38.125Z	INFO	configs/configwatcher.go:143	config watcher timed out

However, in my recent tests with the following setup, I can confirm that Karpenter works well with the Apache YuniKorn scheduler.

Apache YuniKorn Version: 1.1.0, deployed as the default scheduler (this overrides the k8s default scheduler)
Karpenter Version: v0.20.0

EMR on EKS Spark jobs are working as expected with Apache YuniKorn gang scheduling and Karpenter autoscaling.

If you are interested in how to set up Karpenter with Apache YuniKorn (with gang scheduling), you can refer to the Data on EKS docs (https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/emr-eks-karpenter).
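
For readers setting this up, the gang-scheduling side is driven by annotations on the driver pod, roughly as sketched below (task-group names, sizes, and resources are placeholders; see the YuniKorn and Data on EKS docs for the exact values):

```yaml
# Illustrative YuniKorn gang-scheduling annotations on the Spark driver pod.
# YuniKorn creates placeholder pods for the declared task groups; those
# pending placeholders are what trigger Karpenter to launch nodes up front.
metadata:
  annotations:
    yunikorn.apache.org/task-group-name: "spark-driver"
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
    yunikorn.apache.org/task-groups: |-
      [{
        "name": "spark-driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "4Gi"}
      }, {
        "name": "spark-executor",
        "minMember": 20,
        "minResource": {"cpu": "4", "memory": "12Gi"}
      }]
spec:
  schedulerName: yunikorn
```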

EMR on EKS Spark job with 1 driver and 20 executors, scheduled by Apache YuniKorn

Karpenter logs

2022-12-22T17:50:54.846Z	DEBUG	controller.aws	deleted launch template	{"commit": "f60dacd"}
2022-12-22T17:51:25.330Z	INFO	controller.provisioner	found provisionable pod(s)	{"commit": "f60dacd", "pods": 21}
2022-12-22T17:51:25.331Z	INFO	controller.provisioner	computed new node(s) to fit pod(s)	{"commit": "f60dacd", "nodes": 2, "pods": 21}
2022-12-22T17:51:25.331Z	INFO	controller.provisioner	launching node with 6 pods requesting {"cpu":"7355m","memory":"92280Mi","pods":"9"} from types r5d.4xlarge, r5d.8xlarge	{"commit": "f60dacd", "provisioner": "spark-memory-optimized"}
2022-12-22T17:51:25.351Z	INFO	controller.provisioner	launching node with 15 pods requesting {"cpu":"18155m","memory":"230520Mi","pods":"18"} from types r5d.8xlarge	{"commit": "f60dacd", "provisioner": "spark-memory-optimized"}
2022-12-22T17:51:25.773Z	DEBUG	controller.provisioner.cloudprovider	created launch template	{"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launch-template-name": "Karpenter-emr-eks-karpenter-2497088801825500229", "launch-template-id": "lt-030a5b323b302d61a"}
2022-12-22T17:51:27.766Z	INFO	controller.provisioner.cloudprovider	launched new instance	{"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launched-instance": "i-0ef04f248f444ec79", "hostname": "ip-10-1-126-194.us-west-2.compute.internal", "type": "r5d.4xlarge", "zone": "us-west-2b", "capacity-type": "spot"}
2022-12-22T17:51:29.941Z	INFO	controller.provisioner.cloudprovider	launched new instance	{"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launched-instance": "i-0d948a89719abfacf", "hostname": "ip-10-1-67-206.us-west-2.compute.internal", "type": "r5d.8xlarge", "zone": "us-west-2b", "capacity-type": "spot"}
2022-12-22T17:59:04.511Z	INFO	controller.node	added TTL to empty node	{"commit": "f60dacd", "node": "ip-10-1-67-206.us-west-2.compute.internal"}

The Apache YuniKorn placeholder (gang scheduling) pods triggered Karpenter to launch the nodes required to run the Spark job.


Happy to close this issue.

@anovv

anovv commented Jun 10, 2023

I've tested Karpenter with Volcano successfully. There was one issue (PR to fix in volcano-sh/volcano#2602) that was causing Volcano to use an unconventional Reason that prevented Karpenter from triggering scale-up, but once this PR lands things should be working again.

@tgaddair Do you mind sharing your experience using Volcano with cluster autoscalers? We are building a platform supporting multi-tenant jobs and planning to use Volcano for gang scheduling, but there is very limited information on best practices for scaling clusters while supporting gang scheduling, even in the Volcano docs. What made you choose Karpenter over cluster-autoscaler? If you have been successfully running a Volcano + Karpenter setup, what has your experience been? This topic could be a separate post somewhere; it would be pretty valuable.
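
For context, the gang-scheduling primitive in question on the Volcano side is the PodGroup; a rough sketch (names and sizes are made up). The member pods reference the group and stay Pending until minMember of them can be placed together, and it is those pending pods that Karpenter or Cluster Autoscaler reacts to:

```yaml
# Illustrative Volcano PodGroup for a Spark-style job: all member pods stay
# Pending until at least minMember can be scheduled at the same time.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-job-example        # placeholder name
spec:
  minMember: 21                  # e.g. 1 driver + 20 executors
  minResources:
    cpu: "85"                    # placeholder aggregate request
    memory: 260Gi
  queue: default
```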

@ghost

ghost commented Aug 17, 2023

@tgaddair please share your Volcano + Karpenter setup when you get a chance. It would be very valuable for others exploring the two and would save many people a lot of effort.

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@k1rk

k1rk commented Feb 8, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2024
@k1rk

k1rk commented May 13, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2024
@jwcesign
Contributor

Does Karpenter have a plan to support something like this (generally for AI jobs)?
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md
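
For concreteness, the linked proposal introduces a ProvisioningRequest object along these lines; a hedged sketch based on that proposal (field values are illustrative, and the API group/version may differ between releases):

```yaml
# Illustrative ProvisioningRequest from the cluster-autoscaler proposal:
# request capacity for a set of pods as an all-or-nothing unit before the
# workload pods themselves are created.
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: ProvisioningRequest
metadata:
  name: training-job-capacity            # placeholder name
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  podSets:
    - podTemplateRef:
        name: training-worker-template   # a PodTemplate in the same namespace
      count: 16
```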

@jwcesign
Contributor

cc @ellistarn

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2024
@k1rk

k1rk commented Aug 27, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2024
@DmitriGekhtman

DmitriGekhtman commented Oct 4, 2024

Does Karpenter have a plan to support something like this (generally for AI jobs)? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

Plus one on this question. To get correct gang-scheduling behavior with on-demand compute, the underlying cloud provider has to support "all-or-nothing" provisioning of sets of VMs. (You should be able to ask for N VMs in a single request and either get N or 0, nothing in between.) GCP makes this possible with Dynamic Workload Scheduler. I'm not sure what the status is for AWS and other cloud providers.
