
Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano #742

Open
vara-bonthu opened this issue Sep 16, 2022 · 15 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@vara-bonthu

vara-bonthu commented Sep 16, 2022

Tell us about your request

  • Add Karpenter support to work with custom schedulers (e.g., Apache YuniKorn, Volcano)

  • As per my understanding, Karpenter works only with the default scheduler to schedule pods. However, it's prevalent among the Data on Kubernetes community to use custom schedulers like Apache YuniKorn or Volcano for running Spark jobs on Amazon EKS.

  • With the requested feature, Karpenter would effectively be used as the autoscaler for spinning up new nodes, while YuniKorn or Volcano handles the scheduling decisions.

Please correct me and provide some context if this feature is already supported.
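
For context, a custom scheduler is selected per pod via `spec.schedulerName`, so these pods are never bound by kube-scheduler. A minimal illustrative sketch (pod name, image, and resource values are placeholders):

```yaml
# Illustrative only: a pod handled by a custom scheduler instead of kube-scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example   # placeholder name
spec:
  schedulerName: yunikorn        # or "volcano" when using the Volcano scheduler
  containers:
    - name: spark-executor
      image: apache/spark:3.4.0  # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
  restartPolicy: Never
```

While such a pod is Pending, Karpenter would still be expected to provision capacity for it, even though the final placement decision is made by YuniKorn or Volcano.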

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Using Apache YuniKorn/Volcano is becoming a basic requirement for running batch workloads (e.g., Spark) on Kubernetes. These schedulers are more application-aware than the default scheduler and provide a number of other useful features (e.g., resource queues, job sorting) for running multi-tenant data workloads on Kubernetes (Amazon EKS).

At the moment we can only use Cluster Autoscaler with these custom schedulers, but it would be beneficial to add Karpenter support so we can leverage Karpenter's performance advantages over Cluster Autoscaler.
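
To illustrate the resource-queue point above, here is a hypothetical YuniKorn queue configuration sketch (queue names and limits are made up; check the YuniKorn docs for the exact file layout and resource unit conventions of your version):

```yaml
# Hypothetical queues.yaml fragment: per-team queues with guaranteed and
# maximum resources for multi-tenant Spark workloads. Values are illustrative.
partitions:
  - name: default
    queues:
      - name: root
        submitacl: "*"
        queues:
          - name: spark-team-a
            resources:
              guaranteed:
                memory: 100Gi
                vcore: "50"
              max:
                memory: 400Gi
                vcore: "200"
          - name: spark-team-b
            resources:
              max:
                memory: 200Gi
                vcore: "100"
```

The default scheduler has no equivalent of these per-tenant guarantees and limits, which is why teams reach for YuniKorn or Volcano for this kind of workload.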

Are you currently working around this issue?

No, we are using Cluster Autoscaler as an alternative to work with these custom schedulers.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@vara-bonthu vara-bonthu added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 16, 2022
@vara-bonthu vara-bonthu changed the title Custom k8s scheduler support for Karpenter e.g, Apache YuniKorn, Volcano Custom k8s scheduler support for Karpenter e.g., Apache YuniKorn, Volcano Sep 16, 2022
@dewjam
Contributor

dewjam commented Sep 23, 2022

Hey @vara-bonthu,
Thanks for the feature request. This feature is not currently supported and is not yet on our roadmap. It sounds like it would be a pretty significant amount of effort, given that Karpenter would have to adhere to the scheduling decisions of multiple custom schedulers.

While I have not personally tested a custom scheduler with Karpenter, it should at least be able to launch nodes even if a custom scheduler is in use (Karpenter simply watches for pending pods and then spawns nodes accordingly). That said, as you mentioned, Karpenter is built to adhere to the scheduling decisions of kube-scheduler, so it's certainly possible you would run into cases where Karpenter makes incorrect decisions when a custom scheduler is in the mix.

If you have a configuration you could share, it would be fun to do some testing with custom schedulers to see how Karpenter responds.

@ellistarn
Contributor

I'd love to learn how custom schedulers make different decisions than the kube-scheduler. Technically, we're agnostic of the kube-scheduler, but we support the pod spec fields that impact scheduling. Do custom schedulers respect all of those fields?

Could you provide a concrete example workflow of the decisions you'd like to see Karpenter make when working alongside a custom scheduler?
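
For reference, these are the kinds of pod spec fields meant above; an illustrative fragment (labels and values are placeholders) of the scheduling constraints Karpenter reads when computing what node to launch:

```yaml
# Illustrative scheduling constraints on a pod spec. Karpenter uses these to
# decide node shape and placement; the open question is whether custom
# schedulers honor the same fields. All values are placeholders.
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: spark
      operator: Exists
      effect: NoSchedule
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: spark            # placeholder label
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["r5d.4xlarge", "r5d.8xlarge"]
```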

@tgaddair

I've tested Karpenter with Volcano successfully. There was one issue (PR to fix in volcano-sh/volcano#2602) that was causing Volcano to use an unconventional Reason that prevented Karpenter from triggering scale-up, but once this PR lands things should be working again.
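
For anyone debugging a similar setup: the relevant signal is the pod's PodScheduled condition. Karpenter treats a pending pod as provisionable when the scheduler has marked it unschedulable in the conventional way, roughly the shape sketched below (status fields are illustrative); the Volcano PR above aligns Volcano's reported reason with that convention.

```yaml
# Illustrative pod status as reported by the scheduler for an unschedulable pod.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable            # the conventional reason Karpenter looks for
      message: "0/3 nodes are available: 3 Insufficient cpu."
```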

@vara-bonthu
Author

@ellistarn

I think the initial issue I encountered could be due to the older version of Apache YuniKorn (0.12.1), where I installed YuniKorn as a secondary k8s scheduler and did not enable the admission controller. The error could be caused by multiple schedulers running at the same time. Here is the log from those old tests. The good news is that it works with the latest version; please see the details below.

YuniKorn Scheduler Error Summary

ERROR  external/scheduler_cache.go:203 pod updated on a different node than previously added to
ERROR  external/scheduler_cache.go:204 scheduler cache is corrupted and can badly affect scheduling decisions

YuniKorn Scheduler full log

2022-01-31T17:12:08.558Z	INFO	cache/context.go:552	app added	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:08.559Z	INFO	cache/context.go:612	task added	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "taskID": "9d00be6e-8c41-4c35-a089-5b6932e339ac", "taskState": "New"}
2022-01-31T17:12:09.253Z	INFO	cache/application.go:436	handle app submission	{"app": "applicationID: spark-79df83d15b6843b7bb1cac31e7135e9c, queue: root.spark, partition: default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"}
2022-01-31T17:12:09.254Z	INFO	placement/tag_rule.go:114	Tag rule application placed	{"application": "spark-79df83d15b6843b7bb1cac31e7135e9c", "queue": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z	INFO	objects/queue.go:150	dynamic queue added to scheduler	{"queueName": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z	INFO	scheduler/context.go:495	Added application to partition	{"applicationID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "partitionName": "[mycluster]default", "requested queue": "root.spark", "placed queue": "root.spark-k8s-data-team-a"}
2022-01-31T17:12:09.254Z	INFO	callback/scheduler_callback.go:108	Accepting app	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:10.254Z	INFO	cache/application.go:531	Skip the reservation stage	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c"}
2022-01-31T17:12:11.256Z	INFO	objects/application_state.go:128	Application state transition	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "source": "New", "destination": "Accepted", "event": "runApplication"}
2022-01-31T17:12:11.256Z	INFO	objects/application.go:531	Ask added successfully to application	{"appID": "spark-79df83d15b6843b7bb1cac31e7135e9c", "ask": "9d00be6e-8c41-4c35-a089-5b6932e339ac", "placeholder": false, "pendingDelta": "map[memory:12885 vcore:4000]"}
2022-01-31T17:12:15.360Z	INFO	cache/nodes.go:112	adding node to context	{"nodeName": "ip-10-1-10-119.eu-west-1.compute.internal", "nodeLabels": "{\"karpenter.sh/capacity-type\":\"spot\",\"karpenter.sh/provisioner-name\":\"default\",\"node.kubernetes.io/instance-type\":\"m5.4xlarge\",\"topology.kubernetes.io/zone\":\"eu-west-1a\"}", "schedulable": true}
2022-01-31T17:12:15.361Z	INFO	cache/node.go:148	node recovering	{"nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "schedulable": true}
2022-01-31T17:12:15.361Z	INFO	scheduler/partition.go:548	adding node to partition	{"partition": "[mycluster]default", "nodeID": "ip-10-1-10-119.eu-west-1.compute.internal"}
2022-01-31T17:12:15.362Z	INFO	scheduler/partition.go:613	Updated available resources from added node	{"partitionName": "[mycluster]default", "nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "partitionResource": "map[attachable-volumes-aws-ebs:50 ephemeral-storage:142784976248 hugepages-1Gi:0 hugepages-2Mi:0 memory:86159 pods:321 vcore:21850]"}
2022-01-31T17:12:15.363Z	INFO	scheduler/context.go:592	successfully added node	{"nodeID": "ip-10-1-10-119.eu-west-1.compute.internal", "partition": "[mycluster]default"}
2022-01-31T17:12:15.375Z	ERROR	external/scheduler_cache.go:203	pod updated on a different node than previously added to	{"pod": "9d00be6e-8c41-4c35-a089-5b6932e339ac"}
github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external.(*SchedulerCache).UpdatePod
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external/scheduler_cache.go:203
github.com/apache/incubator-yunikorn-k8shim/pkg/cache.(*Context).updatePodInCache
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/context.go:253
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:273
k8s.io/client-go/tools/cache.(*processorListener).run.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
2022-01-31T17:12:15.376Z	ERROR	external/scheduler_cache.go:204	scheduler cache is corrupted and can badly affect scheduling decisions
github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external.(*SchedulerCache).UpdatePod
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/external/scheduler_cache.go:204
github.com/apache/incubator-yunikorn-k8shim/pkg/cache.(*Context).updatePodInCache
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/pkg/cache/context.go:253
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:238
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/controller.go:273
k8s.io/client-go/tools/cache.(*processorListener).run.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
	/Users/cyu/go/src/github.com/apache/incubator-yunikorn-k8shim/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
2022-01-31T17:12:38.125Z	INFO	configs/configwatcher.go:143	config watcher timed out

However, in my recent tests with the following setup, I can confirm that Karpenter works well with the Apache YuniKorn scheduler.

Apache YuniKorn Version: 1.1.0, deployed as the default scheduler (this overrides the k8s default scheduler)
Karpenter Version: v0.20.0

EMR on EKS Spark jobs are working as expected with Apache YuniKorn gang scheduling and Karpenter autoscaling.

If you are interested in how to set up Karpenter with Apache YuniKorn (with gang scheduling), you can refer to the Data on EKS docs (https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/emr-eks-karpenter).
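
For readers setting this up, the gang-scheduling side is driven by annotations on the driver pod, roughly as sketched below (task-group names, sizes, and resources are placeholders; see the YuniKorn and Data on EKS docs for the exact values):

```yaml
# Illustrative YuniKorn gang-scheduling annotations on the Spark driver pod.
# YuniKorn creates placeholder pods for the declared task groups; those
# pending placeholders are what trigger Karpenter to launch nodes up front.
metadata:
  annotations:
    yunikorn.apache.org/task-group-name: "spark-driver"
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
    yunikorn.apache.org/task-groups: |-
      [{
        "name": "spark-driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "4Gi"}
      }, {
        "name": "spark-executor",
        "minMember": 20,
        "minResource": {"cpu": "4", "memory": "12Gi"}
      }]
spec:
  schedulerName: yunikorn
```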

EMR on EKS Spark job with 1 driver and 20 executors, scheduled by Apache YuniKorn

Karpenter logs

2022-12-22T17:50:54.846Z	DEBUG	controller.aws	deleted launch template	{"commit": "f60dacd"}
2022-12-22T17:51:25.330Z	INFO	controller.provisioner	found provisionable pod(s)	{"commit": "f60dacd", "pods": 21}
2022-12-22T17:51:25.331Z	INFO	controller.provisioner	computed new node(s) to fit pod(s)	{"commit": "f60dacd", "nodes": 2, "pods": 21}
2022-12-22T17:51:25.331Z	INFO	controller.provisioner	launching node with 6 pods requesting {"cpu":"7355m","memory":"92280Mi","pods":"9"} from types r5d.4xlarge, r5d.8xlarge	{"commit": "f60dacd", "provisioner": "spark-memory-optimized"}
2022-12-22T17:51:25.351Z	INFO	controller.provisioner	launching node with 15 pods requesting {"cpu":"18155m","memory":"230520Mi","pods":"18"} from types r5d.8xlarge	{"commit": "f60dacd", "provisioner": "spark-memory-optimized"}
2022-12-22T17:51:25.773Z	DEBUG	controller.provisioner.cloudprovider	created launch template	{"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launch-template-name": "Karpenter-emr-eks-karpenter-2497088801825500229", "launch-template-id": "lt-030a5b323b302d61a"}
2022-12-22T17:51:27.766Z	INFO	controller.provisioner.cloudprovider	launched new instance	{"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launched-instance": "i-0ef04f248f444ec79", "hostname": "ip-10-1-126-194.us-west-2.compute.internal", "type": "r5d.4xlarge", "zone": "us-west-2b", "capacity-type": "spot"}
2022-12-22T17:51:29.941Z	INFO	controller.provisioner.cloudprovider	launched new instance	{"commit": "f60dacd", "provisioner": "spark-memory-optimized", "launched-instance": "i-0d948a89719abfacf", "hostname": "ip-10-1-67-206.us-west-2.compute.internal", "type": "r5d.8xlarge", "zone": "us-west-2b", "capacity-type": "spot"}
2022-12-22T17:59:04.511Z	INFO	controller.node	added TTL to empty node	{"commit": "f60dacd", "node": "ip-10-1-67-206.us-west-2.compute.internal"}

The Apache YuniKorn placeholder (gang scheduling) pods triggered Karpenter to launch the nodes required to run the Spark job.


Happy to close this issue.

@anovv

anovv commented Jun 10, 2023

I've tested Karpenter with Volcano successfully. There was one issue (PR to fix in volcano-sh/volcano#2602) that was causing Volcano to use an unconventional Reason that prevented Karpenter from triggering scale-up, but once this PR lands things should be working again.

@tgaddair Do you mind sharing your experience using Volcano with cluster autoscalers? We are building a platform supporting multi-tenant jobs and planning to use Volcano for gang scheduling, but there is very limited information on best practices for scaling clusters while supporting gang scheduling, even in the Volcano docs. What made you choose Karpenter over cluster-autoscaler? If you have been successfully running a Volcano + Karpenter setup, what has your experience been? This topic could be a separate post somewhere; it would be pretty valuable.
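
For context, the gang-scheduling primitive in question on the Volcano side is the PodGroup; a rough sketch (names and sizes are made up). The member pods reference the group and stay Pending until minMember of them can be placed together, and it is those pending pods that Karpenter or Cluster Autoscaler reacts to:

```yaml
# Illustrative Volcano PodGroup for a Spark-style job: all member pods stay
# Pending until at least minMember can be scheduled at the same time.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-job-example        # placeholder name
spec:
  minMember: 21                  # e.g. 1 driver + 20 executors
  minResources:
    cpu: "85"                    # placeholder aggregate request
    memory: 260Gi
  queue: default
```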

@ghost

ghost commented Aug 17, 2023

@tgaddair please share your Volcano + Karpenter setup when you get a chance. It would be very valuable for others exploring the two and would save many people a lot of effort.

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@k1rk

k1rk commented Feb 8, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 8, 2024
@k1rk

k1rk commented May 13, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 13, 2024
@jwcesign
Contributor

Does Karpenter have a plan to support something like this (generally for AI jobs)?
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md
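
For concreteness, the linked proposal introduces a ProvisioningRequest object along these lines; a hedged sketch based on that proposal (field values are illustrative, and the API group/version may differ between releases):

```yaml
# Illustrative ProvisioningRequest from the cluster-autoscaler proposal:
# request capacity for a set of pods as an all-or-nothing unit before the
# workload pods themselves are created.
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: ProvisioningRequest
metadata:
  name: training-job-capacity            # placeholder name
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  podSets:
    - podTemplateRef:
        name: training-worker-template   # a PodTemplate in the same namespace
      count: 16
```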

@jwcesign
Contributor

cc @ellistarn

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2024
@k1rk

k1rk commented Aug 27, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2024
@DmitriGekhtman

DmitriGekhtman commented Oct 4, 2024

Does Karpenter have a plan to support something like this (generally for AI jobs)? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

Plus one on this question. To get correct gang-scheduling behavior with on-demand compute, the underlying cloud provider has to support "all-or-nothing" provisioning of sets of VMs. (You should be able to ask for N VMs in a single request and either get N or 0, nothing in between.) GCP makes this possible with Dynamic Workload Scheduler. I'm not sure what the status is for AWS and other cloud providers.
