Support gang scheduling with Yunikorn #2107
Still working on this, just sorting out some issues with converting the Java byte string to Kubernetes resource values. I have re-read the upstream code and have a better approach in mind.
Stacked PR also available to allow the default batch scheduler to be set if not specified by the user: jacobsalway#1
Thanks for your contributions, I have left some comments.
internal/scheduler/yunikorn/maps.go
Outdated
func mergeMaps(m1, m2 map[string]string) map[string]string {
	out := make(map[string]string)

	maps.Copy(out, m1)
	maps.Copy(out, m2)

	// Return nil if there are no entries in the map so that the field is skipped during JSON marshalling
	if len(out) == 0 {
		return nil
	}
	return out
}
Since the mergeMaps func is a util function, maybe we can move it to pkg/util/util.go.
I've moved the function into scheduler.go and renamed it to mergeNodeSelector because I think that better describes what it does. Since it returns nil if the map is empty to make the marshalled result a bit nicer, I'm not sure if that would be expected behaviour for a global utils function.
func NumInitialExecutors(app *v1beta2.SparkApplication) int32 {
	initialExecutors := int32(0)

	// Take the max of these three fields while guarding against nil pointers
	if app.Spec.Executor.Instances != nil {
		initialExecutors = max(initialExecutors, *app.Spec.Executor.Instances)
	}
	if app.Spec.DynamicAllocation != nil {
		if app.Spec.DynamicAllocation.MinExecutors != nil {
			initialExecutors = max(initialExecutors, *app.Spec.DynamicAllocation.MinExecutors)
		}
		if app.Spec.DynamicAllocation.InitialExecutors != nil {
			initialExecutors = max(initialExecutors, *app.Spec.DynamicAllocation.InitialExecutors)
		}
	}

	return initialExecutors
}
The logic of calculating initial executors is not the same as spark core. Ref: https://github.com/apache/spark/blob/899fad4710bef174684deee64314ac483c16c494/core/src/main/scala/org/apache/spark/scheduler/cluster/SchedulerBackendUtils.scala#L23-L47.
Maybe we can move this func NumInitialExecutors to pkg/util/sparkapplication.go, as this is a util function for SparkApplication.
Ah nice, good catch. I think I'd only been looking at https://github.com/apache/spark/blob/899fad4710bef174684deee64314ac483c16c494/core/src/main/scala/org/apache/spark/util/Utils.scala#L2534-L2557 when I wrote this implementation.
Will fix and move to utils tomorrow
Fixed to match the core implementation and moved to pkg/util/sparkapplication.go.
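For readers following the thread, here is a rough sketch of the logic in the referenced Spark code, as I understand it: with dynamic allocation enabled, the initial count is the maximum of the initial executors, the minimum executors, and the executor instances; otherwise it is the executor instances, falling back to a default of 2. The `appSpec` and `dynamicAllocation` types are hypothetical stand-ins rather than the operator's `v1beta2` API, and the bounds check Spark performs against the configured minimum and maximum is left out.

```go
package main

import "fmt"

// dynamicAllocation and appSpec are hypothetical stand-ins for the
// relevant parts of a SparkApplication spec, used only for illustration.
type dynamicAllocation struct {
	Enabled          bool
	InitialExecutors *int32
	MinExecutors     *int32
}

type appSpec struct {
	ExecutorInstances *int32
	DynamicAllocation *dynamicAllocation
}

// defaultNumberExecutors mirrors the default of 2 used by Spark when
// executor instances are not set (assumed from SchedulerBackendUtils).
const defaultNumberExecutors int32 = 2

// numInitialExecutors sketches the selection described above.
func numInitialExecutors(spec appSpec) int32 {
	da := spec.DynamicAllocation
	if da != nil && da.Enabled {
		// Dynamic allocation: max of the three settings, treating unset as 0.
		n := int32(0)
		for _, p := range []*int32{da.InitialExecutors, da.MinExecutors, spec.ExecutorInstances} {
			if p != nil {
				n = max(n, *p)
			}
		}
		return n
	}
	// Static allocation: executor instances, or the default when unset.
	if spec.ExecutorInstances != nil {
		return *spec.ExecutorInstances
	}
	return defaultNumberExecutors
}

func main() {
	ptr := func(v int32) *int32 { return &v }

	fmt.Println(numInitialExecutors(appSpec{}))                          // 2
	fmt.Println(numInitialExecutors(appSpec{ExecutorInstances: ptr(5)})) // 5
	fmt.Println(numInitialExecutors(appSpec{
		ExecutorInstances: ptr(5),
		DynamicAllocation: &dynamicAllocation{Enabled: true, MinExecutors: ptr(3), InitialExecutors: ptr(8)},
	})) // 8
}
```

The difference from the earlier snippet is that the dynamic allocation fields only count when dynamic allocation is enabled, and an unset instance count falls back to a default instead of zero.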
@@ -0,0 +1,148 @@
package yunikorn
We can add a license comment to every new file, like
spark-operator/internal/controller/sparkapplication/controller.go
Lines 1 to 15 in 5972482
/*
Copyright 2024 The Kubeflow authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
I'll add the license comment to the top of all my new files, but what do you think about adding a license comment check as part of CI in a separate PR, and maybe a Makefile target for checking and adding it locally?
I think it is a good idea to add a license check as part of CI. cc @andreyvelich @vara-bonthu @yuchaoran2011.
Created an issue for it #2139
	driverTaskGroupName   = "spark-driver"
	executorTaskGroupName = "spark-executor"
I see that in Volcano scheduler, the PodGroup names are different between SparkApplications. So I am wondering whether we need to set distinct task group names for different SparkApplications?
For Yunikorn, a task group is a property of an application, so the task group names only need to be unique within the application they're a part of.
Application IDs must be unique in the same way that a Volcano PodGroup name has to be; however, there's a nice integration in Yunikorn to use the spark-app-selector label for the application ID.
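To make the point about task group names concrete, here is a minimal sketch of how the two fixed names could be serialized into driver pod annotations. It is an illustration under assumptions, not the PR's implementation: the `taskGroup` struct, the `driverAnnotations` helper, and the resource maps are hypothetical, and the annotation keys are the ones I understand YuniKorn's gang scheduling documentation to use, so they are worth verifying against that documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const (
	driverTaskGroupName   = "spark-driver"
	executorTaskGroupName = "spark-executor"

	// Assumed annotation keys; check them against the YuniKorn docs.
	taskGroupNameAnnotation = "yunikorn.apache.org/task-group-name"
	taskGroupsAnnotation    = "yunikorn.apache.org/task-groups"
)

// taskGroup is a hypothetical, trimmed-down task group definition.
type taskGroup struct {
	Name        string            `json:"name"`
	MinMember   int32             `json:"minMember"`
	MinResource map[string]string `json:"minResource,omitempty"`
}

// driverAnnotations builds the annotations for one application's driver pod.
// The same two group names can be reused by every application because they
// are scoped to the application they belong to.
func driverAnnotations(numExecutors int32, driverRes, executorRes map[string]string) (map[string]string, error) {
	groups := []taskGroup{
		{Name: driverTaskGroupName, MinMember: 1, MinResource: driverRes},
		{Name: executorTaskGroupName, MinMember: numExecutors, MinResource: executorRes},
	}
	raw, err := json.Marshal(groups)
	if err != nil {
		return nil, err
	}
	return map[string]string{
		taskGroupNameAnnotation: driverTaskGroupName,
		taskGroupsAnnotation:    string(raw),
	}, nil
}

func main() {
	ann, err := driverAnnotations(
		2,
		map[string]string{"cpu": "1", "memory": "2Gi"},
		map[string]string{"cpu": "2", "memory": "4Gi"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann[taskGroupNameAnnotation])
	fmt.Println(ann[taskGroupsAnnotation])
}
```

Uniqueness across applications then comes from the application ID rather than from the task group names.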
/lgtm
Great to see that all the edge cases including min overhead memory are covered.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ChenYi015, yuchaoran2011.
* Add Yunikorn scheduler and example
* Add test cases
* Add code comments
* Add license comment
* Inline mergeNodeSelector
* Fix initial number implementation
Signed-off-by: Jacob Salway
(cherry picked from commit 8fcda12)
Purpose
Add a scheduler implementation for YuniKorn to support task group annotations on the driver pod for gang scheduling and queue labels on both driver and executor pods.
Completes the first dot point of #2098. Will add docs for the Spark operator in a separate PR, but will add docs to YuniKorn as well.
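As a rough illustration of the queue part of this description, the sketch below adds a queue label to a pod's labels only when a queue has been configured. The `withQueueLabel` helper is hypothetical, and the label key `queue` is an assumption based on YuniKorn's documented pod labels; it is not taken from the diff itself.

```go
package main

import "fmt"

// queueLabel is the assumed YuniKorn queue label key.
const queueLabel = "queue"

// withQueueLabel returns a copy of labels with the queue label set when a
// non-empty queue is given. A nil queue leaves placement to YuniKorn's
// placement rules, matching the optional nature of the field.
func withQueueLabel(labels map[string]string, queue *string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	if queue != nil && *queue != "" {
		out[queueLabel] = *queue
	}
	return out
}

func main() {
	queue := "root.default"
	base := map[string]string{"app": "spark-pi"}

	fmt.Println(withQueueLabel(base, &queue)[queueLabel]) // prints "root.default"

	_, set := withQueueLabel(base, nil)[queueLabel]
	fmt.Println(set) // prints "false"
}
```

The same helper would apply to both driver and executor labels, since the description says the queue label goes on both kinds of pod.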
Changes
yunikorn
batchSchedulerOptions.queue to allow queue configuration. This is not a required field, as Yunikorn supports placement rules that allow an app to be placed without explicitly specifying a queue.
Testing
recording.mp4
Change Category
Indicate the type of change by marking the applicable boxes:
[ ] Documentation update (will do in a separate PR to avoid this one getting any larger)
Checklist
Before submitting your PR, please review the following:
[ ] I have updated documentation accordingly (will do in a separate PR to avoid this one getting any larger).