From b2672bc1cb610f34dd7eb6256bc3219a92b7d687 Mon Sep 17 00:00:00 2001
From: Jan Chaloupka
Date: Mon, 27 Jan 2020 12:30:00 +0100
Subject: [PATCH] Summarize gang scheduling implementation and describe how to deploy it in OpenShift

---
 scheduling/gang-scheduling.md | 225 ++++++++++++++++++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 scheduling/gang-scheduling.md

diff --git a/scheduling/gang-scheduling.md b/scheduling/gang-scheduling.md
new file mode 100644
index 00000000000..faf00b0ee44
--- /dev/null
+++ b/scheduling/gang-scheduling.md
@@ -0,0 +1,225 @@
+---
+title: gang-scheduling-in-kubernetes
+authors:
+  - "@ingvagabund"
+reviewers:
+  - "@damemi"
+approvers:
+  - "@soltysh"
+creation-date: 2020-02-27
+last-updated: 2020-02-27
+status: provisional
+see-also:
+replaces:
+superseded-by:
+---
+
+# Gang scheduling in Kubernetes
+
+Gang scheduling (or coscheduling) is described in the [coscheduling KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/34-20180703-coscheduling.md).
+Implementations of gang scheduling can be found in two repositories:
+- https://github.com/kubernetes-sigs/kube-batch/ (implementing multi-tenant batch scheduling and resource sharing between prioritized queues)
+- https://github.com/volcano-sh/volcano (extends and builds on top of the batch scheduling framework from kube-batch)
+
+Both repositories apply the [multi-level queue scheduling](https://en.wikipedia.org/wiki/Multilevel_queue) principle.
+The queue management implementation is described in more detail in [this doc](https://github.com/volcano-sh/volcano/blob/master/docs/design/queue/queue-state-management.md).
+
+## Goals
+
+1. Describe the current state of gang scheduling in Kubernetes.
+2. Collect knowledge about its design, how it works, and how to deploy it in OpenShift.
+3. Once we decide to integrate the feature into our portfolio, we won't need to revisit
+and re-collect the same knowledge.
+
+## Summary
+
+- The kube-batch scheduling framework is extensible through plugins and actions.
+- Plugins make it possible to define various conditions (e.g. when a pod is considered evictable, when a queue is overused), priorities (e.g. defining the order in which jobs/queues are processed), node scoring and other concepts. They provide the building blocks for actions, which implement the scheduling decision logic.
+- The framework provides default plugins such as `DRF` (focusing on fair job scoring), `Gang` (enforcing gang scheduling principles) or `Predicates`/`Priorities` (exposing predicates/priorities from the kube scheduler). Default actions include `Allocate` (scheduling a group of pods as a single unit) and
+`Preempt`/`Reclaim` (preempting a group of pods with respect to multi-level queues and priority classes).
+- All plugins and actions live under tiers which dictate in which order and how they are processed.
+- The default plugins, actions and tiers are described in more detail [here](https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/framework.md).
+- Other documents describing design elements are available [here](https://github.com/kubernetes-sigs/kube-batch/tree/master/doc/design) and [here](https://github.com/volcano-sh/volcano/tree/master/docs/design).
+- Any consumer of the framework can write its own plugins and actions and thus extend/change the scheduling
+decision making.
+
+There's also an effort to integrate some functionality of the kube-batch framework with the [scheduling framework](https://kubernetes.io/docs/concepts/configuration/scheduling-framework/)
+provided by the default kube-scheduler.
+
+
+To learn more about the individual plugins and actions, check the code base under https://github.com/kubernetes-sigs/kube-batch/.
+
+## Deploy and test
+
+I am considering Volcano as the reference implementation of gang scheduling for now.
+The repository seems more active and alive than kube-batch.
+
+The Volcano stack consists of (among other bits):
+- Queue controller - managing the lifecycle of queues (wrt. PodGroups/Jobs)
+- Job controller - managing tasks of jobs and turning them into pods
+- Admission controller - checking availability of a queue when creating a PodGroup/Job (through webhooks)
+- CRDs for queues, jobs, etc.
+
+To deploy the stack, edit and apply https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml:
+- change port 443 to 6443
+- extend the `jobs` RBAC rule of the `volcano-controllers` cluster role with `jobs/finalizers`:
+  ```
+  - apiGroups:
+    - batch.volcano.sh
+    resources:
+    - jobs
+    - jobs/finalizers
+    verbs:
+    - get
+    - list
+    - watch
+    - update
+    - delete
+  ```
+
+Also notice the scheduler configuration:
+```
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: volcano-scheduler-configmap
+  namespace: volcano-system
+data:
+  volcano-scheduler.conf: |
+    actions: "enqueue, allocate, backfill"
+    tiers:
+    - plugins:
+      - name: priority
+      - name: gang
+      - name: conformance
+    - plugins:
+      - name: drf
+      - name: predicates
+      - name: proportion
+      - name: nodeorder
+      - name: binpack
+```
+
+The configuration enables three actions and two tiers of plugins.
+With the `gang` plugin enabled, the scheduler requires a minimum number of replicas
+to be schedulable for each job before the job's state can be set to `Running`.
+
+### Example job
+
+Each job (`jobs.batch.volcano.sh` CRD) has a list of task categories, each with its own pod template and number of replicas.
+
+You can use the following CR to see how the scheduler behaves (manifest borrowed from https://github.com/volcano-sh/volcano/blob/master/example/job.yaml):
+
+```
+apiVersion: batch.volcano.sh/v1alpha1
+kind: Job
+metadata:
+  name: test-job
+spec:
+  minAvailable: 3
+  schedulerName: volcano
+  policies:
+    - event: PodEvicted
+      action: RestartJob
+  maxRetry: 5
+  queue: default
+  tasks:
+    - replicas: 6
+      name: "default-nginx"
+      template:
+        metadata:
+          name: web
+        spec:
+          containers:
+            - image: nginx
+              imagePullPolicy: IfNotPresent
+              name: nginx
+              resources:
+                requests:
+                  cpu: "1"
+          restartPolicy: OnFailure
+```
+
+Notice that the cpu request is set to `1` and the minimum number of replicas (`minAvailable`) is `3`.
+If your cluster does not have enough cpu to schedule at least 3 replicas,
+the job will not change its state to `Running` (due to the gang scheduling minimum replicas constraint).
+
+With sufficient cpu (e.g. after lowering the cpu request to `150m`) you get:
+
+```
+$ oc get pods
+NAME                       READY   STATUS     RESTARTS   AGE
+test-job-default-nginx-0   0/1     OutOfcpu   0          6m3s
+test-job-default-nginx-1   0/1     OutOfcpu   0          6m4s
+test-job-default-nginx-2   1/1     Running    0          6m4s
+test-job-default-nginx-3   1/1     Running    0          6m4s
+test-job-default-nginx-4   0/1     OutOfcpu   0          6m3s
+test-job-default-nginx-5   1/1     Running    0          6m4s
+```
+
+With insufficient cpu none of the pods gets to run:
+
+```
+$ oc get pods
+NAME                       READY   STATUS    RESTARTS   AGE
+test-job-default-nginx-0   0/1     Pending   0          9m38s
+test-job-default-nginx-1   0/1     Pending   0          9m38s
+test-job-default-nginx-2   0/1     Pending   0          9m38s
+test-job-default-nginx-3   0/1     Pending   0          9m38s
+test-job-default-nginx-4   0/1     Pending   0          9m38s
+test-job-default-nginx-5   0/1     Pending   0          9m38s
+```
+
+### Scheduling cycle logs snippet
+
+```
+I0124 13:06:29.087366 1 cache.go:775] There are <1> Jobs, <1> Queues and <6> Nodes in total for scheduling.
+I0124 13:06:29.087401 1 session.go:135] Open Session 55ab3cf8-3eaa-11ea-a61c-0a580a81020a with <1> Job and <1> Queues
+I0124 13:06:29.088161 1 enqueue.go:55] Enter Enqueue ...
+I0124 13:06:29.088181 1 enqueue.go:70] Added Queue for Job
+I0124 13:06:29.088196 1 enqueue.go:87] Try to enqueue PodGroup to 0 Queues
+I0124 13:06:29.088226 1 enqueue.go:134] Leaving Enqueue ...
+I0124 13:06:29.088243 1 allocate.go:43] Enter Allocate ...
+I0124 13:06:29.088261 1 allocate.go:94] Try to allocate resource to 1 Namespaces
+I0124 13:06:29.088277 1 allocate.go:147] Try to allocate resource to Jobs in Namespace Queue
+I0124 13:06:29.088306 1 allocate.go:172] Try to allocate resource to 6 tasks of Job
+I0124 13:06:29.088326 1 allocate.go:180] There are <6> nodes for Job
+I0124 13:06:29.088373 1 scheduler_helper.go:87] Considering Task on node : vs.
+I0124 13:06:29.088481 1 scheduler_helper.go:92] Predicates failed for task on node : task default/test-job-default-nginx-2 on node ip-10-0-133-65.ec2.internal fit failed: node(s) had taints that the pod didn't tolerate
+I0124 13:06:29.088540 1 scheduler_helper.go:87] Considering Task on node : vs.
+I0124 13:06:29.088566 1 scheduler_helper.go:92] Predicates failed for task on node : task default/test-job-default-nginx-2 on node ip-10-0-136-163.ec2.internal fit failed: node(s) resource fit failed
+I0124 13:06:29.088579 1 scheduler_helper.go:87] Considering Task on node : vs.
+I0124 13:06:29.088601 1 scheduler_helper.go:92] Predicates failed for task on node : task default/test-job-default-nginx-2 on node ip-10-0-146-126.ec2.internal fit failed: node(s) resource fit failed
+I0124 13:06:29.088613 1 scheduler_helper.go:87] Considering Task on node : vs.
+I0124 13:06:29.088638 1 scheduler_helper.go:92] Predicates failed for task on node : task default/test-job-default-nginx-2 on node ip-10-0-154-53.ec2.internal fit failed: node(s) had taints that the pod didn't tolerate
+I0124 13:06:29.088650 1 scheduler_helper.go:87] Considering Task on node : vs.
+I0124 13:06:29.088673 1 scheduler_helper.go:92] Predicates failed for task on node : task default/test-job-default-nginx-2 on node ip-10-0-163-197.ec2.internal fit failed: node(s) had taints that the pod didn't tolerate
+I0124 13:06:29.088685 1 scheduler_helper.go:87] Considering Task on node : vs.
+I0124 13:06:29.088706 1 scheduler_helper.go:92] Predicates failed for task on node : task default/test-job-default-nginx-2 on node ip-10-0-167-126.ec2.internal fit failed: node(s) resource fit failed
+I0124 13:06:29.088735 1 statement.go:312] Discarding operations ...
+I0124 13:06:29.088753 1 allocate.go:147] Try to allocate resource to Jobs in Namespace Queue
+I0124 13:06:29.088777 1 allocate.go:241] Leaving Allocate ...
+I0124 13:06:29.088792 1 backfill.go:42] Enter Backfill ...
+I0124 13:06:29.088805 1 backfill.go:91] Leaving Backfill ...
+I0124 13:06:29.100766 1 session.go:154] Close Session 55ab3cf8-3eaa-11ea-a61c-0a580a81020a
+```
+
+Posting the entire snippet here so you can see how the scheduler actually works.
+Every time a scheduling cycle occurs, a session is opened (reading plugins and actions).
+Then, based on the specified configuration, individual actions are triggered,
+starting with the `Enqueue` action, followed by `Allocate` and `Backfill`.
+The `Allocate` action is responsible for scheduling a job's tasks. As you can see, the
+`default/test-job-default-nginx-2` pod can't be scheduled due to insufficient resources
+or untolerated taints.
+
+## What's next
+
+Volcano does not necessarily focus only on batch scheduling. It also wants to
+[incorporate various topologies](https://github.com/volcano-sh/volcano/blob/master/docs/community/roadmap.md) (e.g.
GPU) to improve allocation of resources.
+
+It's also likely upstream will want to consume the gang scheduling feature as a [plugin](https://github.com/hex108/coscheduling-plugin)
+in the scheduling framework instead of utilizing the entire kube-batch code base.
+
+Communities discussing gang scheduling:
+- sig-scheduling
+- wg-machine-learning
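
Note for reviewers (not part of the patch): the all-or-nothing behaviour described under "Example job" and visible in the `Discarding operations ...` log line can be pictured with a small Python sketch. This is a simplified illustration, not Volcano's code: the function name `gang_allocate`, the first-fit node choice and the node names are all assumptions made up for the example, and only cpu is considered. Placements are made tentatively and are kept only if at least `min_available` pods fit.

```python
from typing import Dict, List, Optional


def gang_allocate(
    free_cpu_millis: Dict[str, int],
    replicas: int,
    cpu_request_millis: int,
    min_available: int,
) -> Optional[List[str]]:
    """Tentatively place identical pods; keep the result only if the gang minimum is met.

    Returns the node chosen for each placed pod, or None when fewer than
    `min_available` pods fit and the tentative placements are discarded.
    """
    # Work on a copy so that a discard leaves the caller's view untouched.
    remaining = dict(free_cpu_millis)
    placements: List[str] = []
    for _ in range(replicas):
        # First-fit node choice (an assumption; the real scheduler scores nodes).
        node = next(
            (name for name, free in remaining.items() if free >= cpu_request_millis),
            None,
        )
        if node is None:
            break  # no node can host another replica
        remaining[node] -= cpu_request_millis
        placements.append(node)
    if len(placements) < min_available:
        return None  # gang constraint not met: nothing is bound
    return placements


# Hypothetical two-node cluster with free cpu in millicores.
nodes = {"node-a": 1500, "node-b": 500}
print(gang_allocate(nodes, replicas=6, cpu_request_millis=1000, min_available=3))
print(gang_allocate(nodes, replicas=6, cpu_request_millis=400, min_available=3))
```

With a request of `1000m` only one replica fits, which is below the minimum of 3, so the sketch discards everything, matching the all-`Pending` listing. With `400m` four replicas fit, so the gang minimum is met and those placements are kept.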