From 581a08b75dcac3a9cdeb3257470c22e01d35d71f Mon Sep 17 00:00:00 2001
From: "Da K. Ma"
Date: Thu, 3 Jan 2019 18:13:58 +0800
Subject: [PATCH] Removed old design doc to avoid confusion.

Signed-off-by: Da K. Ma
---
 doc/design/preemption.md   | 237 -------------------------------------
 doc/design/queue_api.md    | 185 -----------------------------
 doc/design/queuejob_api.md |  57 ---------
 3 files changed, 479 deletions(-)
 delete mode 100644 doc/design/preemption.md
 delete mode 100644 doc/design/queue_api.md
 delete mode 100644 doc/design/queuejob_api.md

diff --git a/doc/design/preemption.md b/doc/design/preemption.md
deleted file mode 100644
index 8e24d74e7..000000000
--- a/doc/design/preemption.md
+++ /dev/null
@@ -1,237 +0,0 @@
-# Preemption design behaviour
-
-@jinzhejz, 12/13/2017
-
-## Overview
-The document shows detail behaviour of preemption.
-
-## API
-```go
-type Interface interface {
-    Run(stopCh <-chan struct{})
-
-    Preprocessing(queues map[string]*schedulercache.QueueInfo, pods []*schedulercache.PodInfo) (map[string]*schedulercache.QueueInfo, error)
-
-    PreemptResources(queues map[string]*schedulercache.QueueInfo) error
-}
-```
-
-A preemptor providers three interfaces, `Run()`,`Preprocessing()` and `PreemptResources()`
-
-* `Run()` to start informer for preemption.
-* `Preprocessing()` to preprocess queues, make sure `Allocated >= Used` in each queue.
-* `PreemptResources()` to preempt resources between different queues.
-
-### Preprocess stage
-Currently, this stage terminates pods of the queue which `Allocated < Used` to avoid overuse(make queue `Allocated >= Used`), preemption will not be triggered between queues.
-
-```
-For examples:
---------------------------------------------------------------------------
-| Queue-1                                                                |
-| Weight: 2                                                              |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 3                 cpu: 5              pod-1: cpu=2 memory=1Gi   |
-|   memory: 9Gi            memory: 3Gi         pod-2: cpu=2 memory=1Gi   |
-|                                              pod-3: cpu=1 memory=1Gi   |
---------------------------------------------------------------------------
-
---------------------------------------------------------------------------
-| Queue-2                                                                |
-| Weight: 4                                                              |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 6                 cpu: 5              pod-1: cpu=2 memory=1Gi   |
-|   memory: 18Gi           memory: 3Gi         pod-2: cpu=2 memory=1Gi   |
-|                                              pod-3: cpu=1 memory=1Gi   |
---------------------------------------------------------------------------
-
-There are 9 CPUs totally. 3 CPUs are allocated to Queue-1 and 6 CPUs are allocated to Queue-2.
-However, Queue-1 is overused due to some race condition. So pod-1 will be chosen to terminate. The pod is randomly selected currently.
---------------------------------------------------------------------------
-| Queue-1                                                                |
-| Weight: 2                                                              |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 3                 cpu: 3              pod-2: cpu=2 memory=1Gi   |
-|   memory: 9Gi            memory: 2Gi         pod-3: cpu=1 memory=1Gi   |
---------------------------------------------------------------------------
-```
-
-### Preemption stage
-Preempt resources between queues. This stage will divide all queues into three categories:
-
-* Case-01 : `Deserved < Allocated`
-* Case-02 : `Deserved = Allocated`
-* Case-03 : `Deserved > Allocated`
-
-The queues in case-01 will be preempted resources to other queues. It contains two subcases:
-
-* Case-01.1 : `Used <= Deserved`. The resources `Allocated - Deserved` are not used by pods, and it can be assigned to the queues in case-03 directly. No pods will be terminated and `Allocated` will be changed to `Deserved` for the queue directly.
-![workflow](../images/preemption-01.1.jpg)
-
-* Case-01.2 : `Used > Deserved`. The resources `Allocated - Used` are not used by pods, and it can be assigned to the queues in case-03 directly. The resources `Used - Deserved` is used by running pods. This will trigger terminate pod to preempt resource. `Allocated` will be changed to `Deserved` directly for the queue first to avoid more pods coming and then some pods will be chosen randomly to kill to release resources `Used - Deserved` which is marked as Preempting Resources.
-![workflow](../images/preemption-01.2.jpg)
-
-The queues in case-02 occupy the right resources, they are constant at this stage.
-
-The queues in case-03 will preempt resources from queues in case-01.
-
-* The `Allocated - Deserved` resources in case-01.1 and `Allocated - Used` resources in case-01.2 will be assigned to these queues immediately. And its `Allocated` resources will be changed according to the increased resources.
-* The `Used - Deserved` resources will also be assigned to these queues as `Preempting` resource, and `Allocated` resources will not be changed at this moment. After the pod termination in case-01.2 is done, the `Allocated` resources will be updated.
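The case-01 rules above can be sketched for a single resource dimension (CPU). This is a minimal illustration rather than the kube-batch implementation: the `queueState` and `reclaim` types and the `planReclaim` function are hypothetical names introduced here, and the sketch assumes the preprocess stage has already ensured `Allocated >= Used`.

```go
package main

import "fmt"

// queueState holds one resource dimension of a queue (hypothetical type).
type queueState struct {
	deserved, allocated, used int64
}

// reclaim describes what a case-01 queue gives up (hypothetical type).
type reclaim struct {
	immediate  int64 // unused resources that can be handed over at once
	preempting int64 // resources freed only after pods are terminated
}

// planReclaim applies the case-01.1 / case-01.2 rules to one queue.
// Queues in case-02 and case-03 give up nothing.
func planReclaim(q queueState) reclaim {
	if q.deserved >= q.allocated {
		return reclaim{} // case-02 or case-03
	}
	if q.used <= q.deserved {
		// case-01.1: Allocated - Deserved is idle, no pod is terminated.
		return reclaim{immediate: q.allocated - q.deserved}
	}
	// case-01.2: Allocated - Used is idle, Used - Deserved needs pod termination.
	return reclaim{
		immediate:  q.allocated - q.used,
		preempting: q.used - q.deserved,
	}
}

func main() {
	// CPU values from the preemption example in this document.
	q1 := planReclaim(queueState{deserved: 2, allocated: 3, used: 3})
	q2 := planReclaim(queueState{deserved: 4, allocated: 6, used: 5})
	fmt.Println(q1.immediate+q2.immediate, q1.preempting+q2.preempting) // prints "1 2"
}
```

With the example values, one CPU is available immediately and two are marked as preempting, which matches the `Allocated` and `Preempting` CPU figures shown for Queue-3 in the example.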
-
-```
-Preemption examples:
-
-There are two queues and resources allocated as follow, Deserved is same as Allocated:
---------------------------------------------------------------------------
-| Queue-1                                                                |
-| Weight: 2                                                              |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 3                 cpu: 3              pod-2: cpu=2 memory=1Gi   |
-|   memory: 9Gi            memory: 2Gi         pod-3: cpu=1 memory=1Gi   |
---------------------------------------------------------------------------
---------------------------------------------------------------------------
-| Queue-2                                                                |
-| Weight: 4                                                              |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 6                 cpu: 5              pod-1: cpu=2 memory=1Gi   |
-|   memory: 18Gi           memory: 3Gi         pod-2: cpu=2 memory=1Gi   |
-|                                              pod-3: cpu=1 memory=1Gi   |
---------------------------------------------------------------------------
-
-There is new queue (Queue-3) coming
----------------
-| Queue-3    |
-| Weight: 3  |
----------------
-
-After proportion Policy
---------------------------------------------------------------------------
-| Queue-1                                                                |
-| Weight: 2                                                              |
-|                                                                        |
-| Deserved:                                                              |
-|   cpu: 2                                                               |
-|   memory: 6Gi                                                          |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 3                 cpu: 3              pod-2: cpu=2 memory=1Gi   |
-|   memory: 9Gi            memory: 2Gi         pod-3: cpu=1 memory=1Gi   |
---------------------------------------------------------------------------
---------------------------------------------------------------------------
-| Queue-2                                                                |
-| Weight: 4                                                              |
-|                                                                        |
-| Deserved:                                                              |
-|   cpu: 4                                                               |
-|   memory: 12Gi                                                         |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 6                 cpu: 5              pod-1: cpu=2 memory=1Gi   |
-|   memory: 18Gi           memory: 3Gi         pod-2: cpu=2 memory=1Gi   |
-|                                              pod-3: cpu=1 memory=1Gi   |
---------------------------------------------------------------------------
---------------------------------------------------------------------------
-| Queue-3                                                                |
-| Weight: 3                                                              |
-|                                                                        |
-| Deserved:                                                              |
-|   cpu: 3                                                               |
-|   memory: 9Gi                                                          |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 0                 cpu: 0              N/A                       |
-|   memory: 0Gi            memory: 0Gi                                   |
---------------------------------------------------------------------------
-
-Queue-1 overuse 1 CPU and Queue-2 overuse 1 CPU. Some pods in Queue-1/Queue-2 will be terminated to releasing these resources. Such as, pod-3 in Queue-1 and pod-3 in Queue-2 will be selected to kill.
-
-And other resources will be assigned to Queue-03 immediately, now Queue status as follow
---------------------------------------------------------------------------------------
-| Queue-1                                                                            |
-| Weight: 2                                                                          |
-|                                                                                    |
-| Deserved:                                                                          |
-|   cpu: 2                                                                           |
-|   memory: 6Gi                                                                      |
-|                                                                                    |
-| Allocated:             Used:               Pods:                                   |
-|   cpu: 2                 cpu: 3              pod-2: cpu=2 memory=1Gi               |
-|   memory: 6Gi            memory: 2Gi         pod-3: cpu=1 memory=1Gi(Terminating)  |
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
-| Queue-2                                                                            |
-| Weight: 4                                                                          |
-|                                                                                    |
-| Deserved:                                                                          |
-|   cpu: 4                                                                           |
-|   memory: 12Gi                                                                     |
-|                                                                                    |
-| Allocated:             Used:               Pods:                                   |
-|   cpu: 4                 cpu: 5              pod-1: cpu=2 memory=1Gi               |
-|   memory: 12Gi           memory: 3Gi         pod-2: cpu=2 memory=1Gi               |
-|                                              pod-3: cpu=1 memory=1Gi(Terminating)  |
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------
-| Queue-3                                                                |
-| Weight: 3                                                              |
-|                                                                        |
-| Deserved:                                                              |
-|   cpu: 3                                                               |
-|   memory: 9Gi                                                          |
-|                                                                        |
-| Allocated:        Used:             Pods:          Preempting:         |
-|   cpu: 1            cpu: 0            N/A            cpu: 2            |
-|   memory: 9Gi       memory: 0Gi                      memory: 0Gi       |
---------------------------------------------------------------------------
-
-After pod-3 in Queue-1 and pod-3 in Queue-2 are terminated, Queue-3 resources will be updated
---------------------------------------------------------------------------
-| Queue-1                                                                |
-| Weight: 2                                                              |
-|                                                                        |
-| Deserved:                                                              |
-|   cpu: 2                                                               |
-|   memory: 6Gi                                                          |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 2                 cpu: 2              pod-2: cpu=2 memory=1Gi   |
-|   memory: 6Gi            memory: 1Gi                                   |
---------------------------------------------------------------------------
---------------------------------------------------------------------------
-| Queue-2                                                                |
-| Weight: 4                                                              |
-|                                                                        |
-| Deserved:                                                              |
-|   cpu: 4                                                               |
-|   memory: 12Gi                                                         |
-|                                                                        |
-| Allocated:             Used:               Pods:                       |
-|   cpu: 4                 cpu: 4              pod-1: cpu=2 memory=1Gi   |
-|   memory: 12Gi           memory: 2Gi         pod-2: cpu=2 memory=1Gi   |
---------------------------------------------------------------------------
---------------------------------------------------------------------------
-| Queue-3                                                                |
-| Weight: 3                                                              |
-|                                                                        |
-| Deserved:                                                              |
-|   cpu: 3                                                               |
-|   memory: 9Gi                                                          |
-|                                                                        |
-| Allocated:        Used:             Pods:          Preempting:         |
-|   cpu: 3            cpu: 0            N/A            cpu: 0            |
-|   memory: 9Gi       memory: 0Gi                      memory: 0Gi       |
---------------------------------------------------------------------------
-
-```
-
-## Future work
-* In preprogress and preemption stage, the pod will be chosen randomly to kill. This may cause some more important pods killed. To solve this case, the following strategy can be used to choose pod
-    * Priority. Each pod has a priority, the lower priority pod will be selected first.
-    * Status. The pending pod will be selected first and then running pod will be selected.
-    * Runningtime. The pod with short running time will be selected first.
-* Only `Queue` level (or namespace level) preemption is supported. `QueueJob` level preemption is not, and its behaviour will be same as `Queue` level.
diff --git a/doc/design/queue_api.md b/doc/design/queue_api.md
deleted file mode 100644
index aa885eaca..000000000
--- a/doc/design/queue_api.md
+++ /dev/null
@@ -1,185 +0,0 @@
-# Queue API
-
-@jinzhejz, 10/18/2017
-
-@k82cn, 9/16/2017
-
-## Overview
-
-[Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit#) proposed
-`QueueJob` feature to run batch job with services workload in Kubernetes. Considering the complexity, the
-whole batch job proposal was separated into two phase: `Queue` and `QueueJob`. This document
-presents the API definition of `Queue` for MVP.
-
-### Scope
-
- * In MVP, the resource request of `QueueJob` are ignored
- * In MVP, the policy allocates resource based on `Queue`'s configuration
- * In MVP, `Queue` is namespace level
-
-## Terminology
-
- * Deserved (Resource): The total number of resources that the batch allocated to the namespace
- * Overused: The namespace is overused if it used more resources than deserved resources
- * Underused: The namespace is underused if it used less resource than deserved resources
-
-## API
-
-```go
-type Queue struct {
-    metav1.TypeMeta `json:",inline"`
-    metav1.ObjectMeta `json:"metadata"`
-
-    Spec QueueSpec
-    Status QueueStatus
-}
-
-type QueueSpec struct {
-    metav1.TypeMeta `json:",inline"`
-    metav1.ObjectMeta `json:"metadata"`
-
-    // The weight of Queue, which is used by policy to allocate resource; the
-    // default value is 1. NOTE: it can not expect allocating more resouce with
-    // higher weight, it dependent on policy's reaction to the weight.
-    Weight int
-
-    // The resource request of Queue, which is used by policy to allocate resource.
-    Request ResourceList
-}
-
-type QueueStatus struct {
-    metav1.TypeMeta `json:",inline"`
-    metav1.ObjectMeta `json:"metadata"`
-
-    // The deserved resource of Queue according to the policy
-    Deserved ResourceList
-    // The resources that allocated to Queue, the allocated resource is less or
-    // equal to `Deserved`:
-    // * if some resource was preempting, the Allocated is less then Deserved
-    // * otherwise, Allocated equals to Deserved
-    Allocated ResourceList
-    // The resource that used by Pod in namespace; if more resource was used than
-    // Deserved, the overused resource will be preempted.
-    Used ResourceList
-    // The resources that are preempting for Queue
-    Preempting ResourceList
-}
-```
-
-## Function Detail
-
-### Workflow
-![workflow](../images/workflow.jpg)
-
-### Admission Controller
-
-### Quota Manager
-Only Quota Manager can update Resource Quota. And it has two responsibility:
-
-* Periodically query Queue status which contains allocated resources information from API server. Status struct is defined as `QueueStatus`
-* Update `Allocated ` information in `QueueStatus` into Resource Quota
-
-### Scheduler Cache
-
-Scheduler Cache periodically fetches all Node/Pod/Queue information in the cluster from API server. That information will only be stored in memory and not persisted on disk.
-
-It provides two interfaces `Run()` and `Dump()`
-
-* `Run()` to trigger cache to periodically fetch Node/Pod/Queue information from API server
-* `Dump()` create `Snapshot` for policy
-
-```go
-type Cache interface {
-    // trigger cache to fetch Node/Pod/Queue
-    // information periodically from API server
-    Run(stopCh <-chan struct{})
-
-    // Dump deep copy overall Node/Pod/Queue information into Snapshot
-    Dump() *Snapshot
-}
-
-type Snapshot struct {
-    Pods []*PodInfo
-    Nodes []*NodeInfo
-    Queues []*QueueInfo
-    Queuejobs []*QueuejobInfo
-}
-```
-
-### Proportion Policy
-
-The policy creates a summary of usable resources(CPU and memory) on all nodes and allocates them to each Queue by `Weight` and `Request` in `QueueSpec` according to max-min weighted fairness algorithm. `Pods` is not used in the policy, it is for preemption in next step.
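The weighted max-min fair allocation named above can be sketched for one resource dimension. This is a generic textbook version of the algorithm under stated assumptions, not the code the proportion policy actually ran: the `queue` type and the `weightedMaxMin` function are hypothetical, and resources are modelled as plain `float64` values instead of a `ResourceList`.

```go
package main

import "fmt"

// queue is a simplified stand-in for a Queue's Weight and Request
// in a single resource dimension (hypothetical type).
type queue struct {
	name    string
	weight  float64
	request float64
}

// weightedMaxMin splits total among queues in proportion to their weights,
// never giving a queue more than it requested. Surplus left by queues whose
// request is satisfied is redistributed among the remaining queues.
func weightedMaxMin(total float64, queues []queue) map[string]float64 {
	deserved := make(map[string]float64, len(queues))
	active := make(map[string]queue, len(queues))
	for _, q := range queues {
		deserved[q.name] = 0
		if q.weight > 0 && q.request > 0 {
			active[q.name] = q
		}
	}

	remaining := total
	for remaining > 1e-9 && len(active) > 0 {
		var weightSum float64
		for _, q := range active {
			weightSum += q.weight
		}

		var handedOut float64
		saturated := false
		for name, q := range active {
			share := remaining * q.weight / weightSum
			need := q.request - deserved[name]
			if share >= need {
				// Request fully met: cap the share and drop the queue.
				share = need
				delete(active, name)
				saturated = true
			}
			deserved[name] += share
			handedOut += share
		}
		remaining -= handedOut
		if !saturated {
			break // every active queue took its full proportional share
		}
	}
	return deserved
}

func main() {
	// CPU figures from the snapshot example in this document: 6 + 3 = 9 CPUs.
	queues := []queue{
		{name: "Queue-1", weight: 2, request: 5},
		{name: "Queue-2", weight: 4, request: 10},
	}
	d := weightedMaxMin(9, queues)
	fmt.Println(d["Queue-1"], d["Queue-2"]) // prints "3 6"
}
```

Running the same function with the memory figures (27Gi in total, requests of 10Gi and 20Gi) gives 9 and 18, which agrees with the `Deserved` values in the example.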
-
-```
-Snapshot information:
-------------------    ------------------
-| Node-1         |    | Node-2         |
-| cpu: 6         |    | cpu: 3         |
-| memory: 15Gi   |    | memory: 12Gi   |
-------------------    ------------------
---------------------------    --------------------------
-| Queue-1                |    | Queue-2                |
-| Weight: 2              |    | Weight: 4              |
-| Request: cpu=5         |    | Request: cpu=10        |
-|          memory=10Gi   |    |          memory=20Gi   |
---------------------------    --------------------------
-
-After policy scheduling:
---------------------------    --------------------------
-| Queue-1                |    | Queue-2                |
-| Weight: 2              |    | Weight: 4              |
-| Request: cpu=5         |    | Request: cpu=10        |
-|          memory=10Gi   |    |          memory=20Gi   |
-|                        |    |                        |
-| Deserved:              |    | Deserved:              |
-|   cpu: 3               |    |   cpu: 6               |
-|   memory: 9Gi          |    |   memory: 18Gi         |
---------------------------    --------------------------
-```
-
-Policy format scheduler results as `QueueInfo` and transfers to Preemption for next step.
-
-```go
-type QueueInfo struct {
-    // The name of the queue
-    name string
-    // Queue information contains Deserved/Allocated/Used/Preempting
-    queue *Queue
-    // Running pods under this queue
-    pods map[string]*Pod
-}
-```
-
-### Preemption
-
-Preemption is used to reclaim resource for overused(`Deserved` < `Allocated`) queue. The following status will be guaranteed after preemption done.
-
-* The cluster won't be overused.
-* Each queue must meet `Deserved = Allocated`, it means the queue will get all resources allocated to it.
-* Each queue must meet `Allocated >= Used`, it means the queue won't be overused.
-* `Preempting` of each queue must be empty. If it is not empty, it means some pods are terminated to release resources to this queue, but it is not finished.
-
-#### Brief workflow:
-
-* Preprocess of `QueueInfo`. Currently, it terminates pods of overused(`Allocated < Used`) queue without preemption. Deserved/Allocated/Used/Preempting will not be adjusted in this stage.
-* Adjust Deserved/Allocated/Used/Preempting of each Queue to trigger preemption, then update results to each Queue.
-* Terminate running pods for each Queue which need preemption, and update Allocated of Queue after preempted pod is finished.
-
-```go
-type Interface interface {
-    // Run start pod informer to handle terminating pods
-    Run(stopCh <-chan struct{})
-
-    // Preprocessing preprocess for each queue
-    // Currently, it terminates pods of overused(Allocated < Used) queue without preemption
-    // Deserved/Allocated/Used/Preempting will not be adjusted
-    Preprocessing(queues map[string]*schedulercache.QueueInfo, pods []*schedulercache.PodInfo) (map[string]*schedulercache.QueueInfo, error)
-
-    // PreemptResources preempt resources between queue
-    // Deserved/Allocated/Used/Preempting will be adjusted
-    PreemptResources(queues map[string]*schedulercache.QueueInfo) error
-}
-```
-
-### Queuejob
-The proportion policy just used Queue `weight` to allocate resources now, it does not consider the queue job information, such as priority, resources requirement or some other factors. And we need provide some complex scheduling strategy to consider these queue job information.
diff --git a/doc/design/queuejob_api.md b/doc/design/queuejob_api.md
deleted file mode 100644
index 4b8d56384..000000000
--- a/doc/design/queuejob_api.md
+++ /dev/null
@@ -1,57 +0,0 @@
-# QueueJob API
-
-@jinzhejz, 12/22/2017
-
-## Overview
-[Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit#) proposed
-`QueueJob` feature to run batch job with services workload in Kubernetes. Considering the complexity, the whole batch job proposal was separated into two phase: `Queue` and `QueueJob`. This document presents the API definition of `QueueJob` and feature interaction with `Queue`.
-
-## API
-```go
-// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
-type QueueJob struct {
-    metav1.TypeMeta `json:",inline"`
-    metav1.ObjectMeta `json:"metadata"`
-    Spec QueueJobSpec `json:"spec"`
-    Status QueueJobStatus `json:"status,omitempty"`
-}
-
-type QueueJobSpec struct {
-    // Priority of the QueueJob, higher priority QueueJob gets resources first
-    Priority int `json:"priority"`
-    // ResourceUnit * ResourceNo = total resource of QueueJob
-    ResourceUnit ResourceList `json:"resourceunit"`
-    ResourceNo int `json:"resourceno"`
-    // The Queue which the QueueJob belongs to
-    Queue string `json:"queue"`
-}
-
-type QueueJobStatus struct {
-    // The resources allocated to the QueueJob
-    Allocated ResourceList `json:"allocated"`
-}
-
-// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
-type QueueJobList struct {
-    metav1.TypeMeta `json:",inline"`
-    metav1.ListMeta `json:"metadata"`
-    Items []QueueJob `json:"items"`
-}
-```
-
-## Function details
-### Workflow
-![workflow](../images/queuejob.jpg)
-
-It is basically the same as the workflow in Queue API document (`QuotaManager` is not included in above workflow). The difference is just including `QueueJob` in `Queue`.
-
-A `Queue` can include 0 or more `QueueJob`.
-
-* If a Queue includes 0 QueueJob, its resource request is same as before. Such as `q03` in above.
-* If a Queue includes 1 or more QueueJob, the resource request of the queue equals the sum of all QueueJob resource request. Such as `q01` and `q02` in above.
-
-For Queue `q01` and `q02`, Kube-batch will assign resources to their QueueJob directly.
-For Queue `q03`, Kube-batch will just assign resources to the Queue.
-
-## Future work
-* Now QueueJob is associated not with the real batch job, users who want to submit a batch job need to create their own QueueJob and watch the QueueJob, then submit their batch job after kube-batch assign resources to QueueJob.
\ No newline at end of file
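
The request aggregation described under "Function details" in the QueueJob API document (a QueueJob's total is `ResourceUnit * ResourceNo`, and a Queue that contains QueueJobs requests the sum of those totals) can be sketched as follows. All names here are hypothetical: `resourceList` is a simplified integer map standing in for `ResourceList`, and `jobTotal` and `queueRequests` are illustration helpers, not kube-batch functions.

```go
package main

import "fmt"

// resourceList is a simplified stand-in for ResourceList: resource name
// to an integer quantity (for example CPU cores, memory in Gi).
type resourceList map[string]int64

// queueJobSpec mirrors the fields of QueueJobSpec used for aggregation.
type queueJobSpec struct {
	resourceUnit resourceList
	resourceNo   int64
	queue        string
}

// jobTotal returns ResourceUnit * ResourceNo for one QueueJob.
func jobTotal(j queueJobSpec) resourceList {
	total := make(resourceList, len(j.resourceUnit))
	for name, qty := range j.resourceUnit {
		total[name] = qty * j.resourceNo
	}
	return total
}

// queueRequests sums the totals of all QueueJobs per Queue. Queues without
// any QueueJob do not appear in the result and keep their own request.
func queueRequests(jobs []queueJobSpec) map[string]resourceList {
	requests := make(map[string]resourceList)
	for _, j := range jobs {
		if requests[j.queue] == nil {
			requests[j.queue] = make(resourceList)
		}
		for name, qty := range jobTotal(j) {
			requests[j.queue][name] += qty
		}
	}
	return requests
}

func main() {
	// Illustrative values only; they do not come from the design document.
	jobs := []queueJobSpec{
		{resourceUnit: resourceList{"cpu": 1, "memory": 2}, resourceNo: 4, queue: "q01"},
		{resourceUnit: resourceList{"cpu": 2, "memory": 4}, resourceNo: 1, queue: "q01"},
		{resourceUnit: resourceList{"cpu": 1, "memory": 1}, resourceNo: 3, queue: "q02"},
	}
	for name, req := range queueRequests(jobs) {
		fmt.Println(name, req)
	}
}
```

A real implementation would presumably work on Kubernetes quantity types rather than plain integers; the integer map is used only to keep the sketch self-contained.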