nodegroup plugin design doc #2227

Merged: 1 commit, Jun 28, 2022

qiankunli (Contributor) commented May 12, 2022

design doc for #2224
related issue #1830

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 12, 2022
- name: drf
- name: predicates
- name: proportion
- name: nodegroup

Review comment (Member):

As it's a common scenario, let's discuss enhancing the queue to support it instead of using a plugin.

whybeyoung (Contributor) commented May 13, 2022

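@qiankunli I have looked through the design and have a few thoughts that I hope can be taken into account.

From the tenant's resource perspective:

Tenant hard resources: these are the labeled resources. For nlp tasks:

1. When cluster resources are relatively idle: non-nlp tasks are running in the queue=nlp part of the cluster, and if cluster resources allow (there is spare capacity), nlp tasks can be scheduled to other nodes.

2. When cluster resources are tight: non-nlp tasks are running in the queue=nlp part of the cluster, but the cluster has no resources left for newly arriving nlp tasks [and the total resources of the nlp queue's tasks do not exceed its labeled resources]. In this case the non-nlp tasks in queue-nlp are evicted or stopped.

Every tenant can follow the baseline policy above.

Tenant soft resources: tasks beyond the total of the tenant's labeled nodes all count as elastic soft resources. Depending on actual runtime conditions, soft resources may be preempted, evicted, or stopped according to some scoring algorithm.

A fairly bad case: every queue fully uses its own resources; everyone gets along and nobody owes anyone anything.
A fairly good case: tenants time-share and take what they need, so each feels that the cluster always meets its hard requirements.

The worst case: everyone's workloads overlap heavily in time and resources cannot be shared at all (here some reserve mechanism could be set on the queue).
A relatively bad case: my long-running soft-resource task is almost finished when someone else kills it [training scenarios then need a checkpoint / resume-training mechanism].

From the perspective of the resources themselves:

The tenant split above exists essentially because resources themselves differ: your resources versus my resources. The dimensions Volcano scheduling currently uses to describe resources cannot yet express certain properties of the resource itself, such as CPU model or GPU card model, and different users may have different policies for these. So we need to build scheduling for specific scenarios on labels like nodegroup, or even on multiple labels.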
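
The hard/soft split described in this comment can be sketched as follows. This is only an illustration under assumptions that are not in the comment: resources are reduced to a single GPU count, tasks are considered in submission order, and the names `Task` and `split_hard_soft` are hypothetical, not Volcano APIs.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    gpus: int  # assumption: one resource dimension, for illustration only


def split_hard_soft(tasks, labeled_gpus):
    """Split a queue's tasks into hard (fit within the labeled capacity)
    and soft (overflow, candidates for preemption or eviction)."""
    hard, soft = [], []
    used = 0
    for task in tasks:
        if used + task.gpus <= labeled_gpus:
            hard.append(task)
            used += task.gpus
        else:
            soft.append(task)
    return hard, soft
```

Under this sketch, a scoring algorithm for preemption would only ever look at the `soft` list; tasks in `hard` stay protected as long as the queue is within its labeled capacity.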

qiankunli (Contributor Author) commented May 15, 2022

How can we describe the relationship between queue and nodegroup in queue.spec?

Example 1: we can add v1.Affinity (from the Kubernetes core API) to queue.spec directly. This means we describe the relationship between the queue and individual nodes (not nodegroups).

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  guarantee: {}
  reclaimable: true
  weight: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: labelName
                operator: In
                values:
                  - labelValue
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: labelName
                operator: NotIn
                values:
                  - labelValue

Example 2: we can add a label (such as volcano.sh/node-group) to each node and describe the relationship between queue and nodegroup directly.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  guarantee: {}
  reclaimable: true
  weight: 1
  affinity:
    nodeGroupAffinity:
      required:
        - groupname1
        - groupname2
      preferred:
        - groupname1
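
A minimal sketch of how a scheduler could interpret the `required` and `preferred` lists from example 2. It assumes nodes carry the `volcano.sh/node-group` label mentioned above; the function names and the bonus value are hypothetical and are not taken from the plugin's implementation.

```python
NODE_GROUP_LABEL = "volcano.sh/node-group"


def node_allowed(node_labels, required):
    """Filter step: an empty `required` list places no constraint;
    otherwise the node's group must be one of the required groups."""
    if not required:
        return True
    return node_labels.get(NODE_GROUP_LABEL) in required


def node_score(node_labels, preferred, bonus=100):
    """Scoring step: nodes in a preferred group get a bonus, others get 0.
    The bonus value is an arbitrary illustrative constant."""
    return bonus if node_labels.get(NODE_GROUP_LABEL) in preferred else 0
```

Splitting the logic into a filter and a score mirrors the required/preferred distinction of v1.Affinity in example 1: `required` removes nodes from consideration, while `preferred` only reorders the remaining candidates.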

qiankunli (Contributor Author) commented May 20, 2022

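Notes from the community discussion on May 20:

Introducing an affinity mechanism between queue and nodegroup may bring a risk: a job may be runnable according to the queue's rules, yet no suitable node can be found under the nodegroup affinity rules, so the job stays pending. In essence, the resource configuration of the queue and of the nodegroup can easily become inconsistent.

First, the requirement that "specific tasks run on specific types of nodes" does exist. Without a nodegroup mechanism, users would in practice use taints or similar mechanisms, and the same problem would arise.

There are two ways to mitigate this:

  1. The queue-nodegroup affinity configuration could be a parameter of the nodegroup plugin. Those who use the plugin bear the risk, and those who do not use it need not care. The drawback is that configuration is complex and users cannot view it through kubectl or similar tools. So we still lean toward making the affinity configuration an attribute of the queue.
  2. We can compute an upper bound on a queue's usable resources from the queue-nodegroup affinity configuration, so that queue.capacity = min(capacity configured manually by the user, upper bound of available nodegroup resources). But this may make the proportion plugin code too complex. We set this risk aside for now, leave it to those who use the mechanism, and see how requirements evolve.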
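
The capacity bound in item 2 can be illustrated with a short sketch. It is not part of the proposal's implementation (the discussion explicitly defers it); the function name, the dictionary-based resource representation, and the treatment of an empty `required_groups` list as "unconstrained" are all assumptions made for this example.

```python
def effective_capacity(configured, group_allocatable, required_groups):
    """Return min(configured capacity, resources reachable through the
    queue's required nodegroups), computed per resource name."""
    if not required_groups:
        # Assumption: no required groups means the queue is not restricted.
        return dict(configured)
    reachable = {}
    for group in required_groups:
        for resource, amount in group_allocatable.get(group, {}).items():
            reachable[resource] = reachable.get(resource, 0) + amount
    return {
        resource: min(amount, reachable.get(resource, 0))
        for resource, amount in configured.items()
    }
```

For example, a queue configured with 64 CPUs whose required nodegroups together offer only 48 CPUs would be capped at 48, which removes the case where a job fits the queue but can never fit its nodes.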

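As for expressing the queue-nodegroup affinity configuration in the queue's configuration, the second form (example 2) is more readable and also keeps the nodegroup concept, so we will use the second form.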

k82cn (Member) commented May 23, 2022

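Is the queue-to-nodeGroup relationship 1:1? If not, how do we balance resources, e.g. how do we preempt/reclaim, and how do we allocate resources across nodeGroups?

> Introducing an affinity mechanism between queue and nodegroup may bring a risk: a job may be runnable according to the queue's rules, yet no suitable node can be found under the nodegroup affinity rules, so the job stays pending. In essence, the resource configuration of the queue and of the nodegroup can easily become inconsistent.

The user/admin should make sure the configuration is reasonable; what we can do is report the related pending reason.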
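
As a rough illustration of reporting a pending reason, the sketch below builds a message when none of a queue's required nodegroups contains any node. The function name, the message wording, and the `nodes_by_group` mapping are hypothetical; how Volcano would actually surface such a reason is not specified in this thread.

```python
def pending_reason(queue, required_groups, nodes_by_group):
    """Return a human-readable reason if no required nodegroup has nodes,
    or None when at least one required group is usable (or none is required)."""
    if not required_groups:
        return None
    usable = [g for g in required_groups if nodes_by_group.get(g)]
    if usable:
        return None
    groups = ", ".join(sorted(required_groups))
    return f"queue {queue}: no nodes found in required nodegroups [{groups}]"
```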


Risk: the resources of the queue must not differ too much from the resources of the node group; otherwise a task may be schedulable from the queue's perspective yet fail to find a suitable node.

Review comment (Member):

Please clarify the following in the design doc:

  1. Queue API change:
  2. Command line:
    Should both kubectl and vcctl support displaying nodegroup info when querying queues via the CLI?

Commits pushed by qiankunli (each with Signed-off-by: qiankunli):

update design
fix doc
fix doc
william-wang (Member) left a comment:

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2022
volcano-sh-bot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2022
@volcano-sh-bot volcano-sh-bot merged commit cb39985 into volcano-sh:master Jun 28, 2022
Labels: approved, lgtm, retest-not-required-docs-only, size/M