Reclaim Enhancement: Enqueue action may block the process of `reclaim` action #569

sivanzcw · 2019-11-27T13:31:55Z

/kind feature

If one queue already occupies most of the cluster's resources and pods need to be scheduled in a new queue, the `reclaim` action may be blocked, because the `enqueue` action cannot move the new pod's job to the Inqueue state.

cluster resources

serial	node name	resource
1	node1	4c8g
2	node2	4c8g

queue status

serial	queue name	weight	quota	status
1	queue2	1	4c 8g	overused
2	queue3	1	4c 8g	active
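
The quota column is consistent with splitting the cluster capacity between the queues in proportion to their weights. The sketch below reproduces the 4c 8g figures under that assumption. The `deserved_share` helper and the dictionary layout are illustrative only and are not taken from Volcano's code.

```python
# Node capacities from the "cluster resources" table: (CPU cores, memory in GiB).
nodes = {"node1": (4, 8), "node2": (4, 8)}

# Queue weights from the "queue status" table.
queues = {"queue2": 1, "queue3": 1}


def deserved_share(nodes, queues):
    """Split total capacity across queues in proportion to weight.

    Hypothetical helper: a simplified model of a weight-based quota,
    not the scheduler's actual algorithm.
    """
    total_cpu = sum(cpu for cpu, _ in nodes.values())
    total_mem = sum(mem for _, mem in nodes.values())
    total_weight = sum(queues.values())
    return {
        name: (total_cpu * weight / total_weight, total_mem * weight / total_weight)
        for name, weight in queues.items()
    }


shares = deserved_share(nodes, queues)
for name, (cpu, mem) in shares.items():
    print(f"{name}: {cpu:g}c {mem:g}g")
```

With two queues of equal weight on an 8c16g cluster, each queue gets half of the capacity.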

Create joba with 7 pods, each requesting 1c1.5g. The minAvailable of joba is 1, and joba is placed in queue2.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job-queue1
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: zjh-higher
  queue: queue2
  policies:
  - event: PodEvicted
    action: RestartJob
  - event: PodFailed
    action: RestartJob
  plugins:
    svc: []
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          args:
          - --kv-store=dist_sync
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1.5Gi"
            requests:
              cpu: "1"
              memory: "1.5Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "worker"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 2
    name: server
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1.5Gi"
            requests:
              cpu: "1"
              memory: "1.5Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "server"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 4
    name: scheduler
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1.5Gi"
            requests:
              cpu: "1"
              memory: "1.5Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "scheduler"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
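
As a quick check of the manifest above, the following sketch sums the task replicas and compares the total request with the 4c 8g quota of queue2. The variable names are illustrative, and the comparison is a simplified reading of "overused", not the scheduler's own logic.

```python
# Replica counts per task, copied from the joba manifest.
replicas = {"worker": 1, "server": 2, "scheduler": 4}

# Every pod requests 1 CPU and 1.5Gi of memory.
per_pod_cpu, per_pod_mem = 1.0, 1.5

# Quota of queue2 from the queue status table (CPU cores, memory in GiB).
quota_cpu, quota_mem = 4.0, 8.0

pod_count = sum(replicas.values())
used_cpu = pod_count * per_pod_cpu
used_mem = pod_count * per_pod_mem

# Assumption: a queue counts as overused once any dimension exceeds its quota.
overused = used_cpu > quota_cpu or used_mem > quota_mem

print(f"pods={pod_count} used={used_cpu:g}c {used_mem:g}g overused={overused}")
```

Once all seven pods are running, joba alone requests 7c and 10.5Gi, which is well above the quota of queue2.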

All pods of joba reach the Running state.
Create jobb with one pod that requests 2c2g. jobb is placed in queue3.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job-default
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: zjh-higher
  queue: queue3
  plugins:
    svc: []
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          args:
          - --kv-store=dist_sync
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "2"
              memory: "2Gi"
            requests:
              cpu: "2"
              memory: "2Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "worker"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure

Expected: the pod of jobb evicts two pods of joba. Actual: no eviction happens and jobb stays Pending. Because the idle cluster resources cannot satisfy the resource requirement of jobb, the podgroup of jobb remains Pending, and a job in the Pending phase cannot evict pods of other jobs.
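
The blocking behaviour described here can be sketched as a small model. It assumes, as the report states, that a job is moved to Inqueue only when its minimum request fits into the idle cluster resources, and that reclaim only considers Inqueue jobs. The `Resource` class and the `enqueue` and `reclaim_candidates` functions are hypothetical names for this illustration, not Volcano APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    cpu: float     # CPU cores
    mem_gi: float  # memory in GiB

    def __sub__(self, other):
        return Resource(self.cpu - other.cpu, self.mem_gi - other.mem_gi)

    def fits_in(self, other):
        # True only if every dimension fits.
        return self.cpu <= other.cpu and self.mem_gi <= other.mem_gi


def enqueue(min_request, idle):
    """Hypothetical gate: a job leaves Pending only if it fits into idle resources."""
    return "Inqueue" if min_request.fits_in(idle) else "Pending"


def reclaim_candidates(job_phases):
    """Hypothetical filter: only Inqueue jobs may trigger eviction."""
    return [name for name, phase in job_phases.items() if phase == "Inqueue"]


total = Resource(cpu=8, mem_gi=16)               # node1 + node2
joba_used = Resource(cpu=7 * 1, mem_gi=7 * 1.5)  # 7 running pods of joba
idle = total - joba_used                         # 1c, 5.5Gi left
jobb_min = Resource(cpu=2, mem_gi=2)             # single pod of jobb

phase = enqueue(jobb_min, idle)
candidates = reclaim_candidates({"jobb": phase})
print(idle, phase, candidates)
```

In this model jobb needs 2c but only 1c is idle, so it never leaves Pending and reclaim has no job to act for, even though queue2 is overused.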

k82cn · 2019-11-28T01:03:02Z

/kind bug
/priority import-soon

volcano-sh-bot · 2019-11-28T01:03:07Z

@k82cn: The label(s) priority/import-soon cannot be applied. These labels are supported: ``

In response to this:

/kind bug
/priority import-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sivanzcw · 2019-12-11T09:02:41Z

solved in #587

volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2019

k82cn added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 28, 2019

sivanzcw mentioned this issue Dec 6, 2019

Add arguments for action #587

Merged

sivanzcw closed this as completed Dec 11, 2019
