
Reclaim Enhancement: Enqueue action may block the process of reclaim action #569

Closed
sivanzcw opened this issue Nov 27, 2019 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@sivanzcw
Contributor

/kind feature

If one queue has occupied most of the cluster resources and pods then need to be scheduled in a new queue, the reclaim action may be blocked, because the enqueue action cannot move the job of the new pod to the Inqueue status.

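  • cluster resources

| serial | node name | resource |
|--------|-----------|----------|
| 1      | node1     | 4c8g     |
| 2      | node2     | 4c8g     |

  • queue status

| serial | queue name | weight | quota | status   |
|--------|------------|--------|-------|----------|
| 1      | queue2     | 1      | 4c 8g | overused |
| 2      | queue3     | 1      | 4c 8g | active   |
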
  • create joba with 7 pods, each requesting 1 CPU and 1.5Gi of memory (1c1.5g); minAvailable of joba is 1, and joba is placed in queue2
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job-queue1
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: zjh-higher
  queue: queue2
  policies:
  - event: PodEvicted
    action: RestartJob
  - event: PodFailed
    action: RestartJob
  plugins:
    svc: []
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          args:
          - --kv-store=dist_sync
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1.5Gi"
            requests:
              cpu: "1"
              memory: "1.5Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "worker"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 2
    name: server
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1.5Gi"
            requests:
              cpu: "1"
              memory: "1.5Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "server"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 4
    name: scheduler
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1.5Gi"
            requests:
              cpu: "1"
              memory: "1.5Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "scheduler"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure

  • all pods of joba reach the Running state

  • create jobb with one pod that requests 2 CPUs and 2Gi of memory (2c2g); jobb is placed in queue3

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job-default
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: zjh-higher
  queue: queue3
  plugins:
    svc: []
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          args:
          - --kv-store=dist_sync
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "2"
              memory: "2Gi"
            requests:
              cpu: "2"
              memory: "2Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "worker"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  • expected: the pod of jobb evicts two pods of joba. Actual: no eviction happened and jobb stayed pending. Because the idle cluster resources cannot satisfy the resource requirement of jobb, the podgroup of jobb remains in the Pending phase, and a job in the Pending phase cannot evict pods of other jobs.
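
To make the numbers in the reproduction concrete, here is a rough sketch of the arithmetic and of the gate described above. It is a simplified model, not Volcano's scheduler code: `Resource`, `enqueue_phase` and `can_reclaim` are hypothetical names made up for illustration, and the only rules modelled are the two stated in this report (a podgroup leaves Pending only if idle resources cover its request, and a Pending job cannot evict).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical helper: CPU in cores, memory in Gi."""
    cpu: float
    mem_gi: float

    def __sub__(self, other):
        return Resource(self.cpu - other.cpu, self.mem_gi - other.mem_gi)

    def times(self, n):
        return Resource(self.cpu * n, self.mem_gi * n)

    def fits_in(self, other):
        # True only if every dimension of the request fits.
        return self.cpu <= other.cpu and self.mem_gi <= other.mem_gi


def enqueue_phase(request, idle):
    # Assumed model of the enqueue gate from this report: the podgroup
    # becomes Inqueue only when idle resources cover the request.
    return "Inqueue" if request.fits_in(idle) else "Pending"


def can_reclaim(phase):
    # Assumed model: a job whose podgroup is still Pending is never
    # considered by reclaim, so it cannot evict anything.
    return phase != "Pending"


cluster_total = Resource(4, 8).times(2)     # node1 + node2, 4c8g each
joba_used = Resource(1, 1.5).times(7)       # 7 running pods of 1c1.5g
idle = cluster_total - joba_used            # what is left for jobb
jobb_request = Resource(2, 2)               # single pod of 2c2g

phase = enqueue_phase(jobb_request, idle)
print(idle, phase, can_reclaim(phase))
```

Under these assumptions joba uses 7c 10.5g of the 8c 16g cluster, leaving 1c 5.5g idle. jobb needs 2c, so its podgroup stays Pending and reclaim is never attempted for it, even though queue2 is overused.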
@k82cn
Member

k82cn commented Nov 28, 2019

/kind bug
/priority import-soon

@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 28, 2019
@volcano-sh-bot
Contributor

@k82cn: The label(s) priority/import-soon cannot be applied. These labels are supported: ``

In response to this:

/kind bug
/priority import-soon


@k82cn k82cn added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 28, 2019
@sivanzcw
Contributor Author

solved in #587
