Skip to content

Latest commit

 

History

History
130 lines (104 loc) · 3.54 KB

how-to-use-volcano-vgpu.md

File metadata and controls

130 lines (104 loc) · 3.54 KB

Volcano vgpu device plugin for Kubernetes

Note:

You DON'T need to install HAMi when using volcano-vgpu, only use
Volcano vgpu device-plugin is good enough. It can provide device-sharing mechanism for NVIDIA devices managed by volcano.

This is based on Nvidia Device Plugin, it uses HAMi-core to support hard isolation of GPU card.

Volcano vgpu is only available in volcano > 1.9

Quick Start

Install Volcano

helm repo add volcano-sh https://volcano-sh.github.io/helm-charts helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

Configure scheduler

update the scheduler configuration:

kubectl edit cm -n volcano-system volcano-scheduler-configmap
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vgpu
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

Enabling GPU Support in Kubernetes

Once you have enabled this option on all the GPU nodes you wish to use, you can then enable GPU support in your cluster by deploying the following Daemonset:

$ kubectl create -f https://raw.githubusercontent.com/Project-HAMi/volcano-vgpu-device-plugin/main/volcano-vgpu-device-plugin.yml

Verify environment is ready

Check the node status, it is ok if volcano.sh/vgpu-number is included in the allocatable resources.

$ kubectl get node {node name} -oyaml
...
status:
  addresses:
  - address: 172.17.0.3
    type: InternalIP
  - address: volcano-control-plane
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: 123722704Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8174332Ki
    pods: "110"
    volcano.sh/gpu-number: "10"    # vGPU resource
  capacity:
    cpu: "4"
    ephemeral-storage: 123722704Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8174332Ki
    pods: "110"
    volcano.sh/gpu-memory: "89424"
    volcano.sh/gpu-number: "10"   # vGPU resource

Running VGPU Jobs

VGPU can be requested by both set "volcano.sh/vgpu-number" , "volcano.sh/vgpu-cores" and "volcano.sh/vgpu-memory" in resource.limit

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-number: 2 # requesting 2 gpu cards
          volcano.sh/vgpu-memory: 3000 # (optinal)each vGPU uses 3G device memory
          volcano.sh/vgpu-cores: 50 # (optional)each vGPU uses 50% core  
EOF

You can validate device memory using nvidia-smi inside container:

WARNING: if you don't request GPUs when using the device plugin with NVIDIA images all the GPUs on the machine will be exposed inside your container. The number of vgpu used by a container can not exceed the number of gpus on that node.

Monitor

volcano-scheduler-metrics records every GPU usage and limitation, visit the following address to get these metrics.

curl {volcano scheduler cluster ip}:8080/metrics