The command arena top job <job name> displays GPU monitoring metrics for a job. Before using it, you must deploy Prometheus and a GPU node exporter to collect the GPU metrics.

1. Deploy Prometheus

kubectl apply -f kubernetes-artifacts/prometheus/prometheus.yaml

2. Deploy the GPU node exporter

  • If your cluster is an ACK (Alibaba Cloud Kubernetes) cluster, run the following command:
# change the GPU exporter nodeSelector to the aliyun label
sed -i 's|accelerator/nvidia_gpu|aliyun.accelerator/nvidia_count|g' kubernetes-artifacts/prometheus/gpu-exporter.yaml
  • If your cluster is not an ACK cluster, you need to label your GPU nodes:
# label all your GPU nodes
kubectl label node <your GPU node> accelerator/nvidia_gpu=true
  • Deploy the GPU exporter:
kubectl apply -f kubernetes-artifacts/prometheus/gpu-exporter.yaml
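
On a cluster with several GPU nodes, the labeling step above can be wrapped in a small loop. The following is a minimal sketch, not part of Arena: the function name label_gpu_nodes and the KUBECTL override variable are hypothetical and exist only so that the loop can be previewed without touching a cluster.

```shell
#!/bin/sh
# Apply the accelerator/nvidia_gpu=true label to every node name given as an argument.
# KUBECTL is a hypothetical override (an assumption of this sketch); setting it to
# "echo kubectl" prints the commands instead of running them.
label_gpu_nodes() {
  for node in "$@"; do
    # KUBECTL is left unquoted on purpose so that "echo kubectl" splits into two words.
    # --overwrite lets the command succeed if the label is already present.
    ${KUBECTL:-kubectl} label node "$node" accelerator/nvidia_gpu=true --overwrite || return 1
  done
}
```

To preview the commands, run KUBECTL="echo kubectl" label_gpu_nodes <node 1> <node 2>; without the override the function calls kubectl directly.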

Note: the Prometheus and gpu-exporter components must be deployed in the kube-system namespace so that arena top job <job name> can work.

3. Check the GPU metrics with a Prometheus query

# kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices'

{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nvidia_gpu_num_devices","app":"node-gpu-exporter","instance":"172.16.1.144:9445","job":"kubernetes-service-endpoints","k8s_app":"node-gpu-exporter","kubernetes_name":"node-gpu-exporter","node_name":"mynode"},"value":[1543202894.919,"2"]}]}}
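
The raw JSON is hard to read at a glance. As a rough sketch, the helper below reduces a response like the one above to one "node_name count" line per result. The function name gpu_counts is hypothetical, and the pattern assumes that each result lists node_name before value, as in the sample response. A JSON-aware tool such as jq would be more robust; this version uses only grep and sed.

```shell
#!/bin/sh
# Read a Prometheus instant-query response on stdin and print "<node_name> <value>"
# for each result. Assumes the layout of the sample response: the node_name label
# appears before the "value" array of the same result.
gpu_counts() {
  # Step 1: cut out the span from each node_name label to the end of its value array.
  grep -o '"node_name":"[^"]*"[^[]*"value":\[[^]]*\]' |
    # Step 2: keep only the node name and the sample value (the second array element).
    sed -E 's/^"node_name":"([^"]*)".*"value":\[[^,]*,"([^"]*)"\]$/\1 \2/'
}
```

Piping the kubectl get --raw command from this step into gpu_counts would print mynode 2 for the response shown here.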

4. Submit a training job with arena

arena submit tf --name=style-transfer \
              --gpus=2 \
              --workers=2 \
              --workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
              --workingDir=/neural-style \
              --ps=1 \
              --psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
              "python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000000"

5. Check GPU metrics for the job you deployed

# arena top job style-transfer
INSTANCE NAME                  STATUS   NODE          GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)
style-transfer-tfjob-ps-0      Running  192.168.0.95  N/A                N/A              N/A
style-transfer-tfjob-worker-0  Running  192.168.0.98  0                  98%              15641MiB / 16276MiB
                                                      1                  0%               15481MiB / 16276MiB
style-transfer-tfjob-worker-1  Running  192.168.0.99  0                  98%              15641MiB / 16276MiB
                                                      1                  0%               15481MiB / 16276MiB
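
In the output above, each worker holds two GPUs but only device 0 is busy. A table like this can be scanned for idle devices with a short awk filter. The following is a minimal sketch that assumes the column layout shown above; the function name idle_gpus and its optional threshold argument are hypothetical.

```shell
#!/bin/sh
# Read `arena top job` output on stdin and print "<instance> <device index> <duty cycle>"
# for every GPU whose duty cycle is at or below the threshold (default 0).
# Assumes the table layout shown above, including indented rows for extra GPUs.
idle_gpus() {
  awk -v threshold="${1:-0}" '
    /^INSTANCE/   { next }                                  # skip the header row
    NF == 0       { next }                                  # skip blank lines
    /^[^ \t]/     { instance = $1; idx = $4; duty = $5 }    # row that names an instance
    /^[ \t]/      { idx = $1; duty = $2 }                   # indented row: another GPU of the same instance
    duty == "N/A" { next }                                  # instance without a GPU, such as the ps pod
    {
      pct = duty
      sub(/%$/, "", pct)                                    # "98%" becomes "98"
      if (pct + 0 <= threshold + 0) print instance, idx, duty
    }
  '
}
```

For example, arena top job style-transfer | idle_gpus would report device 1 of both workers for the output above, and idle_gpus 50 would raise the cutoff to 50%.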