diff --git a/cmd/gpu_plugin/README.md b/cmd/gpu_plugin/README.md
index 54999fef8..3eab749e0 100644
--- a/cmd/gpu_plugin/README.md
+++ b/cmd/gpu_plugin/README.md
@@ -53,7 +53,7 @@ For workloads on different KMDs, see [KMD and UMD](#kmd-and-umd).
 | Flag | Argument | Default | Meaning |
 |:---- |:-------- |:------- |:------- |
-| -enable-monitoring | - | disabled | Enable 'i915_monitoring' resource that provides access to all Intel GPU devices on the node |
+| -enable-monitoring | - | disabled | Enable '*_monitoring' resource that provides access to all Intel GPU devices on the node, [see use](./monitoring.md) |
 | -resource-manager | - | disabled | Enable fractional resource management, [see use](./fractional.md) |
 | -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
 | -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to next, and _none_ selects first available device from kubelet. Default is _none_. Allocation policy does not have an effect when resource manager is enabled. |
diff --git a/cmd/gpu_plugin/monitoring.md b/cmd/gpu_plugin/monitoring.md
new file mode 100644
index 000000000..3b3050aeb
--- /dev/null
+++ b/cmd/gpu_plugin/monitoring.md
@@ -0,0 +1,32 @@
+# Monitoring GPUs
+
+## i915_monitoring resource
+
+The GPU plugin can be configured to register a monitoring resource for nodes that have Intel GPUs. `gpu.intel.com/i915_monitoring` (or `gpu.intel.com/xe_monitoring`) is a singular resource on the nodes. A container requesting it will get access to _all_ the Intel GPUs (`i915` or `xe` KMD device files) on the node. The idea behind this resource is to allow the container to _monitor_ the GPUs. A container requesting the `i915_monitoring` resource would typically export data to some metrics consumer. An example of such a consumer is [Prometheus](https://prometheus.io/).
+
+![Monitoring Pod listening to all GPUs while one Pod is using a GPU.](monitoring.png)
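A container gets the GPU device files by requesting the monitoring resource like any other extended resource. A minimal sketch of such a Pod, with a placeholder image name (not something this repository provides):

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-metrics-exporter        # hypothetical name
spec:
  containers:
  - name: exporter
    image: example.com/gpu-exporter:latest   # placeholder image
    resources:
      limits:
        # Grants access to all Intel GPU device files on the node
        gpu.intel.com/i915_monitoring: 1
```

Because the resource is singular, only one container per node can hold it at a time, which matches the one-monitor-per-node daemonset pattern used below.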
+
+For the monitoring applications, there are two possibilities: [Intel XPU Manager](https://github.com/intel/xpumanager/) and [collectd](https://github.com/collectd/collectd/tree/collectd-6.0). Intel XPU Manager is readily available as a container and with a deployment yaml. collectd has Intel GPU support in its 6.0 branch, but no public containers are available for it.
+
+To deploy XPU Manager to a cluster, run the following kubectl command:
+```
+$ kubectl apply -k https://github.com/intel/xpumanager/deployment/kubernetes/daemonset/base
+```
+
+This deploys an XPU Manager daemonset that runs on all the nodes exposing the `i915_monitoring` resource.
+
+## Prometheus integration with XPU Manager
+
+To deploy Prometheus to a cluster, see [this page](https://prometheus-operator.dev/docs/user-guides/getting-started/). One can also use Prometheus' [helm chart](https://github.com/prometheus-community/helm-charts).
+
+Prometheus requires additional Kubernetes configuration so that it can fetch GPU metrics. The following steps add a Kubernetes Service and a ServiceMonitor component, which instruct Prometheus how and from where to retrieve the metrics:
+
+```
+$ kubectl apply -f https://raw.githubusercontent.com/intel/xpumanager/master/deployment/kubernetes/monitoring/service-intel-xpum.yaml
+$ kubectl apply -f https://raw.githubusercontent.com/intel/xpumanager/master/deployment/kubernetes/monitoring/servicemonitor-intel-xpum.yaml
+```
+
+With those components in place, one can query Intel GPU metrics from Prometheus using the `xpum_` prefix.
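For example, once Prometheus is scraping the target, the available XPU Manager series can be listed in its expression browser with a metric-name selector (the exact metric names depend on the XPU Manager version):

```
{__name__=~"xpum_.*"}
```

Individual metrics can then be graphed or alerted on by name like any other Prometheus series.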
diff --git a/cmd/gpu_plugin/monitoring.png b/cmd/gpu_plugin/monitoring.png
new file mode 100644
index 000000000..c56fc5057
Binary files /dev/null and b/cmd/gpu_plugin/monitoring.png differ
diff --git a/deployments/xpumanager_sidecar/kustom/kustom_xpumanager.yaml b/deployments/xpumanager_sidecar/kustom/kustom_xpumanager.yaml
index 69acf5898..3ce726271 100644
--- a/deployments/xpumanager_sidecar/kustom/kustom_xpumanager.yaml
+++ b/deployments/xpumanager_sidecar/kustom/kustom_xpumanager.yaml
@@ -27,8 +27,3 @@ spec:
            - ALL
          readOnlyRootFilesystem: true
          runAsUser: 0
-      - name: xpumd
-        resources:
-          limits:
-            $patch: replace
-            gpu.intel.com/i915_monitoring: 1
diff --git a/deployments/xpumanager_sidecar/kustomization.yaml b/deployments/xpumanager_sidecar/kustomization.yaml
index 728397536..e60da135b 100644
--- a/deployments/xpumanager_sidecar/kustomization.yaml
+++ b/deployments/xpumanager_sidecar/kustomization.yaml
@@ -1,5 +1,5 @@
 resources:
-- https://raw.githubusercontent.com/intel/xpumanager/V1.2.18/deployment/kubernetes/daemonset-intel-xpum.yaml
+- https://raw.githubusercontent.com/intel/xpumanager/V1.2.29/deployment/kubernetes/daemonset/base/daemonset-intel-xpum.yaml
 namespace: monitoring
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization