feat(nodeResources): add gpu resource filter and example
DexterYan committed Dec 30, 2024
1 parent e3b5d40 commit a4a3273
Showing 1 changed file with 45 additions and 0 deletions.
45 changes: 45 additions & 0 deletions docs/source/analyze/node-resources.md
@@ -29,6 +29,9 @@ All filters can be integers or strings that are parsed using the Kubernetes reso
| `ephemeralStorageAllocatable` | The amount of ephemeral storage on the node after Kubernetes is running |
| `matchLabel` | Specific selector label or labels the node must contain in its metadata |
| `matchExpressions` | A list of selector label expressions that the node needs to match in its metadata |
| `resourceName` | The name of the resource to filter on, such as `nvidia.com/gpu`. This is useful for filtering on custom resources |
| `resourceCapacity` | The amount of the resource named by `resourceName` that is available to the node |
| `resourceAllocatable` | The amount of the resource named by `resourceName` that is allocatable after the Kubernetes components have been started |


CPU and Memory units are expressed as Go [Quantities](https://pkg.go.dev/k8s.io/apimachinery/pkg/api/resource#Quantity): `16Gi`, `8Mi`, `1.5m`, `5` etc.
@@ -184,6 +187,48 @@ Troubleshoot allows users to analyze nodes that match one or more labels. For ex
message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
```
### Filter by GPU resources
Use `resourceName` to filter on custom resources. For example, to filter on NVIDIA GPUs, set `resourceName` to `nvidia.com/gpu`.
The `resourceCapacity` and `resourceAllocatable` filters then select nodes by the capacity and the allocatable amount of that named resource.

```yaml
- nodeResources:
    checkName: Must have at least 1 node with 1 GPU
    filters:
      resourceName: nvidia.com/gpu
      resourceCapacity: "1"
    outcomes:
      - pass:
          when: "count() >= 1"
          message: "Found {{ .NodeCount }} nodes with at least 1 GPU"
      - fail:
          message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
```

```yaml
- nodeResources:
    checkName: Must have at least 4 Intel i915 GPUs in the cluster
    filters:
      resourceName: gpu.intel.com/i915
    outcomes:
      - pass:
          # ">=" matches the "at least 4" wording of the check name and message
          when: "min(resourceAllocatable) >= 4"
          message: "This application requires at least 4 Intel i915 GPUs"
      - fail:
          message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
```

```yaml
- nodeResources:
    filters:
      resourceName: nvidia.com/gpu
    checkName: Must have at least 3 GPU-enabled nodes in the cluster
    outcomes:
      - pass:
          when: "count() >= 3"
          message: "This application requires at least 3 GPU-enabled nodes"
```
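
The examples above use `resourceCapacity` as a filter, but not `resourceAllocatable`. The following is a minimal sketch of a check that filters on allocatable GPUs instead. It assumes that `resourceAllocatable`, like `resourceCapacity` in the first example, is treated as a minimum that a node must meet. The check name, threshold, and messages are illustrative only.

```yaml
- nodeResources:
    checkName: Must have at least 1 node with 2 allocatable GPUs
    filters:
      resourceName: nvidia.com/gpu
      # Assumption: nodes with fewer than 2 allocatable GPUs are filtered out
      resourceAllocatable: "2"
    outcomes:
      - pass:
          when: "count() >= 1"
          message: "Found {{ .NodeCount }} nodes with at least 2 allocatable GPUs"
      - fail:
          message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
```

Filtering on the allocatable amount rather than capacity may be the better fit when the check should reflect what workloads can actually request on the node.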

## Message Templating
To make the outcome message more informative, you can include certain values gathered by the NodeResources collector as templates. The templates are enclosed in double curly braces with a dot separator. The following templates are available:
