feat(nodeResources): add gpu resource filter and example #602

Merged Jan 6, 2025 (1 commit)
45 changes: 45 additions & 0 deletions docs/source/analyze/node-resources.md
@@ -29,6 +29,9 @@ All filters can be integers or strings that are parsed using the Kubernetes reso
| `ephemeralStorageAllocatable` | The amount of ephemeral storage on the node after Kubernetes is running |
| `matchLabel` | Specific selector label or labels the node must contain in its metadata |
| `matchExpressions` | A list of selector label expressions that the node needs to match in its metadata |
| `resourceName` | The name of the resource to filter on. Useful for custom resources such as GPUs |
| `resourceCapacity` | The amount of the named resource available to the node |
| `resourceAllocatable` | The amount of the named resource that is allocatable after the Kubernetes components have started |


CPU and Memory units are expressed as Go [Quantities](https://pkg.go.dev/k8s.io/apimachinery/pkg/api/resource#Quantity): `16Gi`, `8Mi`, `1.5m`, `5` etc.
@@ -184,6 +187,48 @@ Troubleshoot allows users to analyze nodes that match one or more labels. For ex
message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
```

### Filter by GPU resources
Use `resourceName` to filter on custom resources. For example, to filter on GPU resources, set `resourceName` to `nvidia.com/gpu`.
The `resourceCapacity` and `resourceAllocatable` filters then match against the node's capacity and allocatable amount of that resource.

```yaml
- nodeResources:
    checkName: Must have at least 1 node with 1 GPU
    filters:
      resourceName: nvidia.com/gpu
      resourceCapacity: "1"
    outcomes:
      - pass:
          when: "count() >= 1"
          message: "Found {{ .NodeCount }} nodes with at least 1 GPU"
      - fail:
          message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
```

```yaml
- nodeResources:
    checkName: Must have at least 4 Intel i915 GPUs in the cluster
    filters:
      resourceName: gpu.intel.com/i915
    outcomes:
      - pass:
          when: "min(resourceAllocatable) > 4"
          message: "This application requires at least 4 Intel i915 GPUs"
      - fail:
          message: "{{ .NodeCount }} nodes do not meet the minimum requirements"
```

```yaml
- nodeResources:
    filters:
      resourceName: nvidia.com/gpu
    checkName: Must have at least 3 GPU-enabled nodes in the cluster
    outcomes:
      - pass:
          when: "count() >= 3"
          message: "This application requires at least 3 GPU-enabled nodes"
```
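
The examples above filter on `resourceCapacity` or on `resourceName` alone. As a minimal sketch, the same pattern could be applied with the `resourceAllocatable` filter. This assumes `resourceAllocatable` is compared as a minimum in the same way as `resourceCapacity`. The value `"2"`, the check name, and the messages are illustrative only.

```yaml
- nodeResources:
    checkName: Must have at least 1 node with 2 allocatable GPUs
    filters:
      resourceName: nvidia.com/gpu
      # Assumption: matches nodes whose allocatable amount of
      # nvidia.com/gpu is at least 2, mirroring resourceCapacity above.
      resourceAllocatable: "2"
    outcomes:
      - pass:
          when: "count() >= 1"
          message: "Found {{ .NodeCount }} nodes with at least 2 allocatable GPUs"
      - fail:
          message: "No node has at least 2 allocatable GPUs"
```

Filtering on the allocatable amount rather than capacity may be preferable when the check should reflect what workloads can actually request on the node.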

## Message Templating
To make the outcome message more informative, you can include certain values gathered by the NodeResources collector as templates. The templates are enclosed in double curly braces with a dot separator. The following templates are available:
