Install refactor and ping-iperf bugfix #54

Merged
merged 19 commits into from
Oct 30, 2024
@@ -1,4 +1,4 @@
name: Build and Push Container Image
name: Build and Push Latest Container Image

on:
workflow_dispatch:
@@ -25,7 +25,6 @@ jobs:
uses: docker/metadata-action@v5
with:
images: quay.io/autopilot/autopilot
tags: ${{ steps.meta.outputs.tags }}

- name: Log into registry
if: github.event_name != 'pull_request'
@@ -40,4 +39,4 @@ jobs:
with:
context: autopilot-daemon
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta.outputs.tags }}
tags: "latest"
48 changes: 44 additions & 4 deletions .github/workflows/publish-release.yml
@@ -1,7 +1,7 @@
name: Publish Release
# This is a basic workflow to help you get started with Actions

name: Create New Release - Quay and Helm
on:
push:
tags: '*'
workflow_dispatch:

jobs:
@@ -19,7 +19,6 @@ jobs:
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "[email protected]"

- name: Install Helm
uses: azure/setup-helm@v3
with:
@@ -33,5 +32,46 @@ jobs:
with:
pages_branch: gh-pages
charts_dir: helm-charts
skip_existing: true
packages_with_index: true
token: ${{ secrets.GITHUB_TOKEN }}

docker:
runs-on: ubuntu-latest
steps:
- name: Remove unnecessary files
run: |
sudo rm -rf /usr/share/dotnet
sudo rm -rf /usr/local/lib/android

- name: Checkout
uses: actions/checkout@v3
with:
fetch-depth: 0

- name: Read helm chart version
run: echo "CHART_VERSION=$(grep '^version:' helm-charts/autopilot/Chart.yaml | cut -d ":" -f2 | tr -d ' ')" >> $GITHUB_ENV

- name: Checkout
uses: actions/checkout@v4

- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: quay.io/autopilot/autopilot
tags: ${{ env.CHART_VERSION }}

- name: Log into registry
uses: docker/login-action@v3
with:
registry: quay.io
username: ${{ secrets.QUAY_USERNAME }}
password: ${{ secrets.QUAY_PASSWORD }}

- name: Build and push
uses: docker/build-push-action@v5
with:
context: autopilot-daemon
push: true
tags: ${{ steps.meta.outputs.tags }}
65 changes: 65 additions & 0 deletions HEALTH_CHECKS.md
@@ -0,0 +1,65 @@
# Health Checks

Here is a breakdown of the existing health checks:

1. **PCIe Bandwidth Check (pciebw)**
- Description: Host-to-device connection speeds, one measurement per GPU. The codebase is from tag [v12.4.1](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/1_Utilities/bandwidthTest)
- Outputs: Pass/fail results based on PCIe bandwidth thresholds.
- Implementation: Compares bandwidth results to a threshold (e.g., 8 GB/s); if the measured bandwidth falls below the threshold, it triggers a failure (see the sketch after this list).
2. **GPU Memory Check (remapped)**
- Description: Information from nvidia-smi regarding GPU memory remapped rows.
- Outputs: Reports the state of GPU memory (normal/faulty).
- Implementation: Analyzes remapped rows information to assess potential GPU memory issues.
3. **GPU Memory Bandwidth Performance (gpumem)**
- Description: Memory bandwidth measurements using DAXPY and DGEMM.
- Outputs: Performance metrics (e.g., TFlops, power).
- Implementation: CUDA code that evaluates memory bandwidth and flags deviations from expected performance values.
4. **GPU Diagnostics (dcgm)**
- Description: Runs NVIDIA DCGM diagnostics using `dcgmi diag`.
- Outputs: Diagnostic results (pass/fail).
- Implementation: Analyzes GPU health, including memory, power, and thermal performance.
5. **PVC Create/Delete (pvc)**
- Description: Given a storage class, tests if a PVC can be created and deleted.
- Output: Pass/fail depending on the success or failure of creating and deleting a PVC. If either operation fails, the result is a failure.
- Implementation: Creation and deletion of a PVC through Kubernetes APIs.
6. **Network Reachability Check (ping)**
- Description: Pings between nodes to assess connectivity.
- Outputs: Pass/fail based on ping success.
- Implementation: all-to-all reachability test.
7. **Network Bandwidth Check (iperf)**
- Description: Tests network bandwidth by launching clients and servers on multiple interfaces through iperf3. Results are aggregated per interface. Further details can be found in [the dedicated page](autopilot-daemon/network/README.md).
- Outputs: Aggregate bandwidth on each interface, per node (in Gb/s).
- Implementation: Launches clients and servers in a ring topology over all network interfaces found on the pod that are exposed by network controllers such as the multi-NIC CNI, which exposes fast network interfaces in the pods requesting them. Does not run on `eth0`.
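
The pass/fail rule for the PCIe bandwidth check can be illustrated with a minimal sketch; the threshold and per-GPU values below are placeholders (they mirror the example output later in this PR), not the check's actual implementation:

```bash
# Sketch of the pciebw pass/fail rule: flag any GPU whose measured
# host-to-device bandwidth falls below the threshold (e.g., 8 GB/s).
THRESHOLD=8
BANDWIDTHS="12.3 12.3 12.3 12.3 5.3 12.3 12.3 12.3"  # one value per GPU (placeholder data)

for bw in $BANDWIDTHS; do
  if (( $(echo "$bw < $THRESHOLD" | bc -l) )); then
    echo "FAIL: measured ${bw} GB/s is below ${THRESHOLD} GB/s"
  fi
done
```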

These checks are configured to run periodically (e.g., hourly), and results are accessible via Prometheus, direct API queries or labels on the worker nodes.

## Deep Diagnostics and Node Labeling

Autopilot runs health checks periodically on GPU nodes, and if any of the health checks returns an error, the node is labeled with `autopilot.ibm.com/gpuhealth: ERR`. Otherwise, the label is set as `PASS`.
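
For example, the aggregate health label can be inspected directly with `kubectl` (a minimal sketch; the label key and values are the ones documented here):

```bash
# List worker nodes together with their aggregate GPU health label.
kubectl get nodes -L autopilot.ibm.com/gpuhealth

# Select only the nodes currently marked unhealthy.
kubectl get nodes -l autopilot.ibm.com/gpuhealth=ERR
```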

More extensive tests, namely DCGM diagnostics level 3, are executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running a level 3 DCGM diagnostic.
This type of diagnostics can help decide whether the worker node should be used for running workloads. To facilitate this task, Autopilot will label nodes with the key `autopilot.ibm.com/dcgm.level.3`.

If errors are found during the level 3 diagnostics, the label `autopilot.ibm.com/dcgm.level.3` will contain detailed information about the error in the following format:

`ERR_Year-Month-Date_Hour.Minute.UTC_Diagnostic_Test.gpuID,Diagnostic_Test.gpuID,...`

- `ERR`: An indicator that an error has occurred
- `Year-Month-Date_Hour.Minute.UTC`: Timestamp of completed diagnostics
- `Diagnostic_Test`: Name of the test that has failed (formatted to replace spaces with underscores)
- `gpuID`: ID of GPU where the failure has occurred

**Example:** `autopilot.ibm.com/dcgm.level.3=ERR_2024-10-10_19.12.03UTC_page_retirement_row_remap.0`

If there are no errors, the value is set to `PASS_Year-Month-Date_Hour.Minute.UTC`.
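
As a sketch, the label value can be read and split into its status and detail fields with standard shell tools; the node name `worker-0` is a placeholder and `jq` is assumed to be available:

```bash
# Read the dcgm.level.3 label from a node (node name is a placeholder).
VALUE=$(kubectl get node worker-0 -o json \
  | jq -r '.metadata.labels["autopilot.ibm.com/dcgm.level.3"]')

# Example value: ERR_2024-10-10_19.12.03UTC_page_retirement_row_remap.0
STATUS=${VALUE%%_*}    # "ERR" or "PASS"
DETAILS=${VALUE#*_}    # timestamp plus failing Diagnostic_Test.gpuID entries, if any
echo "status=${STATUS} details=${DETAILS}"
```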

### Logs and Metrics

All health check results are exported through Prometheus, but they can also be found in each pod's logs.

All metrics are accessible through Prometheus and Grafana dashboards. The gauge exposed is `autopilot_health_checks` and can be customized with the following filters (see the example after this list):

- `check`, select one or more specific health checks
- `node`, filter by node name
- `cpumodel` and `gpumodel`, for heterogeneous clusters
- `deviceid` to select specific GPUs, when available
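
For example, the gauge can be queried through the Prometheus HTTP API with any combination of these filters (a sketch; the Prometheus URL and node name are placeholders for your environment):

```bash
# Query the autopilot_health_checks gauge for one check on one node.
PROM_URL="http://prometheus.example.com:9090"   # placeholder endpoint
curl -G "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=autopilot_health_checks{check="pciebw", node="worker-0"}'
```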
187 changes: 5 additions & 182 deletions README.md
@@ -58,189 +58,12 @@ By default, the periodic checks list contains PCIe, rows remapping, GPUs power,

Results from health checks are exported as Prometheus Gauges, so that users and admins can easily check the status of the system on Grafana.

## Deep Diagnostics and Node Labeling
Detailed description of all the health checks, can be found in [HEALTH_CHECKS.md](HEALTH_CHECKS.md).

Autopilot runs health checks periodically and labels the nodes with `autopilot.ibm.com/gpuhealth: ERR` if any of the GPU health checks returns an error. Otherwise, health is set as `PASS`.
## Install

More extensive tests, namely DCGM diagnostics level 3, are executed automatically only on nodes that have free GPUs. This deeper analysis is needed to reveal problems in the GPUs that can be found only after running a level 3 DCGM diagnostic.
This type of diagnostics can help decide whether the worker node should be used for running workloads. To facilitate this task, Autopilot will label nodes with the key `autopilot.ibm.com/dcgm.level.3`.
To learn how to install Autopilot, please refer to [SETUP.md](SETUP.md)

If errors are found, the label `autopilot.ibm.com/dcgm.level.3` will contain the value `ERR`, a timestamp, the test(s) that failed and the GPU id(s) if available. Otherwise, the value is set to `PASS_timestamp`.
## Usage

### Logs and Metrics

All health check results are exported through Prometheus, but they can also be found in each pod's logs.

All metrics are accessible through Prometheus and Grafana dashboards. The gauge exposed is `autopilot_health_checks` and can be customized with the following filters:

- `check`, select one or more specific health checks
- `node`, filter by node name
- `cpumodel` and `gpumodel`, for heterogeneous clusters
- `deviceid` to select specific GPUs, when available

## Install Autopilot

Autopilot can be installed through Helm and needs admin privileges to create objects like services, serviceaccounts, namespaces and relevant RBAC.

**NOTE**: this install procedure does NOT allow the use of `--create-namespace` or `--namespace=autopilot` in the `helm` command. This is because our Helm chart creates a namespace with the label `openshift.io/cluster-monitoring: "true"`, so that Prometheus can scrape metrics. This applies to OpenShift clusters **only**. It is not possible, in Helm, to create namespaces with labels or annotations through the `--create-namespace` parameter, so we decided to create the namespace ourselves.

Therefore, we recommend one of the following options, which are mutually exclusive:

- Option 1: use `--create-namespace` with `--namespace=autopilot` in the helm CLI AND disable namespace creation in the helm chart config file with `namespace.create=false` (see the sketch after this list). **If on OpenShift**, then manually label the namespace for Prometheus with `label ns autopilot "openshift.io/cluster-monitoring"=`.
- Option 2: use `namespace.create=true` in the helm chart config file BUT do NOT use `--create-namespace` in the helm CLI. You can still use `--namespace` in the helm CLI, but it should be set to something else (e.g., `default`).
- Option 3: create the namespace by hand with `kubectl create namespace autopilot`, use `--namespace autopilot` in the helm CLI, and set `namespace.create=false` in the helm config file. **If on OpenShift**, then manually label the namespace for Prometheus with `label ns autopilot "openshift.io/cluster-monitoring"=`.
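
A rough sketch of Option 1, assuming the `autopilot` Helm repo and `your-config.yml` from the Install steps below, with `namespace.create=false` set in that config file; the labeling command mirrors the one given in the options above:

```bash
# Option 1: let helm create the namespace; the chart's own namespace
# creation is disabled via namespace.create=false in your-config.yml.
helm upgrade autopilot autopilot/autopilot --install \
  --create-namespace --namespace=autopilot \
  -f your-config.yml

# OpenShift only: label the namespace so Prometheus scrapes its metrics.
kubectl label ns autopilot "openshift.io/cluster-monitoring"=
```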

In the next release, we will remove the namespace from the Helm templates and will add OpenShift-only configurations separately.

### Requirements

- Need to install `helm-git` plugin

```bash
helm plugin install https://github.com/aslafy-z/helm-git --version 0.15.1
```

### Helm Chart customization

Helm charts values and how-to for customization can be found [here](https://github.com/IBM/autopilot/tree/main/autopilot-daemon/helm-charts/autopilot).

### Install

1) Add autopilot repo

```bash
helm repo add autopilot git+https://github.com/IBM/autopilot.git@autopilot-daemon/helm-charts/autopilot?ref=gh-pages
```

2) Install autopilot (idempotent command). The config file is for customizing the helm values. The namespace is where the helm chart will live, not the namespace where Autopilot runs.

```bash
helm upgrade autopilot autopilot/autopilot --install --namespace=<default> -f your-config.yml
```

The controllers should show up in the selected namespace

```bash
kubectl get po -n autopilot
```

```bash
NAME READY STATUS RESTARTS AGE
autopilot-daemon-autopilot-g7j6h 1/1 Running 0 70m
autopilot-daemon-autopilot-g822n 1/1 Running 0 70m
autopilot-daemon-autopilot-x6h8d 1/1 Running 0 70m
autopilot-daemon-autopilot-xhntv 1/1 Running 0 70m
```

### Uninstall

```bash
helm uninstall autopilot # -n <default>
```

## Manually Query the Autopilot Service

Autopilot provides a `/status` handler that can be queried to get the entire system status, meaning that it will run all the tests on all the nodes. Autopilot is reachable in-cluster only, at the service name `autopilot-healthchecks.autopilot.svc`, meaning it can be reached from a pod running in the cluster or through port forwarding (see below).

Health check names are `pciebw`, `dcgm`, `remapped`, `ping`, `iperf`, `pvc`, `gpumem`.

For example, using port forwarding to localhost or by exposing the service

```bash
kubectl port-forward service/autopilot-healthchecks 3333:3333 -n autopilot
# or kubectl expose service autopilot-healthchecks -n autopilot
```

If using port forward, then launch `curl` on another terminal

```bash
curl "http://localhost:3333/status?check=pciebw&host=nodename1"
```

Alternatively, retrieve the route with `kubectl get routes autopilot-healthchecks -n autopilot`

```bash
curl "http://<route-name>/status?check=pciebw&host=nodename1"
```

All tests can be tailored by a combination of the following parameters (see the example after this list):

- `host=<hostname1,hostname2,...>`, to run all tests on a specific node or on a comma separated list of nodes.
- `check=<healthcheck1,healthcheck2,...>`, to run a single test (`pciebw`, `dcgm`, `remapped`, `gpumem`, `ping`, `iperf` or `all`) or a comma-separated list of tests. When no parameters are specified, only the `pciebw`, `dcgm`, `remapped`, `ping` tests are run.
- `batch=<#hosts>`, how many hosts to check at a single moment. Requests to the batch are run in parallel asynchronously. Batching is done to avoid running too many requests in parallel when the number of worker nodes increases. Defaults to all nodes.
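
For instance, the parameters can be combined in a single request (a sketch reusing the port-forward address from above; node names are placeholders):

```bash
# Run the PCIe bandwidth and DCGM checks on two specific nodes, two hosts at a time.
curl "http://localhost:3333/status?check=pciebw,dcgm&host=nodename1,nodename2&batch=2"
```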

Some health checks provide further customization.

### DCGM

This test runs `dcgmi diag`, and we support only `r` as a [parameter](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html#command-line-options).

The default is `1`, but it can be customized, e.g., `/status?check=dcgm&r=2` (see the example below).
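
For example, a level 2 run against a single node through the port-forwarded service might look like this (the node name is a placeholder):

```bash
# Request a DCGM diagnostic at level 2 on one node.
curl "http://localhost:3333/status?check=dcgm&r=2&host=nodename1"
```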

### Network Bandwidth Validation with IPERF

As part of this workload, Autopilot will generate the ring workload and then start `iperf3` servers on each interface on each Autopilot pod, based on the configuration options provided by the user. Only after the `iperf3` servers are started does Autopilot begin executing the workload, by starting `iperf3` clients based on the configuration options provided by the user. All results are logged back to the user.

- For each network interface on each node, an `iperf3` server is started. The number of `iperf3` servers depends on the number of clients intended to be run. For example, if the number of clients is `8`, then `8` `iperf3` servers are started per interface, each on a unique port.

- For each timestep, all pairs are executed simultaneously. For each pair, some number of clients are started in parallel and run for `5 seconds` using zero-copies against the respective `iperf3` server.
- Metrics such as the minimum, maximum, mean, and aggregate bitrates and transfers are tracked. The results are stored as `JSON` in the respective pod, as well as summarized and dumped into the pod logs.
- Invocation from the exposed Autopilot API is as follows:

```bash
# Invoked via the `status` handle:
curl "http://autopilot-healthchecks-autopilot.<domain>/status?check=iperf&workload=ring&pclients=<NUMBER_OF_IPERF3_CLIENTS>&startport=<STARTING_IPERF3_SERVER_PORT>"

# Invoked via the `iperf` handle directly:
curl "http://autopilot-healthchecks-autopilot.<domain>/iperf?workload=ring&pclients=<NUMBER_OF_IPERF3_CLIENTS>&startport=<STARTING_IPERF3_SERVER_PORT>"
```

### Concrete Example

In this example, we target one node, check the PCIe bandwidth, and use the port-forwarding method.
In this scenario, we have a value lower than `8 GB/s`, which results in an alert. This error will be exported to the OpenShift web console and on Slack, if that is enabled by admins.

```bash
curl "http://127.0.0.1:3333/status?check=pciebw"
```

The output of the command above will be similar to the following (edited to save space):

```bash
Checking status on all nodes
Autopilot Endpoint: 10.128.6.187
Node: hostname
url(s): http://10.128.6.187:3333/status?host=hostname&check=pciebw
Response:
Checking system status of host hostname (localhost)

[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
[[ PCIEBW ]] FAIL
Host hostname
12.3 12.3 12.3 12.3 5.3 12.3 12.3 12.3

Node Status: PCIE Failed
-------------------------------------


Autopilot Endpoint: 10.131.4.93
Node: hostname2
url(s): http://10.131.4.93:3333/status?host=hostname2&check=pciebw
Response:
Checking system status of host hostname2 (localhost)

[[ PCIEBW ]] Briefings completed. Continue with PCIe Bandwidth evaluation.
[[ PCIEBW ]] SUCCESS
Host hostname2
12.1 12.0 12.3 12.3 11.9 11.5 12.1 12.1

Node Status: Ok
-------------------------------------

Node Summary:

{'hostname': ['PCIE Failed'],
'hostname2': ['Ok']}

runtime: 31.845192193984985 sec
```
To learn how to invoke health checks, please refer to [USAGE.md](USAGE.md).