This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Fix document using DRI tickets #4828

Merged
merged 8 commits into from
Sep 22, 2020
7 changes: 4 additions & 3 deletions docs/manual/cluster-admin/README.md
@@ -15,6 +15,7 @@ This manual is for cluster administrators to learn the installation and uninstal
7. [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md)
8. [How to use CPU Nodes](./how-to-use-cpu-nodes.md)
9. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
-10. [Troubleshooting](./troubleshooting.md)
-11. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
-12. [Upgrade Guide](./upgrade-guide.md)
+10. [Alerting-and-Troubleshooting](./alerting-and-troubleshooting.md)
+11. [Recommended Practice](./recommended-practice.md)
+12. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
+13. [Upgrade Guide](./upgrade-guide.md)
130 changes: 130 additions & 0 deletions docs/manual/cluster-admin/alerting-and-troubleshooting.md
@@ -0,0 +1,130 @@
# Alerting and Troubleshooting

OpenPAI uses [Prometheus](https://prometheus.io/) to monitor the system. You can view the monitoring information [on the webportal](./basic-management-operations.md#management-on-webportal). For alerting, OpenPAI uses [alert manager](https://prometheus.io/docs/alerting/latest/alertmanager/), but it is not set up in the default installation. This document describes how to set up the alert manager and how to deal with some common alerts. It also includes some other troubleshooting cases from practice.

## Set Up Alert Manager

OpenPAI's alert manager is set up to send alerting e-mails when an alert happens. To begin with, you need an SMTP account to send these e-mails.

After getting an SMTP account, you can set up the alert manager in PAI. Please read the document about [service management and paictl](./basic-management-operations.md#pai-service-management-and-paictl) first, and start a dev box container. Then, in the dev box container, pull the configuration by:

```bash
./paictl config pull -o /cluster-configuration
```

Uncomment the alert manager section in `/cluster-configuration/services-configuration.yaml`, and set your SMTP account and the receiver's e-mail address. Here is an example:

```yaml
alert-manager:
  port: 9093
  receiver: <receiver-email-address>
  smtp_auth_password: <smtp-password>
  smtp_auth_username: <smtp-username>
  smtp_from: <smtp-email-address>
  smtp_url: <smtp-server>:<smtp-port>
```

The `port` field is the port of the alert manager; in most cases, you don't need to change it. The `receiver` field is usually set to the administrator's e-mail address, which will receive the alerting e-mails.

Save the configuration file, then push it and start the alert manager with:

```bash
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
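
To verify that the alert manager is running, you can list its pod from the dev box container; a minimal check, assuming the pod name contains `alert-manager`:

```bash
# The alert manager pod should reach the Running status
kubectl get pod | grep alert-manager
```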

After the alert manager is successfully started, the receiver will get alerting e-mails from the configured SMTP account. You can also view the alerting information on the webportal (in the top-right corner):

<img src="./imgs/alert-on-webportal.png" width="100%" height="100%" />


## Troubleshooting

### PaiServicePodNotReady Alert

This kind of alert comes from the alert manager and is usually caused by a container being killed by the operator or by the OOM killer. To check whether it was killed by the OOM killer, inspect the node's free memory via Prometheus:

1. Visit the Prometheus web page, usually at `http://<your-pai-master-ip>:9091`.
2. Enter the query `node_memory_MemFree_bytes`.
3. If the free memory dropped to near 0, the container was likely killed by the OOM killer.
4. You can double-check this by logging into the node, running `dmesg`, and looking for the phrase `oom` (see the sketch below).
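
A minimal sketch of step 4 on the node (the exact kernel message wording varies across kernel versions):

```bash
# Look for OOM killer activity in the kernel log, with human-readable timestamps
sudo dmesg -T | grep -i -E "out of memory|oom"
```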

Solutions:

1. Force-remove the unhealthy pod with this command in a terminal:
   `kubectl delete pod pod-name --grace-period=0 --force`
2. Kubernetes will recreate the pod. This operation may block indefinitely because dockerd may not be functioning correctly after an OOM event. If the recreation is blocked for too long, you can log into the node and restart dockerd via `/etc/init.d/docker restart`.

### NodeNotReady Alert

This kind of alert comes from the alert manager and is reported by the watchdog service. Watchdog gets such metrics from the Kubernetes API. An example metric looks like:

```
pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_exporter",memory_pressure="false",name="10.0.0.2",out_of_disk="false",pai_service_name="watchdog",ready="true",scraped_from="watchdog-5ddd945975-kwhpr"}
```

The `name` label indicates which node this metric represents.

If the node's `ready` label has the value `unknown`, the node may have disconnected from the Kubernetes master. This may be due to several reasons:

- Node is down
- Kubelet is down
- Network partition between node and Kubernetes master

First, try to log into the node. If you cannot, and there is no ping response, the node may be down and you should boot it up.

If you can log in to the node, check whether the kubelet is up and running: execute `sudo docker ps` on the node; normally the kubelet container will be listed. The output should look like:

```
a66a171434cc gcr.io/google_containers/hyperkube:v1.9.9 "/hyperkube kubele..." 2 weeks ago Up 2 weeks kubelet
```
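
If the kubelet runs as a container like above, you can read its log through docker; a minimal sketch, assuming the container is named `kubelet` as in the listing:

```bash
# Show the most recent kubelet log lines; the container name may differ on your deployment
sudo docker logs --tail 200 kubelet
```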

Then check the kubelet log to see whether the kubelet can access the Kubernetes API. If you see something like:

```
E0410 04:24:30.663050 2491 kubelet_node_status.go:386] Error updating node status, will retry: error getting node "10.0.0.1": Get http://10.0.1.2:8080/api/v1/nodes/10.0.0.1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

This means the node cannot report its status to Kubernetes; Kubernetes therefore marks the node's status as unknown, which triggers the alert.

You should check what caused this connectivity problem.
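
For example, you can probe the API server address that appears in the kubelet log from the affected node; a minimal sketch, using the address `http://10.0.1.2:8080` from the example log above:

```bash
# A healthy API server answers "ok"; a timeout points to a network or firewall problem
curl --max-time 5 http://10.0.1.2:8080/healthz
```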

### NodeFilesystemUsage Alert

This kind of alert comes from the alert manager and monitors the disk space of each server. If the disk usage is greater than 80%, the alert is triggered. Two OpenPAI components may use a lot of disk space: the storage manager and the docker image cache. If there are other programs on the OpenPAI servers, check them as well to make sure the disk usage is not caused outside of OpenPAI.

Solutions:

1. Check the user files on the NFS storage server launched by the storage manager. If you didn't set up a storage manager, skip this step.
2. Check the docker cache. Docker may use too much disk space for caching; it's worth a check (see the sketch after this list).
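
A minimal sketch for locating the disk consumers on the affected server; the `/share` path assumes the default storage manager setup described in this manual:

```bash
# Overall file system usage on this server
df -h
# Disk space used by docker images, containers, and the build cache
sudo docker system df
# If the storage manager runs on this server, see which shares are large
sudo du -sh /share/* 2>/dev/null
```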

### NVIDIA GPU is Not Detected

If you cannot use GPU in your job, please check the following items on the corresponding worker node:

1. The NVIDIA drivers should be installed correctly. Use `nvidia-smi` to confirm.
2. [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) is installed, and configured as the default runtime of docker. Use `docker info -f "{{json .DefaultRuntime}}"` to confirm (see the sketch below).
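
A quick sketch that runs both checks on the worker node:

```bash
# 1. The driver should list the GPUs
nvidia-smi
# 2. "nvidia" should be the default runtime and appear among the registered runtimes
docker info -f "{{json .DefaultRuntime}}"
docker info -f "{{json .Runtimes}}"
```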

If the GPU number shown in webportal is wrong, check the [hivedscheduler and VC configuration](./how-to-set-up-virtual-clusters.md).

### Cannot See Utilization Information

If you cannot see utilization information (e.g. GPU, CPU, and network usage) in the cluster, please check whether the services `prometheus`, `grafana`, `job-exporter`, and `node-exporter` are working.

In detail, you can [exec into a dev box container](./basic-management-operations.md#pai-service-management-and-paictl), then check the service status with `kubectl get pod`. You can view a pod's log with `kubectl logs <pod-name>`. After you fix the problem, you can [restart the whole cluster using paictl](./basic-management-operations.md#pai-service-management-and-paictl).
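
A minimal sketch of this check inside the dev box container; the grep pattern only assumes the pods are named after the services:

```bash
# Find the monitoring pods and make sure they are Running
kubectl get pod | grep -E "prometheus|grafana|job-exporter|node-exporter"
# Inspect the log of a pod that is not healthy (replace <pod-name>)
kubectl logs <pod-name>
```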


### Node is De-allocated and doesn't Appear in Kubernetes System when it Comes Back

Working nodes can be de-allocated if you are using a cloud service and set up PAI on low-priority machines. Usually, if the node is lost temporarily, you can wait until the node comes back. It doesn't need any special care.

However, some cloud service providers not only de-allocate nodes but also remove all disk contents on those nodes. In that case the node cannot rejoin Kubernetes automatically when it comes back. If this is your case, we recommend setting up a crontab job on the dev box node to bring back these nodes periodically.

In [How to Add and Remove Nodes](how-to-add-and-remove-nodes.md), we have described how to add a node. The crontab job doesn't need to do all of those steps; it only needs to add the node back to Kubernetes. It should figure out which nodes have come back but are still considered `NotReady` in Kubernetes, and then run the following command to bring them back:

```bash
ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=${limit_list} -e "@inventory/mycluster/openpai.yml"
```

`${limit_list}` stands for the names of these de-allocated nodes. For example, if the crontab job finds that node `a` and node `b` are available now but are still in `NotReady` status in Kubernetes, it can set `limit_list=a,b`.
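
Below is a minimal sketch of such a periodic script. It assumes it runs on the dev box node, that `kubectl` can reach the cluster, and that a node counts as "back" when it answers ping while Kubernetes still reports it `NotReady`; adapt the paths and the reachability test to your environment:

```bash
#!/bin/bash
# Re-add de-allocated nodes that have come back but are still NotReady in Kubernetes.
cd "$HOME/pai-deploy/kubespray" || exit 1

# Node names that Kubernetes reports as NotReady
not_ready=$(kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}')

limit_list=""
for node in ${not_ready}; do
  # Treat a node as "back" if it answers a single ping
  if ping -c 1 -W 2 "${node}" > /dev/null 2>&1; then
    limit_list="${limit_list:+${limit_list},}${node}"
  fi
done

if [ -n "${limit_list}" ]; then
  ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml \
    --become --become-user=root --limit="${limit_list}" \
    -e "@inventory/mycluster/openpai.yml"
fi
```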
6 changes: 3 additions & 3 deletions docs/manual/cluster-admin/how-to-add-and-remove-nodes.md
@@ -24,9 +24,9 @@ Log in to your dev box machine, find [the pre-kept folder `~/pai-deploy`](./inst

### Add the Nodes into Kubernetes

-Find the file `~/pai-deploy/kubespray/inventory/pai/host.yml`, and follow the steps below to modify it.
+Find the file `~/pai-deploy/kubespray/inventory/pai/hosts.yml`, and follow the steps below to modify it.

-Supposing you want to add 2 worker nodes into your cluster and their hostnames are `a` and `b`. Add these 2 nodes into the `host.yml`. An example:
+Supposing you want to add 2 worker nodes into your cluster and their hostnames are `a` and `b`. Add these 2 nodes into the `hosts.yml`. An example:

```yaml
all:
```

@@ -157,7 +157,7 @@ If you have configured any PV/PVC storage, please confirm the added worker node

Please refer to the operation of adding nodes; the steps are very similar.

-First, modify `host.yml` accordingly, then go into `~/pai-deploy/kubespray/`, run
+First, modify `hosts.yml` accordingly, then go into `~/pai-deploy/kubespray/`, run

```bash
ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=a,b -e "@inventory/mycluster/openpai.yml"
```
154 changes: 153 additions & 1 deletion docs/manual/cluster-admin/how-to-set-up-storage.md
@@ -238,4 +238,156 @@ The GET request must use header `Authorization: Bearer <token>` for authorizatio
}
```

Do not omit any fields in `extension` or it will change the `virtualClusters` setting unexpectedly.

## Example: Use Storage Manager to Create an NFS + SAMBA Server

To help you set up the storage, OpenPAI provides a storage manager, which can set up an NFS + SAMBA server. Inside the cluster, the NFS storage can be accessed in OpenPAI containers. Outside the cluster, users can mount the storage on Unix-like systems, or access it in File Explorer on Windows.

Please read the document about [service management and paictl](./basic-management-operations.md#pai-service-management-and-paictl) first, and start a dev box container. Then, in the dev box container, pull the configuration by:

```bash
./paictl config pull -o /cluster-configuration
```

To use the storage manager, you should first choose a machine in the PAI system to be the storage server. The machine **must** be one of the PAI workers, not the PAI master. Please open `/cluster-configuration/layout.yaml`, choose a worker machine, then add a `pai-storage: "true"` field to it. Here is an example of the edited `layout.yaml`:

```yaml
......

- hostname: worker1
  nodename: worker1
  hostip: 10.0.0.1
  machine-type: GENERIC-WORKER
  pai-worker: "true"
  pai-storage: "true" # this line is newly added

......
```

In this tutorial, we assume you choose the machine with IP `10.0.0.1` as the storage server. Then, in `/cluster-configuration/services-configuration.yaml`, find the storage manager section:

```yaml
# storage-manager:
#   localpath: /share
#   security-type: AUTO
#   workgroup: WORKGROUP
#   smbuser: smbuser
#   smbpwd: smbpwd
```

Uncomment it like:

```yaml
storage-manager:
  localpath: /share
  # security-type: AUTO
  # workgroup: WORKGROUP
  smbuser: smbuser
  smbpwd: smbpwd
```

The `localpath` determines the root data directory for NFS on the storage server. The `smbuser` and `smbpwd` determine the username and password used when you access the storage in File Explorer on Windows.

Follow these commands to start the storage manager:

```bash
./paictl.py service stop -n cluster-configuration storage-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n cluster-configuration storage-manager
```

If the storage manager is successfully started, you will find the folders `/share/data` and `/share/users` on the storage server. On an Ubuntu machine, you can use the following commands to test whether the NFS server is correctly set up:

```bash
# replace 10.0.0.1 with your storage server IP
sudo apt update
sudo apt install nfs-common
sudo mkdir -p /mnt/data
sudo mount -t nfs --options nfsvers=4.1 10.0.0.1:/data/ /mnt/data
```
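
You can also list the exports published by the server; a quick check, assuming `nfs-common` is installed as above:

```bash
# replace 10.0.0.1 with your storage server IP
showmount -e 10.0.0.1
```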

To make the NFS storage available in PAI, we should create a PV and a PVC for it. First, create the following `nfs-storage.yaml` file in the dev box container:

```yaml
# replace 10.0.0.1 with your storage server IP
# NFS Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-storage-pv
  labels:
    name: nfs-storage
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - nfsvers=4.1
  nfs:
    path: /data
    server: 10.0.0.1
---
# NFS Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-storage
  # labels:
  #   share: "false" # to mount sub path on PAI
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi # no more than PV capacity
  selector:
    matchLabels:
      name: nfs-storage # corresponding to PV label
```

Use `kubectl create -f nfs-storage.yaml` to create the PV and PVC.
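
You can then confirm that the claim is bound; a minimal check in the dev box container:

```bash
# The PVC should reach the Bound status and reference nfs-storage-pv
kubectl get pv nfs-storage-pv
kubectl get pvc nfs-storage
```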

Since a Kubernetes NFS PV requires the node that uses it to have the corresponding driver, we should run `apt install nfs-common` on every worker node to install the `nfs-common` package.

Finally, [assign storage to PAI groups](#assign-storage-to-pai-groups) through the rest-server API. Then you can mount it into job containers.

To upload data to the storage server: on Windows, open File Explorer, type in `\\10.0.0.1` (replace `10.0.0.1` with your storage server IP), and press ENTER. File Explorer will ask you for authorization; use `smbuser` and `smbpwd` as the username and password to log in. On a Unix-like system, you can mount the NFS folder into the file system. For example, on Ubuntu, use the following commands to mount it:

```bash
# replace 10.0.0.1 with your storage server IP
sudo apt update
sudo apt install nfs-common
sudo mkdir -p /mnt/data
sudo mount -t nfs --options nfsvers=4.1 10.0.0.1:/data/ /mnt/data
```

The above steps only set up a basic SAMBA server, so all users share the same username and password to access it on Windows. If your cluster is in [AAD mode](./how-to-manage-users-and-groups.md#users-and-groups-in-aad-mode) and you want to integrate the SAMBA server with the AAD system, please refer to the following storage manager configuration:

```yaml
storage-manager:
  workgroup: # workgroup
  security-type: ADS
  default_realm: # default realm
  krb5_realms: # realms
    XXX1: # realm name
      kdc: # kdc
      default_domain: # default domain
    XXX2: # realm name
      kdc: # kdc
      default_domain: # default domain
  domain_realm: # domain realm
    kdc: # kdc
    default_domain: # default domain
  domainuser: # domain user
  domainpwd: # password of domain user
  idmap: # idmap
    - "idmap config XXX1"
    - "idmap config XXX2"
    - "idmap config XXX3"
    - "idmap config XXX4"
- "idmap config XXX4"
```
Loading