This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Fix document using DRI tickets #4828

Merged
merged 8 commits into from
Sep 22, 2020
7 changes: 4 additions & 3 deletions docs/manual/cluster-admin/README.md
@@ -15,6 +15,7 @@ This manual is for cluster administrators to learn the installation and uninstal
7. [How to Add and Remove Nodes](./how-to-add-and-remove-nodes.md)
8. [How to use CPU Nodes](./how-to-use-cpu-nodes.md)
9. [How to Customize Cluster by Plugins](./how-to-customize-cluster-by-plugins.md)
-10. [Troubleshooting](./troubleshooting.md)
-11. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
-12. [Upgrade Guide](./upgrade-guide.md)
+10. [Alerting-and-Troubleshooting](./alerting-and-troubleshooting.md)
+11. [Recommended Practice](./recommended-practice.md)
+12. [How to Uninstall OpenPAI](./how-to-uninstall-openpai.md)
+13. [Upgrade Guide](./upgrade-guide.md)
130 changes: 130 additions & 0 deletions docs/manual/cluster-admin/alerting-and-troubleshooting.md
@@ -0,0 +1,130 @@
# Alerting and Troubleshooting

OpenPAI uses [Prometheus](https://prometheus.io/) to monitor the system. You can view the monitoring information [on the webportal](./basic-management-operations.md#management-on-webportal). For alerting, OpenPAI uses [alert manager](https://prometheus.io/docs/alerting/latest/alertmanager/), but it is not set up in the default installation. This document describes how to set up the alert manager and how to deal with some common alerts. It also includes some other troubleshooting cases from practice.

## Set Up Alert Manager

OpenPAI's alert manager is set up to send alerting e-mails when an alert happens. To begin with, you need an SMTP account to send these e-mails.

After getting an SMTP account, you can set up the alert manager in PAI. Please read the document about [service management and paictl](./basic-management-operations.md#pai-service-management-and-paictl) first, and start a dev box container. Then, in the dev box container, pull the configuration by:

```bash
./paictl config pull -o /cluster-configuration
```

Uncomment the alert manager section in `/cluster-configuration/services-configuration.yaml`, and set your SMTP account and the receiver's e-mail address. Here is an example:

```yaml
alert-manager:
  port: 9093
  receiver: <receiver-email-address>
  smtp_auth_password: <smtp-password>
  smtp_auth_username: <smtp-username>
  smtp_from: <smtp-email-address>
  smtp_url: <smtp-server>:<smtp-port>
```

The `port` field is the port of the alert manager; in most cases, you don't need to change it. The `receiver` field is usually set to the administrator's e-mail address, which will receive the alerting e-mails.

Save the configuration file, then push it and start the alert manager with:

```bash
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n alert-manager
```
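
To verify that the alert manager is running, you can list its pod from the dev box container; a minimal check, assuming the pod name contains `alert-manager`:

```bash
# The alert manager pod should reach the Running status
kubectl get pod | grep alert-manager
```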

After the alert manager is successfully started, the receiver will get alerting e-mails from the configured SMTP account. You can also view the alerting information on the webportal (in the top-right corner):

<img src="./imgs/alert-on-webportal.png" width="100%" height="100%" />


## Troubleshooting

### PaiServicePodNotReady Alert

This kind of alert comes from the alert manager and is usually caused by a container being killed by the operator or by the OOM killer. To check whether it was killed by the OOM killer, inspect the node's free memory via Prometheus:

1. Visit the Prometheus web page, usually at `http://<your-pai-master-ip>:9091`.
2. Enter the query `node_memory_MemFree_bytes`.
3. If the free memory dropped to near 0, the container was likely killed by the OOM killer.
4. You can double-check this by logging into the node, running `dmesg`, and looking for the phrase `oom` (see the sketch below).
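
A minimal sketch of step 4 on the node (the exact kernel message wording varies across kernel versions):

```bash
# Look for OOM killer activity in the kernel log, with human-readable timestamps
sudo dmesg -T | grep -i -E "out of memory|oom"
```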

Solutions:

1. Force-remove the unhealthy pod with this command in a terminal:
   `kubectl delete pod pod-name --grace-period=0 --force`
2. Kubernetes will recreate the pod. This operation may block indefinitely because dockerd may not be functioning correctly after an OOM event. If the recreation is blocked for too long, you can log into the node and restart dockerd via `/etc/init.d/docker restart`.

### NodeNotReady Alert

This kind of alert comes from the alert manager and is reported by the watchdog service. Watchdog gets such metrics from the Kubernetes API. An example metric looks like:

```
pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_exporter",memory_pressure="false",name="10.0.0.2",out_of_disk="false",pai_service_name="watchdog",ready="true",scraped_from="watchdog-5ddd945975-kwhpr"}
```

The `name` label indicates which node this metric represents.

If the node's `ready` label has the value `unknown`, the node may have disconnected from the Kubernetes master. This may be due to several reasons:

- Node is down
- Kubelet is down
- Network partition between node and Kubernetes master

First, try to log into the node. If you cannot, and there is no ping response, the node may be down and you should boot it up.

If you can log in to the node, check whether the kubelet is up and running: execute `sudo docker ps` on the node; normally the kubelet container will be listed. The output should look like:

```
a66a171434cc gcr.io/google_containers/hyperkube:v1.9.9 "/hyperkube kubele..." 2 weeks ago Up 2 weeks kubelet
```
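
If the kubelet runs as a container like above, you can read its log through docker; a minimal sketch, assuming the container is named `kubelet` as in the listing:

```bash
# Show the most recent kubelet log lines; the container name may differ on your deployment
sudo docker logs --tail 200 kubelet
```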

Then check the kubelet log to see whether the kubelet can access the Kubernetes API. If you see something like:

```
E0410 04:24:30.663050 2491 kubelet_node_status.go:386] Error updating node status, will retry: error getting node "10.0.0.1": Get http://10.0.1.2:8080/api/v1/nodes/10.0.0.1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
```

This means the node cannot report its status to Kubernetes; Kubernetes therefore marks the node's status as unknown, which triggers the alert.

You should check what caused this connectivity problem.
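
For example, you can probe the API server address that appears in the kubelet log from the affected node; a minimal sketch, using the address `http://10.0.1.2:8080` from the example log above:

```bash
# A healthy API server answers "ok"; a timeout points to a network or firewall problem
curl --max-time 5 http://10.0.1.2:8080/healthz
```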

### NodeFilesystemUsage Alert

This kind of alert comes from the alert manager and monitors the disk space of each server. If the disk usage is greater than 80%, the alert is triggered. Two OpenPAI components may use a lot of disk space: the storage manager and the docker image cache. If there are other programs on the OpenPAI servers, check them as well to make sure the disk usage is not caused outside of OpenPAI.

Solutions:

1. Check the user files on the NFS storage server launched by the storage manager. If you didn't set up a storage manager, skip this step.
2. Check the docker cache. Docker may use too much disk space for caching; it's worth a check (see the sketch after this list).
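
A minimal sketch for locating the disk consumers on the affected server; the `/share` path assumes the default storage manager setup described in this manual:

```bash
# Overall file system usage on this server
df -h
# Disk space used by docker images, containers, and the build cache
sudo docker system df
# If the storage manager runs on this server, see which shares are large
sudo du -sh /share/* 2>/dev/null
```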

### NVIDIA GPU is Not Detected

If you cannot use GPU in your job, please check the following items on the corresponding worker node:

1. The NVIDIA drivers should be installed correctly. Use `nvidia-smi` to confirm.
2. [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) is installed, and configured as the default runtime of docker. Use `docker info -f "{{json .DefaultRuntime}}"` to confirm (see the sketch below).
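
A quick sketch that runs both checks on the worker node:

```bash
# 1. The driver should list the GPUs
nvidia-smi
# 2. "nvidia" should be the default runtime and appear among the registered runtimes
docker info -f "{{json .DefaultRuntime}}"
docker info -f "{{json .Runtimes}}"
```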

If the GPU number shown in webportal is wrong, check the [hivedscheduler and VC configuration](./how-to-set-up-virtual-clusters.md).

### Cannot See Utilization Information

If you cannot see utilization information (e.g. GPU, CPU, and network usage) in the cluster, please check whether the services `prometheus`, `grafana`, `job-exporter`, and `node-exporter` are working.

In detail, you can [exec into a dev box container](./basic-management-operations.md#pai-service-management-and-paictl), then check the service status with `kubectl get pod`. You can view a pod's log with `kubectl logs <pod-name>`. After you fix the problem, you can [restart the whole cluster using paictl](./basic-management-operations.md#pai-service-management-and-paictl).
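
A minimal sketch of this check inside the dev box container; the grep pattern only assumes the pods are named after the services:

```bash
# Find the monitoring pods and make sure they are Running
kubectl get pod | grep -E "prometheus|grafana|job-exporter|node-exporter"
# Inspect the log of a pod that is not healthy (replace <pod-name>)
kubectl logs <pod-name>
```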


### Node is De-allocated and doesn't Appear in Kubernetes System when it Comes Back

Working nodes can be de-allocated if you are using a cloud service and set up PAI on low-priority machines. Usually, if the node is lost temporarily, you can wait until the node comes back. It doesn't need any special care.

However, some cloud service providers not only de-allocate nodes but also remove all disk contents on those nodes. In that case the node cannot rejoin Kubernetes automatically when it comes back. If this is your case, we recommend setting up a crontab job on the dev box node to bring back these nodes periodically.

In [How to Add and Remove Nodes](how-to-add-and-remove-nodes.md), we have described how to add a node. The crontab job doesn't need to do all of those steps; it only needs to add the node back to Kubernetes. It should figure out which nodes have come back but are still considered `NotReady` in Kubernetes, and then run the following command to bring them back:

```bash
ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=${limit_list} -e "@inventory/mycluster/openpai.yml"
```

`${limit_list}` stands for the names of these de-allocated nodes. For example, if the crontab job finds that node `a` and node `b` are available now but are still in `NotReady` status in Kubernetes, it can set `limit_list=a,b`.
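
Below is a minimal sketch of such a periodic script. It assumes it runs on the dev box node, that `kubectl` can reach the cluster, and that a node counts as "back" when it answers ping while Kubernetes still reports it `NotReady`; adapt the paths and the reachability test to your environment:

```bash
#!/bin/bash
# Re-add de-allocated nodes that have come back but are still NotReady in Kubernetes.
cd "$HOME/pai-deploy/kubespray" || exit 1

# Node names that Kubernetes reports as NotReady
not_ready=$(kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}')

limit_list=""
for node in ${not_ready}; do
  # Treat a node as "back" if it answers a single ping
  if ping -c 1 -W 2 "${node}" > /dev/null 2>&1; then
    limit_list="${limit_list:+${limit_list},}${node}"
  fi
done

if [ -n "${limit_list}" ]; then
  ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml \
    --become --become-user=root --limit="${limit_list}" \
    -e "@inventory/mycluster/openpai.yml"
fi
```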
6 changes: 3 additions & 3 deletions docs/manual/cluster-admin/how-to-add-and-remove-nodes.md
@@ -24,9 +24,9 @@ Log in to your dev box machine, find [the pre-kept folder `~/pai-deploy`](./inst

### Add the Nodes into Kubernetes

-Find the file `~/pai-deploy/kubespray/inventory/pai/host.yml`, and follow the steps below to modify it.
+Find the file `~/pai-deploy/kubespray/inventory/pai/hosts.yml`, and follow the steps below to modify it.

-Supposing you want to add 2 worker nodes into your cluster and their hostnames are `a` and `b`. Add these 2 nodes into the `host.yml`. An example:
+Supposing you want to add 2 worker nodes into your cluster and their hostnames are `a` and `b`. Add these 2 nodes into the `hosts.yml`. An example:

```yaml
all:
```

@@ -157,7 +157,7 @@ If you have configured any PV/PVC storage, please confirm the added worker node

Please refer to the operation of adding nodes; the steps are very similar.

-First, modify `host.yml` accordingly, then go into `~/pai-deploy/kubespray/`, run
+First, modify `hosts.yml` accordingly, then go into `~/pai-deploy/kubespray/`, run

```bash
ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=a,b -e "@inventory/mycluster/openpai.yml"
```
154 changes: 153 additions & 1 deletion docs/manual/cluster-admin/how-to-set-up-storage.md
@@ -238,4 +238,156 @@ The GET request must use header `Authorization: Bearer <token>` for authorizatio
}
```

Do not omit any fields in `extension` or it will change the `virtualClusters` setting unexpectedly.

## Example: Use Storage Manager to Create an NFS + SAMBA Server

To help you set up the storage, OpenPAI provides a storage manager, which can set up an NFS + SAMBA server. Inside the cluster, the NFS storage can be accessed in OpenPAI containers. Outside the cluster, users can mount the storage on Unix-like systems, or access it in File Explorer on Windows.

Please read the document about [service management and paictl](./basic-management-operations.md#pai-service-management-and-paictl) first, and start a dev box container. Then, in the dev box container, pull the configuration by:

```bash
./paictl config pull -o /cluster-configuration
```

To use the storage manager, you should first choose a machine in the PAI system to be the storage server. The machine **must** be one of the PAI workers, not the PAI master. Please open `/cluster-configuration/layout.yaml`, choose a worker machine, then add a `pai-storage: "true"` field to it. Here is an example of the edited `layout.yaml`:

```yaml
......

- hostname: worker1
  nodename: worker1
  hostip: 10.0.0.1
  machine-type: GENERIC-WORKER
  pai-worker: "true"
  pai-storage: "true" # this line is newly added

......
```

In this tutorial, we assume you choose the machine with IP `10.0.0.1` as the storage server. Then, in `/cluster-configuration/services-configuration.yaml`, find the storage manager section:

```yaml
# storage-manager:
#   localpath: /share
#   security-type: AUTO
#   workgroup: WORKGROUP
#   smbuser: smbuser
#   smbpwd: smbpwd
```

Uncomment it like:

```yaml
storage-manager:
  localpath: /share
  # security-type: AUTO
  # workgroup: WORKGROUP
  smbuser: smbuser
  smbpwd: smbpwd
```

The `localpath` determines the root data directory for NFS on the storage server. The `smbuser` and `smbpwd` determine the username and password used when you access the storage in File Explorer on Windows.

Follow these commands to start the storage manager:

```bash
./paictl.py service stop -n cluster-configuration storage-manager
./paictl.py config push -p /cluster-configuration -m service
./paictl.py service start -n cluster-configuration storage-manager
```

If the storage manager is successfully started, you will find the folders `/share/data` and `/share/users` on the storage server. On an Ubuntu machine, you can use the following commands to test whether the NFS server is correctly set up:

```bash
# replace 10.0.0.1 with your storage server IP
sudo apt update
sudo apt install nfs-common
sudo mkdir -p /mnt/data
sudo mount -t nfs --options nfsvers=4.1 10.0.0.1:/data/ /mnt/data
```
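
You can also list the exports published by the server; a quick check, assuming `nfs-common` is installed as above:

```bash
# replace 10.0.0.1 with your storage server IP
showmount -e 10.0.0.1
```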

To make the NFS storage available in PAI, we should create a PV and a PVC for it. First, create the following `nfs-storage.yaml` file in the dev box container:

```yaml
# replace 10.0.0.1 with your storage server IP
# NFS Persistent Volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-storage-pv
  labels:
    name: nfs-storage
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - nfsvers=4.1
  nfs:
    path: /data
    server: 10.0.0.1
---
# NFS Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-storage
  # labels:
  #   share: "false" # to mount sub path on PAI
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi # no more than PV capacity
  selector:
    matchLabels:
      name: nfs-storage # corresponding to PV label
```

Use `kubectl create -f nfs-storage.yaml` to create the PV and PVC.
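
You can then confirm that the claim is bound; a minimal check in the dev box container:

```bash
# The PVC should reach the Bound status and reference nfs-storage-pv
kubectl get pv nfs-storage-pv
kubectl get pvc nfs-storage
```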

Since a Kubernetes NFS PV requires the node that uses it to have the corresponding driver, we should run `apt install nfs-common` on every worker node to install the `nfs-common` package.

Finally, [assign storage to PAI groups](#assign-storage-to-pai-groups) through the rest-server API. Then you can mount it into job containers.

To upload data to the storage server: on Windows, open File Explorer, type in `\\10.0.0.1` (replace `10.0.0.1` with your storage server IP), and press ENTER. File Explorer will ask you for authorization; use `smbuser` and `smbpwd` as the username and password to log in. On a Unix-like system, you can mount the NFS folder into the file system. For example, on Ubuntu, use the following commands to mount it:

```bash
# replace 10.0.0.1 with your storage server IP
sudo apt update
sudo apt install nfs-common
sudo mkdir -p /mnt/data
sudo mount -t nfs --options nfsvers=4.1 10.0.0.1:/data/ /mnt/data
```

The above steps only set up a basic SAMBA server, so all users share the same username and password to access it on Windows. If your cluster is in [AAD mode](./how-to-manage-users-and-groups.md#users-and-groups-in-aad-mode) and you want to integrate the SAMBA server with the AAD system, please refer to the following storage manager configuration:

```yaml
storage-manager:
  workgroup: # workgroup
  security-type: ADS
  default_realm: # default realm
  krb5_realms: # realms
    XXX1: # realm name
      kdc: # kdc
      default_domain: # default domain
    XXX2: # realm name
      kdc: # kdc
      default_domain: # default domain
  domain_realm: # domain realm
    kdc: # kdc
    default_domain: # default domain
  domainuser: # domain user
  domainpwd: # password of domain user
  idmap: # idmap
    - "idmap config XXX1"
    - "idmap config XXX2"
    - "idmap config XXX3"
    - "idmap config XXX4"
- "idmap config XXX4"
```
Loading