update hyperzoo doc and k8s doc (#3959)
* update userguide of k8s

* update k8s guide

* update hyperzoo doc

* Update k8s.md

add note

* Update k8s.md

add note

* Update k8s.md

update notes
Adria777 authored May 20, 2021
1 parent 53a0ec1 commit 28d5789
Showing 2 changed files with 122 additions and 2 deletions.
92 changes: 90 additions & 2 deletions docker/hyperzoo/README.md
@@ -55,7 +55,95 @@ Then pull the image. It will be faster.
sudo docker pull intelanalytics/hyper-zoo:latest
```

2. K8s configuration

Get the k8s master URL to use as the Spark master:

```bash
kubectl cluster-info
```

After running this command, it shows something like "Kubernetes master is running at https://127.0.0.1:12345", which means:

```bash
master="k8s://https://127.0.0.1:12345"
```

The namespace is `default`, or the one specified by `spark.kubernetes.namespace`.

RBAC:

```bash
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
```
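
The master URL, namespace, and the `spark` service account created above are what a Spark submission will reference later. As an optional smoke test, something like the following can verify the setup (the examples jar path and image tag are assumptions; adjust them to your environment):

```bash
# Illustrative only: submit the built-in SparkPi example to check that the
# master URL, namespace, and service account work together
${SPARK_HOME}/bin/spark-submit \
  --master k8s://https://127.0.0.1:12345 \
  --deploy-mode cluster \
  --name spark-pi-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=intelanalytics/hyper-zoo:latest \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar
```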

View the k8s configuration file:

```
.kube/config
```

or export a flattened copy:

```bash
kubectl config view --flatten --minify > kuberconfig
```
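
You can point `kubectl` at the exported file to confirm it works on its own (the file name follows the command above):

```bash
# Verify the flattened config is usable outside of ~/.kube/config
export KUBECONFIG=$(pwd)/kuberconfig
kubectl get nodes
```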

The k8s data can be stored in NFS or Ceph; take NFS as an example.

On the NFS server, run:

```bash
yum install nfs-utils
systemctl enable rpcbind
systemctl enable nfs
systemctl start rpcbind
firewall-cmd --zone=public --permanent --add-service={rpc-bind,mountd,nfs}
firewall-cmd --reload
mkdir /disk1/nfsdata
chmod 755 /disk1/nfsdata
# add the following line to /etc/exports, then save:
# /disk1/nfsdata *(rw,sync,no_root_squash,no_all_squash)
nano /etc/exports
systemctl restart nfs
```

On the NFS client, run:

```bash
yum install -y nfs-utils && systemctl start rpcbind && showmount -e <nfs-master-ip-address>
```
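
Optionally, mount the export by hand once to confirm the client can reach it (the mount point below is an arbitrary example):

```bash
# Optional manual check: mount the export, write a test file, then clean up
mkdir -p /mnt/nfs-test
mount -t nfs <nfs-master-ip-address>:/disk1/nfsdata /mnt/nfs-test
touch /mnt/nfs-test/test-file && ls /mnt/nfs-test
umount /mnt/nfs-test
```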

Deploy the NFS client provisioner in k8s:

```bash
git clone https://github.com/kubernetes-incubator/external-storage.git
cd /XXX/external-storage/nfs-client
# set the NFS server address and export path in the provisioner deployment
nano deploy/deployment.yaml
# adjust deploy/rbac.yaml if your namespace is not "default"
nano deploy/rbac.yaml
kubectl create -f deploy/rbac.yaml
kubectl create -f deploy/deployment.yaml
kubectl create -f deploy/class.yaml
```
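
Before moving on, it is worth checking that the provisioner pod is running and that the storage class from `deploy/class.yaml` has been registered (the pod name pattern assumes the default names in that repository):

```bash
# Check the provisioner pod and the registered storage class
kubectl get pods | grep nfs-client-provisioner
kubectl get storageclass
```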

Test the provisioner:

```bash
kubectl create -f deploy/test-claim.yaml
kubectl create -f deploy/test-pod.yaml
kubectl get pvc
kubectl delete -f deploy/test-pod.yaml
kubectl delete -f deploy/test-claim.yaml
```

If the test succeeds, run:

```bash
kubectl create -f deploy/nfs-volume-claim.yaml
```
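
`deploy/nfs-volume-claim.yaml` is a claim you provide yourself rather than a file shipped with the repository; a minimal sketch of what it might contain, assuming the default storage class name `managed-nfs-storage` from `deploy/class.yaml` and the claim name `nfsvolumeclaim` used by the spark-submit examples in this guide:

```bash
# Hypothetical claim definition; the storage class name and size are assumptions
cat <<EOF > deploy/nfs-volume-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfsvolumeclaim
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
EOF
```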

3. Launch a k8s client container:

Please note the two different containers: the **client container** is for users to submit zoo jobs, since it contains all the required env and libs except hadoop/k8s configs; the **executor container** does not need to be created manually, as it is scheduled by k8s at runtime.

@@ -313,4 +401,4 @@ ${SPARK_HOME}/bin/spark-submit \
--conf "spark.driver.extraJavaOptions=-Dbigdl.engineType=mklblas" \
--class com.intel.analytics.zoo.serving.ClusterServing \
local:/opt/analytics-zoo-0.8.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.8.0-SNAPSHOT-jar-with-dependencies.jar
```
32 changes: 32 additions & 0 deletions docs/readthedocs/source/doc/UserGuide/k8s.md
@@ -125,6 +125,26 @@ init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>

Execute `python script.py` to run your program on the k8s cluster directly.

**Note**: The k8s client and cluster modes do not support downloading files to local, the logging callback, the TensorBoard callback, etc. If you have these requirements, it's a good idea to use a network file system (NFS).

**Note**: K8s will delete the pod once the worker fails, in both client mode and cluster mode. If you want to keep the contents of the worker log, you can set a "temp-dir" to change the log directory. Please note that in this case you should set num-nodes to 1 if you use a network file system (NFS); otherwise it would cause an error, because the temp-dir and the NFS mount do not point to the same directory.

```python
init_orca_context(..., extra_params = {"temp-dir": "/tmp/ray/"})
```
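
With the temp-dir redirected as above, the preserved worker logs can be browsed from that directory (or from the NFS mount, if temp-dir points there); a minimal sketch, assuming Ray's default session layout:

```bash
# Browse preserved Ray worker logs under the redirected temp-dir
# (the session directory name varies per run)
ls /tmp/ray/
ls /tmp/ray/session_*/logs/
```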

If you train with more than 1 executor and use NFS, please remove `extra_params = {"temp-dir": "/tmp/ray/"}`, because there would be conflicts if multiple executors write files to the same directory at the same time, which may cause a JSONDecodeError.

**Note**: If you train with more than 1 executor, please make sure you set proper "steps_per_epoch" and "validation_steps".

**Note**: "spark.kubernetes.container.image.pullPolicy" needs to be specified as "always"

**Note**: if "RayActorError" occurs, try to increase the memory

```python
init_orca_context(..., memory="10g", extra_executor_memory_for_ray="100g")
```

#### **3.2 K8s cluster mode**

For k8s [cluster mode](https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your python script (e.g. in script.py):
@@ -151,6 +171,18 @@ ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
file:///path/script.py
```

**Note**: You should specify the NFS volume for both the Spark driver and the Spark executor when you use NFS:

```bash
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
--... ...\
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
file:///path/script.py
```
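
After submitting, you can check that the claim is actually mounted in the driver pod (the label selector relies on the `spark-role` label that Spark on k8s attaches to its pods; the pod name is a placeholder):

```bash
# Find the driver pod, then confirm the NFS claim is mounted at /zoo
kubectl get pods -l spark-role=driver
kubectl describe pod <driver-pod-name> | grep -A 5 "Mounts:"
```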

#### **3.3 Run Jupyter Notebooks**

After a Docker container is launched and user login into the container, you can start the Jupyter Notebook service inside the container.
