update hyperzoo doc and k8s doc (#3959)
* update userguide of k8s

* update k8s guide

* update hyperzoo doc

* Update k8s.md

add note

* Update k8s.md

add note

* Update k8s.md

update notes
Adria777 authored May 20, 2021
1 parent 53a0ec1 commit 28d5789
Showing 2 changed files with 122 additions and 2 deletions.
92 changes: 90 additions & 2 deletions docker/hyperzoo/README.md
@@ -55,7 +55,95 @@ Then pull the image. It will be faster.
sudo docker pull intelanalytics/hyper-zoo:latest
```

2. K8s configuration

Get the k8s master URL to use as the Spark master:

```bash
kubectl cluster-info
```

After running this command, it shows something like "Kubernetes master is running at https://127.0.0.1:12345", which means:

```bash
master="k8s://https://127.0.0.1:12345"
```

The namespace is `default`, or the one specified by `spark.kubernetes.namespace`.

RBAC:

```bash
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
```
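
The master URL, namespace, and the `spark` service account created above are what a Spark submission will reference later. As an optional smoke test, something like the following can verify the setup (the examples jar path and image tag are assumptions; adjust them to your environment):

```bash
# Illustrative only: submit the built-in SparkPi example to check that the
# master URL, namespace, and service account work together
${SPARK_HOME}/bin/spark-submit \
  --master k8s://https://127.0.0.1:12345 \
  --deploy-mode cluster \
  --name spark-pi-test \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=intelanalytics/hyper-zoo:latest \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar
```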

View the k8s configuration file:

```
.kube/config
```

or export a flattened copy:

```bash
kubectl config view --flatten --minify > kuberconfig
```
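
You can point `kubectl` at the exported file to confirm it works on its own (the file name follows the command above):

```bash
# Verify the flattened config is usable outside of ~/.kube/config
export KUBECONFIG=$(pwd)/kuberconfig
kubectl get nodes
```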

The k8s data can be stored in NFS or Ceph; take NFS as an example.

On the NFS server, run:

```bash
yum install nfs-utils
systemctl enable rpcbind
systemctl enable nfs
systemctl start rpcbind
firewall-cmd --zone=public --permanent --add-service={rpc-bind,mountd,nfs}
firewall-cmd --reload
mkdir /disk1/nfsdata
chmod 755 /disk1/nfsdata
# add the following line to /etc/exports, then save:
# /disk1/nfsdata *(rw,sync,no_root_squash,no_all_squash)
nano /etc/exports
systemctl restart nfs
```

On the NFS client, run:

```bash
yum install -y nfs-utils && systemctl start rpcbind && showmount -e <nfs-master-ip-address>
```
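
Optionally, mount the export by hand once to confirm the client can reach it (the mount point below is an arbitrary example):

```bash
# Optional manual check: mount the export, write a test file, then clean up
mkdir -p /mnt/nfs-test
mount -t nfs <nfs-master-ip-address>:/disk1/nfsdata /mnt/nfs-test
touch /mnt/nfs-test/test-file && ls /mnt/nfs-test
umount /mnt/nfs-test
```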

Deploy the NFS client provisioner in k8s:

```bash
git clone https://github.com/kubernetes-incubator/external-storage.git
cd /XXX/external-storage/nfs-client
# set the NFS server address and export path in the provisioner deployment
nano deploy/deployment.yaml
# adjust deploy/rbac.yaml if your namespace is not "default"
nano deploy/rbac.yaml
kubectl create -f deploy/rbac.yaml
kubectl create -f deploy/deployment.yaml
kubectl create -f deploy/class.yaml
```
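
Before moving on, it is worth checking that the provisioner pod is running and that the storage class from `deploy/class.yaml` has been registered (the pod name pattern assumes the default names in that repository):

```bash
# Check the provisioner pod and the registered storage class
kubectl get pods | grep nfs-client-provisioner
kubectl get storageclass
```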

Test the provisioner:

```bash
kubectl create -f deploy/test-claim.yaml
kubectl create -f deploy/test-pod.yaml
kubectl get pvc
kubectl delete -f deploy/test-pod.yaml
kubectl delete -f deploy/test-claim.yaml
```

If the test succeeds, run:

```bash
kubectl create -f deploy/nfs-volume-claim.yaml
```
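
`deploy/nfs-volume-claim.yaml` is a claim you provide yourself rather than a file shipped with the repository; a minimal sketch of what it might contain, assuming the default storage class name `managed-nfs-storage` from `deploy/class.yaml` and the claim name `nfsvolumeclaim` used by the spark-submit examples in this guide:

```bash
# Hypothetical claim definition; the storage class name and size are assumptions
cat <<EOF > deploy/nfs-volume-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfsvolumeclaim
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
EOF
```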

3. Launch a k8s client container:

Please note the two different containers: the **client container** is for users to submit zoo jobs, since it contains all the required env and libs except hadoop/k8s configs; the **executor container** does not need to be created manually, as it is scheduled by k8s at runtime.

@@ -313,4 +401,4 @@ ${SPARK_HOME}/bin/spark-submit \
--conf "spark.driver.extraJavaOptions=-Dbigdl.engineType=mklblas" \
--class com.intel.analytics.zoo.serving.ClusterServing \
local:/opt/analytics-zoo-0.8.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.10.0-spark_2.4.3-0.8.0-SNAPSHOT-jar-with-dependencies.jar
```
32 changes: 32 additions & 0 deletions docs/readthedocs/source/doc/UserGuide/k8s.md
@@ -125,6 +125,26 @@ init_orca_context(cluster_mode="k8s", master="k8s://https://<k8s-apiserver-host>

Execute `python script.py` to run your program on the k8s cluster directly.

**Note**: The k8s client and cluster modes do not support downloading files to local, the logging callback, the TensorBoard callback, etc. If you have these requirements, it's a good idea to use a network file system (NFS).

**Note**: K8s will delete the pod once the worker fails, in both client mode and cluster mode. If you want to keep the contents of the worker log, you can set a "temp-dir" to change the log directory. Please note that in this case you should set num-nodes to 1 if you use a network file system (NFS); otherwise it would cause an error, because the temp-dir and the NFS mount do not point to the same directory.

```python
init_orca_context(..., extra_params = {"temp-dir": "/tmp/ray/"})
```
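
With the temp-dir redirected as above, the preserved worker logs can be browsed from that directory (or from the NFS mount, if temp-dir points there); a minimal sketch, assuming Ray's default session layout:

```bash
# Browse preserved Ray worker logs under the redirected temp-dir
# (the session directory name varies per run)
ls /tmp/ray/
ls /tmp/ray/session_*/logs/
```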

If you train with more than 1 executor and use NFS, please remove `extra_params = {"temp-dir": "/tmp/ray/"}`, because there would be conflicts if multiple executors write files to the same directory at the same time, which may cause a JSONDecodeError.

**Note**: If you train with more than 1 executor, please make sure you set proper "steps_per_epoch" and "validation_steps".

**Note**: "spark.kubernetes.container.image.pullPolicy" needs to be specified as "always"

**Note**: if "RayActorError" occurs, try to increase the memory

```python
init_orca_context(..., memory="10g", extra_executor_memory_for_ray="100g")
```

#### **3.2 K8s cluster mode**

For k8s [cluster mode](https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#cluster-mode), you can call `init_orca_context` and specify cluster_mode to be "spark-submit" in your python script (e.g. in script.py):
@@ -151,6 +171,18 @@ ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
file:///path/script.py
```

**Note**: You should specify the NFS volume for both the Spark driver and the Spark executor when you use NFS:

```bash
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
--... ...\
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
file:///path/script.py
```
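
After submitting, you can check that the claim is actually mounted in the driver pod (the label selector relies on the `spark-role` label that Spark on k8s attaches to its pods; the pod name is a placeholder):

```bash
# Find the driver pod, then confirm the NFS claim is mounted at /zoo
kubectl get pods -l spark-role=driver
kubectl describe pod <driver-pod-name> | grep -A 5 "Mounts:"
```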

#### **3.3 Run Jupyter Notebooks**

After a Docker container is launched and user login into the container, you can start the Jupyter Notebook service inside the container.
