Add doc for ml runtime examples and spark optimizations (oap-project#924)
jerrychenhf authored Oct 28, 2022
1 parent 628c3a3 commit fafab98
Showing 10 changed files with 146 additions and 36 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -2,14 +2,14 @@

CloudTik is a cloud scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, GCP, and so on.
CloudTik enables users or enterprises to easily create and manage an analytics and AI platform on public clouds,
with out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads in minutes or hours
instead of spending months to construct and optimize the platform.
with out-of-the-box optimized functionality and performance, and to quickly focus on running the business workloads
in hours or even minutes instead of spending months constructing and optimizing the platform.

We target CloudTik to be:
CloudTik provides:
- Scalable, robust, and unified control plane and runtimes for all public clouds
- Out of box optimized runtimes for analytics and AI (Spark, ...)
- Support major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- Fully open architecture and open-sourced solution
- Support of major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- A fully open architecture and open-sourced solution

## Getting Started with CloudTik

9 changes: 5 additions & 4 deletions docs/source/GettingStarted/introduction.md
@@ -5,13 +5,14 @@ of various aspects including cost, performance, complexity and openness. CloudTi

CloudTik is a cloud scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, GCP, and so on.
CloudTik enables users or enterprises to easily create and manage an analytics and AI platform on public clouds,
with out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads in minutes or hours
instead of spending months to construct and optimize the platform.
with out-of-the-box optimized functionality and performance, and to quickly focus on running the business workloads
in hours or even minutes instead of spending months constructing and optimizing the platform.

CloudTik provides:
- Scalable, robust, and unified control plane and runtimes for all public clouds
- Out of box optimized runtimes for analytics and AI (Spark, ...)
- Support major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- Fully open architecture and open-sourced solution
- Support of major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- A fully open architecture and open-sourced solution

CloudTik offers the industry an open-source solution for users to build and run a production-level analytics and
AI platform on the cloud in hours. CloudTik provides a powerful but easy-to-use infrastructure to build and manage
@@ -0,0 +1,8 @@
# Running Machine Learning Examples

CloudTik provides ready-to-run examples demonstrating
how distributed machine learning and deep learning jobs can be implemented
on a CloudTik Spark and ML runtime cluster.

Refer to [Distributed Machine Learning and Deep Learning Examples](https://github.com/oap-project/cloudtik/tree/main/example/ml)
for a detailed step-by-step guide.
@@ -0,0 +1,9 @@
# Running Analytics Benchmarks

CloudTik provides ready-to-use tools for running the TPC-DS benchmark
on a CloudTik Spark runtime cluster.

Refer to [Run TPC-DS performance benchmark for Spark](https://github.com/oap-project/cloudtik/tree/main/tools/benchmarks/spark)
for a detailed step-by-step guide.


@@ -0,0 +1,40 @@
# Spark Optimizations
CloudTik Spark can run on both the upstream Spark and a CloudTik optimized Spark.
The CloudTik optimized Spark implements a number of important optimizations on top of
the upstream Spark and thus provides better performance.

- [Runtime Filter Optimization](#runtime-filter-optimization)
- [Top N Optimization](#top-n-optimization)
- [Size-based Join Reorder Optimization](#size-based-join-reorder-optimization)
- [Distinct Before Intersect Optimization](#distinct-before-intersect-optimization)
- [Flatten Scalar Subquery Optimization](#flatten-scalar-subquery-optimization)
- [Flatten Single Row Aggregate Optimization](#flatten-single-row-aggregate-optimization)

## Runtime Filter Optimization
Row-level runtime filters can improve the performance of some joins by pre-filtering one side (Filter Application Side)
of a join using a Bloom filter or semi-join filter generated from the values on the other side (Filter Creation Side) of the join.
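
As a rough illustration, the following PySpark sketch shows the join shape this optimization targets: a large fact table (filter application side) joined to a dimension table carrying a selective filter (filter creation side). The table paths and column names are made up for the example, and spark.sql.optimizer.runtime.bloomFilter.enabled is the upstream Spark 3.3+ flag, not necessarily the switch used by CloudTik optimized Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-filter-sketch").getOrCreate()

# Assumption: upstream Spark 3.3+ Bloom filter flag; CloudTik optimized Spark
# may control its runtime filter optimization through a different property.
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")

sales = spark.read.parquet("/data/sales")    # large fact table (filter application side)
stores = spark.read.parquet("/data/stores")  # small dimension table (filter creation side)

# The selective filter on the creation side lets the optimizer build a Bloom
# filter over store_id and drop non-matching sales rows before the shuffle.
result = (sales
          .join(stores.where(stores["state"] == "CA"), "store_id")
          .groupBy("store_id")
          .agg({"amount": "sum"}))
result.explain()
```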

## Top N Optimization
For the rank functions (row_number, rank, dense_rank),
the rank of a key computed on a partial dataset is always less than or equal to its final rank computed on the whole dataset,
so it is safe to discard rows whose partial rank is greater than k. Selecting the local top-k records within each partition
and then computing the global top-k reduces the amount of data shuffled.
We introduce a new plan node, RankLimit, to filter out unnecessary rows based on the rank computed on the partial dataset.
You can enable this feature by setting spark.sql.rankLimit.enabled to true.
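
The following is a minimal PySpark sketch of a top-k query shape that RankLimit targets; the input path and the category and score columns are illustrative, and the only property set is the spark.sql.rankLimit.enabled flag named above.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rank-limit-sketch").getOrCreate()
spark.conf.set("spark.sql.rankLimit.enabled", "true")  # enable the RankLimit optimization

df = spark.read.parquet("/data/events")  # illustrative input with category and score columns

# Top-3 rows per category: with RankLimit enabled, rows whose local (partial)
# rank already exceeds 3 can be discarded before the shuffle.
w = Window.partitionBy("category").orderBy(F.col("score").desc())
top3 = (df.withColumn("rn", F.row_number().over(w))
          .where(F.col("rn") <= 3))
top3.show()
```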

## Size-based Join Reorder Optimization
The default behavior in Spark is to join tables from left to right, as listed in the query.
We can improve query performance by reordering joins involving tables with filters.
You can enable this feature by setting the Spark configuration parameter spark.sql.optimizer.sizeBasedJoinReorder.enabled to true.
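
The sketch below, with placeholder tables and columns, lists the large table first and the filtered inputs last; with the flag above enabled, the optimizer may reorder the joins so the smaller, filtered inputs are joined earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-reorder-sketch").getOrCreate()
spark.conf.set("spark.sql.optimizer.sizeBasedJoinReorder.enabled", "true")

orders = spark.read.parquet("/data/orders")        # large table, listed first in the query
customers = spark.read.parquet("/data/customers")  # much smaller after the filter below
items = spark.read.parquet("/data/items")          # much smaller after the filter below

# Written left to right, orders would be joined with customers first; size-based
# join reorder can instead start from the smaller, filtered inputs.
q = (orders
     .join(customers.where(customers["country"] == "US"), "customer_id")
     .join(items.where(items["category"] == "books"), "item_id"))
q.explain()
```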

## Distinct Before Intersect Optimization
This optimization improves joins that use INTERSECT.
Queries using INTERSECT are automatically converted to use a left-semi join.
When this optimization is enabled, the query optimizer estimates whether pushing a DISTINCT operator
down to the children of INTERSECT is beneficial, based on how much duplicated data exists in the left table and the right table.
You can enable it by setting the Spark property spark.sql.optimizer.distinctBeforeIntersect.enabled to true.
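
A short sketch of the query shape, using made-up view names; the only property set is the spark.sql.optimizer.distinctBeforeIntersect.enabled flag named above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-before-intersect-sketch").getOrCreate()
spark.conf.set("spark.sql.optimizer.distinctBeforeIntersect.enabled", "true")

spark.read.parquet("/data/visits_2021").createOrReplaceTempView("visits_2021")
spark.read.parquet("/data/visits_2022").createOrReplaceTempView("visits_2022")

# INTERSECT is rewritten into a left-semi join; with the optimization enabled,
# DISTINCT may be pushed below the join when both inputs contain many duplicates.
common = spark.sql("""
    SELECT user_id FROM visits_2021
    INTERSECT
    SELECT user_id FROM visits_2022
""")
common.explain()
```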

## Flatten Scalar Subquery Optimization

## Flatten Single Row Aggregate Optimization

7 changes: 7 additions & 0 deletions docs/source/UserGuide/running-optimized-ai.rst
@@ -0,0 +1,7 @@
Running Optimized AI
====================

.. toctree::
   :maxdepth: 1

   RunningOptimizedAI/running-ml-examples.md
9 changes: 9 additions & 0 deletions docs/source/UserGuide/running-optimized-analytics.rst
@@ -0,0 +1,9 @@
Running Optimized Analytics with Spark
======================================

.. toctree::
   :maxdepth: 1

   RunningOptimizedAnalytics/spark-optimizations.md
   RunningOptimizedAnalytics/running-analytics-benchmarks.md

19 changes: 9 additions & 10 deletions docs/source/index.rst
@@ -3,17 +3,14 @@ CloudTik

CloudTik is a cloud scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, GCP, and so on.
CloudTik enables users or enterprises to easily create and manage an analytics and AI platform on public clouds,
with out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads in minutes or hours.
with out-of-the-box optimized functionality and performance, and to quickly focus on running the business workloads
in hours or even minutes instead of spending months constructing and optimizing the platform.

- Built upon Cloud compute engines and Cloud storages

- Support major public Cloud providers (AWS, Azure, GCP, and more to come)

- Powerful and Optimized: Out of box and optimized runtimes for Analytics and AI

- Simplified and Unified: Easy to use and unified operate experiences on all Clouds

- Open and Flexible: Open architecture and user in full control, fully open-source and user transparent.
CloudTik provides:
- Scalable, robust, and unified control plane and runtimes for all public clouds
- Out of box optimized runtimes for analytics and AI (Spark, ...)
- Support of major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- A fully open architecture and open-sourced solution

.. toctree::
:maxdepth: 1
@@ -34,6 +31,8 @@ with out-of-box optimized functionalities and performance, and to go quickly to
UserGuide/creating-workspace.md
UserGuide/creating-cluster.md
UserGuide/managing-cluster.md
UserGuide/running-optimized-analytics.rst
UserGuide/running-optimized-ai.rst
UserGuide/advanced-configurations.rst


71 changes: 54 additions & 17 deletions example/ml/README.md
@@ -1,35 +1,72 @@
# Run AI Examples on Cloudtik cluster
# Distributed Machine Learning and Deep Learning Examples

Here we provide a guide for you to run ML/DL-related examples based on the CloudTik ML runtime,
which includes selected ML/DL frameworks and libraries.

Here we provide a guide for you to run some AI related examples using many popular open-source libraries.
We provide these examples in two forms:
1. Python jobs
2. Jupyter notebooks

## Running the examples using Python jobs

#### MLflow HyperOpt Scikit_Learn
### Running Spark Distributed Deep Learning example
This example runs Spark distributed deep learning with Hyperopt, Horovod and TensorFlow.
It trains a simple ConvNet on the MNIST dataset using Keras and Horovod on the CloudTik Spark runtime.

This notebook uses scikit-learn to illustrate a complete end-to-end example of loading data, training a model, distributed hyperparameter tuning, and model inference. It also illustrates how to use MLflow and Model Registry.
Download [Spark Deep Learning with Horovod and Tensorflow](jobs/spark-mlflow-hyperopt-horovod-tensorflow.py)
and execute:
```
cloudtik submit /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-horovod-tensorflow.py -f "your-cloud-storage-fsdir"
```

1. Uploading example notebook and run

Upload example notebook [MLflow_HyperOpt_Scikit-Learn](notebooks/mlflow_hyperopt_scikit_learn.ipynb) to JupyterLab or $HOME/runtime/jupyter.
Replace the cloud storage fsdir with the workspace cloud storage URI or HDFS directory. For example, for S3: "s3a://cloudtik-workspace-bucket".

2. Open this notebook on JupyterLab, and choose the Python 3 kernel to run the notebook.

3. Checking the MLflow Server
### Running Spark Distributed Machine Learning example
This example runs Spark distributed training using scikit-learn.
It illustrates a complete end-to-end example of loading data, training a model, distributed hyperparameter tuning, and model inference
on a CloudTik Spark cluster. It also illustrates how to use MLflow and the Model Registry.

Type the below URL in your browser
Download [Spark Machine Learning with Scikit-Learn](jobs/spark-mlflow-hyperopt-scikit.py)
and execute:
```
http://<head_IP>:5001
cloudtik submit /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-scikit.py -f "your-cloud-storage-fsdir"
```
If the MLflow Server have started, you can see the below UI.

![MLflowUI](images/MLflowUI.png)
Replace the cloud storage fsdir with the workspace cloud storage URI or HDFS directory. For example, for S3: "s3a://cloudtik-workspace-bucket".


## Running the examples using Jupyter notebooks

#### Mlflow HyperOpt Horovod on Spark
### Running Spark Distributed Deep Learning example

This example is to train a simple ConvNet on the MNIST dataset using Keras + Horovod using Cloudtik Spark Runtime.
This notebook example runs Spark distributed deep learning with Hyperopt, Horovod and TensorFlow.
It trains a simple ConvNet on the MNIST dataset using Keras and Horovod on the CloudTik Spark runtime.

1. Upload example notebook [Mlflow_HyperOpt_Horovod-on-Spark](notebooks/mlflow_hyperopt_horovod_on_spark.ipynb) to JupyterLab or $HOME/runtime/jupyter.
1. Upload notebook [Spark Deep Learning with Horovod and Tensorflow](notebooks/spark-mlflow-hyperopt-horovod-tensorflow.ipynb) to JupyterLab.
You can also download the file and use cloudtik rsync-up to copy it to ~/jupyter on the cluster head:

```
cloudtik rsync-up /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-horovod-tensorflow.ipynb ~/jupyter/spark-mlflow-hyperopt-horovod-tensorflow.ipynb
```

2. Open this notebook on JupyterLab, and choose the Python 3 kernel to run the notebook.

3. Optionally, you can check the training experiments and model registry through the MLflow Web UI after the notebook finishes.

### Running Spark Distributed Machine Learning example

This notebook example runs Spark distributed training using scikit-learn.
It illustrates a complete end-to-end example of loading data, training a model, distributed hyperparameter tuning, and model inference
on a CloudTik Spark cluster. It also illustrates how to use MLflow and the Model Registry.

1. Upload notebook [Spark Machine Learning with Scikit-Learn](notebooks/spark-mlflow-hyperopt-scikit.ipynb) to JupyterLab.
You can also download the file and use cloudtik rsync-up to copy it to ~/jupyter on the cluster head:

```
cloudtik rsync-up /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-scikit.ipynb ~/jupyter/spark-mlflow-hyperopt-scikit.ipynb
```

2. Open this notebook on JupyterLab, and choose the Python 3 kernel to run the notebook.

3. Optionally, you can check the training experiments and model registry through the MLflow Web UI after the notebook finishes.
Binary file removed example/ml/images/MLflowUI.png
