Add doc for ml runtime examples and spark optimizations (oap-project#924)
jerrychenhf authored Oct 28, 2022
1 parent 628c3a3 commit fafab98
Showing 10 changed files with 146 additions and 36 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -2,14 +2,14 @@

CloudTik is a cloud scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, GCP, and so on.
CloudTik enables users or enterprises to easily create and manage an analytics and AI platform on public clouds,
with out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads in minutes or hours
instead of spending months to construct and optimize the platform.
with out-of-the-box optimized functionality and performance, and to quickly focus on running the business workloads
in hours or even minutes instead of spending months constructing and optimizing the platform.

We target CloudTik to be:
CloudTik provides:
- Scalable, robust, and unified control plane and runtimes for all public clouds
- Out of box optimized runtimes for analytics and AI (Spark, ...)
- Support major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- Fully open architecture and open-sourced solution
- Support of major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- A fully open architecture and open-sourced solution

## Getting Started with CloudTik

9 changes: 5 additions & 4 deletions docs/source/GettingStarted/introduction.md
@@ -5,13 +5,14 @@ of various aspects including cost, performance, complexity and openness. CloudTi

CloudTik is a cloud scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, GCP, and so on.
CloudTik enables users or enterprises to easily create and manage an analytics and AI platform on public clouds,
with out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads in minutes or hours
instead of spending months to construct and optimize the platform.
with out-of-the-box optimized functionality and performance, and to quickly focus on running the business workloads
in hours or even minutes instead of spending months constructing and optimizing the platform.

CloudTik provides:
- Scalable, robust, and unified control plane and runtimes for all public clouds
- Out of box optimized runtimes for analytics and AI (Spark, ...)
- Support major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- Fully open architecture and open-sourced solution
- Support of major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- A fully open architecture and open-sourced solution

CloudTik offers the industry an open-source solution for users to build and run a production-level analytics and
AI platform on the cloud in hours. CloudTik provides a powerful but easy-to-use infrastructure to build and manage
@@ -0,0 +1,8 @@
# Running Machine Learning Examples

CloudTik provides ready-to-run examples demonstrating
how distributed machine learning and deep learning jobs can be implemented
on a CloudTik Spark and ML runtime cluster.

Refer to [Distributed Machine Learning and Deep Learning Examples](https://github.com/oap-project/cloudtik/tree/main/example/ml)
for a detailed step-by-step guide.
@@ -0,0 +1,9 @@
# Running Analytics Benchmarks

CloudTik provides ready-to-use tools for running the TPC-DS benchmark
on a CloudTik Spark runtime cluster.

Refer to [Run TPC-DS performance benchmark for Spark](https://github.com/oap-project/cloudtik/tree/main/tools/benchmarks/spark)
for a detailed step-by-step guide.


@@ -0,0 +1,40 @@
# Spark Optimizations
CloudTik Spark can run on both the upstream Spark and a CloudTik optimized Spark.
The CloudTik optimized Spark implements a number of important optimizations on top of
the upstream Spark and thus provides better performance.

- [Runtime Filter Optimization](#runtime-filter-optimization)
- [Top N Optimization](#top-n-optimization)
- [Size-based Join Reorder Optimization](#size-based-join-reorder-optimization)
- [Distinct Before Intersect Optimization](#distinct-before-intersect-optimization)
- [Flatten Scalar Subquery Optimization](#flatten-scalar-subquery-optimization)
- [Flatten Single Row Aggregate Optimization](#flatten-single-row-aggregate-optimization)

## Runtime Filter Optimization
Row-level runtime filters can improve the performance of some joins by pre-filtering one side (Filter Application Side)
of a join using a Bloom filter or semi-join filter generated from the values on the other side (Filter Creation Side) of the join.
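
As a rough illustration, the following PySpark sketch shows the join shape this optimization targets: a large fact table (filter application side) joined to a dimension table carrying a selective filter (filter creation side). The table paths and column names are made up for the example, and spark.sql.optimizer.runtime.bloomFilter.enabled is the upstream Spark 3.3+ flag, not necessarily the switch used by CloudTik optimized Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime-filter-sketch").getOrCreate()

# Assumption: upstream Spark 3.3+ Bloom filter flag; CloudTik optimized Spark
# may control its runtime filter optimization through a different property.
spark.conf.set("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")

sales = spark.read.parquet("/data/sales")    # large fact table (filter application side)
stores = spark.read.parquet("/data/stores")  # small dimension table (filter creation side)

# The selective filter on the creation side lets the optimizer build a Bloom
# filter over store_id and drop non-matching sales rows before the shuffle.
result = (sales
          .join(stores.where(stores["state"] == "CA"), "store_id")
          .groupBy("store_id")
          .agg({"amount": "sum"}))
result.explain()
```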

## Top N Optimization
For the rank functions (row_number, rank, dense_rank),
the rank of a key computed on a partial dataset is always less than or equal to its final rank computed on the whole dataset,
so it is safe to discard rows whose partial rank is greater than k. Selecting the local top-k records within each partition
and then computing the global top-k reduces the amount of data shuffled.
We introduce a new plan node, RankLimit, to filter out unnecessary rows based on the rank computed on the partial dataset.
You can enable this feature by setting spark.sql.rankLimit.enabled to true.
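
The following is a minimal PySpark sketch of a top-k query shape that RankLimit targets; the input path and the category and score columns are illustrative, and the only property set is the spark.sql.rankLimit.enabled flag named above.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rank-limit-sketch").getOrCreate()
spark.conf.set("spark.sql.rankLimit.enabled", "true")  # enable the RankLimit optimization

df = spark.read.parquet("/data/events")  # illustrative input with category and score columns

# Top-3 rows per category: with RankLimit enabled, rows whose local (partial)
# rank already exceeds 3 can be discarded before the shuffle.
w = Window.partitionBy("category").orderBy(F.col("score").desc())
top3 = (df.withColumn("rn", F.row_number().over(w))
          .where(F.col("rn") <= 3))
top3.show()
```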

## Size-based Join Reorder Optimization
The default behavior in Spark is to join tables from left to right, as listed in the query.
We can improve query performance by reordering joins involving tables with filters.
You can enable this feature by setting the Spark configuration parameter spark.sql.optimizer.sizeBasedJoinReorder.enabled to true.
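
The sketch below, with placeholder tables and columns, lists the large table first and the filtered inputs last; with the flag above enabled, the optimizer may reorder the joins so the smaller, filtered inputs are joined earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-reorder-sketch").getOrCreate()
spark.conf.set("spark.sql.optimizer.sizeBasedJoinReorder.enabled", "true")

orders = spark.read.parquet("/data/orders")        # large table, listed first in the query
customers = spark.read.parquet("/data/customers")  # much smaller after the filter below
items = spark.read.parquet("/data/items")          # much smaller after the filter below

# Written left to right, orders would be joined with customers first; size-based
# join reorder can instead start from the smaller, filtered inputs.
q = (orders
     .join(customers.where(customers["country"] == "US"), "customer_id")
     .join(items.where(items["category"] == "books"), "item_id"))
q.explain()
```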

## Distinct Before Intersect Optimization
This optimization improves joins that use INTERSECT.
Queries using INTERSECT are automatically converted to use a left-semi join.
When this optimization is enabled, the query optimizer estimates whether pushing a DISTINCT operator
down to the children of INTERSECT is beneficial, based on how much duplicated data exists in the left table and the right table.
You can enable it by setting the Spark property spark.sql.optimizer.distinctBeforeIntersect.enabled to true.
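
A short sketch of the query shape, using made-up view names; the only property set is the spark.sql.optimizer.distinctBeforeIntersect.enabled flag named above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distinct-before-intersect-sketch").getOrCreate()
spark.conf.set("spark.sql.optimizer.distinctBeforeIntersect.enabled", "true")

spark.read.parquet("/data/visits_2021").createOrReplaceTempView("visits_2021")
spark.read.parquet("/data/visits_2022").createOrReplaceTempView("visits_2022")

# INTERSECT is rewritten into a left-semi join; with the optimization enabled,
# DISTINCT may be pushed below the join when both inputs contain many duplicates.
common = spark.sql("""
    SELECT user_id FROM visits_2021
    INTERSECT
    SELECT user_id FROM visits_2022
""")
common.explain()
```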

## Flatten Scalar Subquery Optimization

## Flatten Single Row Aggregate Optimization

7 changes: 7 additions & 0 deletions docs/source/UserGuide/running-optimized-ai.rst
@@ -0,0 +1,7 @@
Running Optimized AI
====================

.. toctree::
   :maxdepth: 1

   RunningOptimizedAI/running-ml-examples.md
9 changes: 9 additions & 0 deletions docs/source/UserGuide/running-optimized-analytics.rst
@@ -0,0 +1,9 @@
Running Optimized Analytics with Spark
======================================

.. toctree::
   :maxdepth: 1

   RunningOptimizedAnalytics/spark-optimizations.md
   RunningOptimizedAnalytics/running-analytics-benchmarks.md

19 changes: 9 additions & 10 deletions docs/source/index.rst
@@ -3,17 +3,14 @@ CloudTik

CloudTik is a cloud scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, GCP, and so on.
CloudTik enables users or enterprises to easily create and manage an analytics and AI platform on public clouds,
with out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads in minutes or hours.
with out-of-the-box optimized functionality and performance, and to quickly focus on running the business workloads
in hours or even minutes instead of spending months constructing and optimizing the platform.

- Built upon Cloud compute engines and Cloud storages

- Support major public Cloud providers (AWS, Azure, GCP, and more to come)

- Powerful and Optimized: Out of box and optimized runtimes for Analytics and AI

- Simplified and Unified: Easy to use and unified operate experiences on all Clouds

- Open and Flexible: Open architecture and user in full control, fully open-source and user transparent.
CloudTik provides:
- Scalable, robust, and unified control plane and runtimes for all public clouds
- Out of box optimized runtimes for analytics and AI (Spark, ...)
- Support of major public cloud providers - AWS, Azure, GCP, Kubernetes (EKS, AKS, and GKE) and more
- A fully open architecture and open-sourced solution

.. toctree::
:maxdepth: 1
@@ -34,6 +31,8 @@ with out-of-box optimized functionalities and performance, and to go quickly to
UserGuide/creating-workspace.md
UserGuide/creating-cluster.md
UserGuide/managing-cluster.md
UserGuide/running-optimized-analytics.rst
UserGuide/running-optimized-ai.rst
UserGuide/advanced-configurations.rst


71 changes: 54 additions & 17 deletions example/ml/README.md
@@ -1,35 +1,72 @@
# Run AI Examples on Cloudtik cluster
# Distributed Machine Learning and Deep Learning Examples

Here we provide a guide for you to run ML/DL-related examples based on the CloudTik ML runtime,
which includes selected ML/DL frameworks and libraries.

Here we provide a guide for you to run some AI related examples using many popular open-source libraries.
We provide these examples in two forms:
1. Python jobs
2. Jupyter notebooks

## Running the examples using Python jobs

#### MLflow HyperOpt Scikit_Learn
### Running Spark Distributed Deep Learning example
This example runs Spark distributed deep learning with Hyperopt, Horovod and TensorFlow.
It trains a simple ConvNet on the MNIST dataset using Keras and Horovod on the CloudTik Spark runtime.

This notebook uses scikit-learn to illustrate a complete end-to-end example of loading data, training a model, distributed hyperparameter tuning, and model inference. It also illustrates how to use MLflow and Model Registry.
Download [Spark Deep Learning with Horovod and Tensorflow](jobs/spark-mlflow-hyperopt-horovod-tensorflow.py)
and execute:
```
cloudtik submit /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-horovod-tensorflow.py -f "your-cloud-storage-fsdir"
```

1. Uploading example notebook and run

Upload example notebook [MLflow_HyperOpt_Scikit-Learn](notebooks/mlflow_hyperopt_scikit_learn.ipynb) to JupyterLab or $HOME/runtime/jupyter.
Replace the cloud storage fsdir with the workspace cloud storage URI or HDFS directory. For example, for S3: "s3a://cloudtik-workspace-bucket".

2. Open this notebook on JupyterLab, and choose the Python 3 kernel to run the notebook.

3. Checking the MLflow Server
### Running Spark Distributed Machine Learning example
This example runs Spark distributed training using scikit-learn.
It illustrates a complete end-to-end example of loading data, training a model, distributed hyperparameter tuning, and model inference
on a CloudTik Spark cluster. It also illustrates how to use MLflow and the Model Registry.

Type the below URL in your browser
Download [Spark Machine Learning with Scikit-Learn](jobs/spark-mlflow-hyperopt-scikit.py)
and execute:
```
http://<head_IP>:5001
cloudtik submit /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-scikit.py -f "your-cloud-storage-fsdir"
```
If the MLflow Server have started, you can see the below UI.

![MLflowUI](images/MLflowUI.png)
Replace the cloud storage fsdir with the workspace cloud storage URI or HDFS directory. For example, for S3: "s3a://cloudtik-workspace-bucket".


## Running the examples using Jupyter notebooks

#### Mlflow HyperOpt Horovod on Spark
### Running Spark Distributed Deep Learning example

This example is to train a simple ConvNet on the MNIST dataset using Keras + Horovod using Cloudtik Spark Runtime.
This notebook example runs Spark distributed deep learning with Hyperopt, Horovod and TensorFlow.
It trains a simple ConvNet on the MNIST dataset using Keras and Horovod on the CloudTik Spark runtime.

1. Upload example notebook [Mlflow_HyperOpt_Horovod-on-Spark](notebooks/mlflow_hyperopt_horovod_on_spark.ipynb) to JupyterLab or $HOME/runtime/jupyter.
1. Upload notebook [Spark Deep Learning with Horovod and Tensorflow](notebooks/spark-mlflow-hyperopt-horovod-tensorflow.ipynb) to JupyterLab.
You can also download the file and use cloudtik rsync-up to copy it to ~/jupyter on the cluster head:

```
cloudtik rsync-up /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-horovod-tensorflow.ipynb ~/jupyter/spark-mlflow-hyperopt-horovod-tensorflow.ipynb
```

2. Open this notebook on JupyterLab, and choose the Python 3 kernel to run the notebook.

3. Optionally, you can check the training experiments and model registry through the MLflow Web UI after the notebook finishes.

### Running Spark Distributed Machine Learning example

This notebook example runs Spark distributed training using scikit-learn.
It illustrates a complete end-to-end example of loading data, training a model, distributed hyperparameter tuning, and model inference
on a CloudTik Spark cluster. It also illustrates how to use MLflow and the Model Registry.

1. Upload notebook [Spark Machine Learning with Scikit-Learn](notebooks/spark-mlflow-hyperopt-scikit.ipynb) to JupyterLab.
You can also download the file and use cloudtik rsync-up to copy it to ~/jupyter on the cluster head:

```
cloudtik rsync-up /path/to/your-cluster-config.yaml local-download-path/spark-mlflow-hyperopt-scikit.ipynb ~/jupyter/spark-mlflow-hyperopt-scikit.ipynb
```

2. Open this notebook on JupyterLab, and choose the Python 3 kernel to run the notebook.

3. Optionally, you can check the training experiments and model registry through the MLflow Web UI after the notebook finishes.
Binary file removed example/ml/images/MLflowUI.png
