This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Commit

Update the doc with the current version and update the tpcds guide (#809)
jerrychenhf authored Sep 2, 2022
1 parent b910d63 commit f8cc01c
Showing 5 changed files with 78 additions and 27 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -21,6 +21,6 @@ jobs:
- name: build wheel
run: |
bash ./build.sh
curl -X PUT --upload-file ./python/dist/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl http://23.95.96.95:8000/$GITHUB_ACTOR/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl
curl -X PUT --upload-file ./python/dist/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl http://23.95.96.95:8000/$GITHUB_ACTOR/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl
21 changes: 14 additions & 7 deletions README.md
@@ -40,19 +40,19 @@ Take AWS for example,

```
# if running CloudTik on aws
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl"
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl"
```

Replace `cloudtik[aws]` with `cloudtik[azure]` or `cloudtik[gcp]` if you want to create clusters on Azure or GCP.
Use `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.

You can install the latest CloudTik wheels via the following links. These daily releases do not go through the full release process.
The following table shows the installation links for the latest CloudTik wheels for all supported Python versions.

| Linux | Installation |
|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl" ` |
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl" ` |
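The wheel links in the table differ only in the CPython ABI tag embedded in the filename. As a side note, a small Python sketch (the base URL, version, and tags below simply mirror the table rows above) shows how the link for a given Python version is assembled:

```python
# Sketch: assemble the CloudTik wheel URL for a given Python version.
# The base URL, version, and ABI tags simply mirror the table above.
BASE = "https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik"
ABI_TAGS = {"3.7": "cp37-cp37m", "3.8": "cp38-cp38", "3.9": "cp39-cp39"}

def wheel_url(python_version: str, cloudtik_version: str = "0.9.1") -> str:
    tag = ABI_TAGS[python_version]
    return f"{BASE}/cloudtik-{cloudtik_version}-{tag}-manylinux2014_x86_64.whl"

print(wheel_url("3.8"))
```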


### 3. Authentication to Cloud Providers API
@@ -88,6 +88,7 @@ variable as described in the [Setting the environment variable](https://cloud.go
on your working machine.

### 4. Creating a Workspace for Clusters.
Once you have authenticated with your cloud provider, you can start creating a Workspace.

CloudTik uses **Workspace** concept to easily manage shared Cloud resources such as VPC network resources,
identity and role resources, firewall or security groups, and cloud storage resources.
@@ -113,7 +114,7 @@ provider:
- 0.0.0.0/0
```
*NOTE:* `0.0.0.0/0` in `allowed_ssh_sources` will allow any IP addresses to connect to your cluster as long as it has the cluster private key.
For more security, make sure to change from `0.0.0.0/0` to restricted CIDR ranges for your case.
For better security, change `0.0.0.0/0` to CIDR ranges restricted to your own networks.
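To see why `0.0.0.0/0` is wide open, here is a quick check with Python's standard `ipaddress` module (the IP addresses below are arbitrary documentation examples, not values from CloudTik):

```python
import ipaddress

# Sketch: 0.0.0.0/0 matches every IPv4 address, while a restricted
# CIDR range only admits addresses inside that range.
def ip_allowed(ip: str, cidr: str) -> bool:
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)

print(ip_allowed("203.0.113.9", "0.0.0.0/0"))        # any address is allowed
print(ip_allowed("203.0.113.9", "198.51.100.0/24"))  # outside the range
```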

Use the following command to create and provision a Workspace:

Expand All @@ -123,7 +124,12 @@ cloudtik workspace create /path/to/your-workspace-config.yaml

Check `example/cluster` folder for more Workspace configuration file examples.

### 5. Starting a cluster with default runtimes
If you encounter problems creating a Workspace, a common cause is that your current cloud login account
doesn't have sufficient privileges to create some resources such as VPCs, storage, and public IP addresses.
Make sure your current account has sufficient privileges; an admin or owner role gives the best chance of
having all of them.

### 5. Starting a cluster with Spark runtime

Now you can start a cluster running Spark by default:

@@ -173,6 +179,7 @@ auth:
```

The cluster key will be created automatically for AWS and GCP if not specified.
The generated private key file can be found in the `.ssh` folder of your home directory.
For Azure, you need to generate an RSA key pair manually (use `ssh-keygen -t rsa -b 4096` to generate a new SSH key pair)
and configure the public and private keys as follows:

19 changes: 9 additions & 10 deletions docs/source/GettingStarted/quick-start.md
@@ -27,20 +27,13 @@ Take AWS for example,

```
# if running CloudTik on aws
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl"
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl"
```

Replace `cloudtik[aws]` with `cloudtik[azure]` or `cloudtik[gcp]` if you want to create clusters on Azure or GCP.
Use `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.

You can install the latest CloudTik wheels via the following links. These daily releases do not go through the full release process.

| Linux | Installation |
|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl" ` |

Please refer to [User Guide: Installation](../UserGuide/installation.md) for the package links for other Python versions.

### 3. Authentication to Cloud Providers API

@@ -75,6 +68,7 @@ variable as described in the [Setting the environment variable](https://cloud.go
on your working machine.

### 4. Creating a Workspace for Clusters.
Once you have authenticated with your cloud provider, you can start creating a Workspace.

CloudTik uses **Workspace** concept to easily manage shared Cloud resources such as VPC network resources,
identity and role resources, firewall or security groups, and cloud storage resources.
@@ -100,7 +94,7 @@ provider:
- 0.0.0.0/0
```
*NOTE:* `0.0.0.0/0` in `allowed_ssh_sources` will allow any IP addresses to connect to your cluster as long as it has the cluster private key.
For more security, make sure to change from `0.0.0.0/0` to restricted CIDR ranges for your case.
For better security, change `0.0.0.0/0` to CIDR ranges restricted to your own networks.

Use the following command to create and provision a Workspace:

@@ -110,6 +104,11 @@ cloudtik workspace create /path/to/your-workspace-config.yaml

Check `example/cluster` folder for more Workspace configuration file examples.

If you encounter problems creating a Workspace, a common cause is that your current cloud login account
doesn't have sufficient privileges to create some resources such as VPCs, storage, and public IP addresses.
Make sure your current account has sufficient privileges; an admin or owner role gives the best chance of
having all of them.

### 5. Starting a cluster with default runtimes

Now you can start a cluster running Spark by default:
13 changes: 6 additions & 7 deletions docs/source/UserGuide/installation.md
@@ -19,17 +19,16 @@ conda create -n cloudtik -y python=3.7
conda activate cloudtik
```

## Installing CloudTik from Daily Releases

You can install the latest CloudTik wheels via the following links. These daily releases do not go through the full release process.
To install these wheels, use the following `pip` command and wheels on different Cloud providers:
## Installing CloudTik

The following table shows the installation links for the latest CloudTik wheels for all supported Python versions.
To install, use the `pip` command with the wheel matching your Python version and cloud provider:

| Linux | Installation |
|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl" ` |
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl" ` |

Replace `cloudtik[aws]` with `cloudtik[azure]` or `cloudtik[gcp]` if you want to create clusters on Azure or GCP.
Use `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.
50 changes: 48 additions & 2 deletions tools/benchmarks/spark/README.md
@@ -1,16 +1,47 @@
# Run TPC-DS performance benchmark for Spark on Cloudtik cluster

## 1. Create a new Cloudtik cluster
## 1. Create a new Cloudtik cluster with TPC-DS toolkit
To generate data and run the TPC-DS benchmark on a CloudTik cluster, some tools must be installed in advance.
You have several options for doing this.

### Option 1: Use a CloudTik Spark runtime image with TPC-DS toolkit installed (Recommended)
In your cluster config, under the `docker` key, configure the Spark runtime image with the TPC-DS toolkit installed.

```buildoutcfg
docker:
image: "cloudtik/spark-runtime-tpcds:nightly"
```

This method is preferred because the toolkit comes precompiled, so installing it does not increase cluster start time.

### Option 2: Use bootstrap commands to compile and install the TPC-DS toolkit
We provide an installation script to simplify the installation of these dependencies.
You only need to add the following `bootstrap_commands` to the cluster configuration file when you start the cluster.
```buildoutcfg
bootstrap_commands:
    - wget -P ~/ https://raw.githubusercontent.com/oap-project/cloudtik/main/tools/benchmarks/spark/scripts/bootstrap-benchmark.sh &&
bash ~/bootstrap-benchmark.sh --workload=tpcds
```
Please note that compiling the toolkit usually takes a long time, which makes the cluster ready time much longer than usual.

### Option 3: Use exec commands to compile and install the TPC-DS toolkit on all nodes
If your cluster has already started, you can run the compile-and-install command on all nodes to achieve the same result.
```buildoutcfg
cloudtik exec your-cluster-config.yaml "wget -P ~/ https://raw.githubusercontent.com/oap-project/cloudtik/main/tools/benchmarks/spark/scripts/bootstrap-benchmark.sh && bash ~/bootstrap-benchmark.sh --workload=tpcds" --all-nodes
```

Please note that compiling the toolkit usually takes a long time.
You may need to run the command with the `--tmux` option for background execution
to avoid terminal disconnection in the middle, though you will then need to check for completion yourself.

## 2. Generate data
Use `cloudtik status your-cluster-config.yaml` to check that all workers are in the ready (up-to-date) status.
If the workers are not ready, any job you submit will stay pending for lack of workers.

We provide the data generation Scala script **[tpcds-datagen.scala](./scripts/tpcds-datagen.scala)** for generating data at different scales.
Execute the following command to submit and run the datagen script on the cluster:
@@ -19,6 +50,12 @@ cloudtik submit your-cluster-config.yaml $CLOUTIK_HOME/tools/benchmarks/spark/sc
```
Replace the cluster configuration file, the paths, and the `spark.driver.scaleFactor` and `spark.driver.fsdir` values in the above command to match your case.

The above command submits and runs the job in the foreground, which may take a long time.
You may need to run the command with the `--tmux` option for background execution
to avoid terminal disconnection in the middle, though you then won't see the command result directly.
Please refer to [CloudTik Submitting Jobs](https://cloudtik.readthedocs.io/en/latest/UserGuide/AdvancedTasks/submitting-jobs.html) for
details on running jobs in the background.

## 3. Run TPC-DS power test

We provide the power test Scala script **[tpcds-power-test.scala](./scripts/tpcds-power-test.scala)** for running the TPC-DS power test on a CloudTik cluster.
@@ -27,3 +64,12 @@ Execute the following command to submit and run the power test script on the clu
cloudtik submit your-cluster-config.yaml $CLOUTIK_HOME/tools/benchmarks/spark/scripts/tpcds-power-test.scala --conf spark.driver.scaleFactor=1 --conf spark.driver.fsdir="s3a://s3_bucket_name" --conf spark.sql.shuffle.partitions="\$[\$(cloudtik head info --worker-cpus)*2]" --conf spark.driver.iterations=1 --jars \$HOME/runtime/benchmark-tools/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar
```
Replace the cluster configuration file, the paths, and the `spark.driver.scaleFactor`, `spark.driver.fsdir`, and `spark.driver.iterations` values in the above command to match your case.
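The `spark.sql.shuffle.partitions` expression in the command above evaluates to twice the cluster's worker CPU count. As a small sketch of that shell arithmetic (assuming `worker_cpus` holds the number reported by `cloudtik head info --worker-cpus`):

```python
# Sketch: the shell expression $[$(cloudtik head info --worker-cpus)*2]
# simply doubles the worker CPU count reported by the head node.
def shuffle_partitions(worker_cpus: int, multiplier: int = 2) -> int:
    return worker_cpus * multiplier

print(shuffle_partitions(32))  # 32 worker CPUs -> 64 shuffle partitions
```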

Just like data generation, you may need to run the command with the `--tmux` option for background execution.

When the test is done, you have two options to get the query time results:
1. The query time results are printed at the end of the job output.
2. The query time results are saved to the configured storage under the location pattern
   `${fsdir}/shared/data/results/tpcds_${format}/${scaleFactor}/`
   (replace `fsdir` and `scaleFactor` with the values used when submitting the job; if `format` is not specified, it defaults to `parquet`).
   You can retrieve the saved files with the `hadoop` command after attaching to the cluster.
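The results location above can also be reconstructed programmatically; here is a small sketch of the documented path pattern (the function and parameter names are illustrative, not part of CloudTik):

```python
# Sketch: build the TPC-DS results path following the documented pattern
# ${fsdir}/shared/data/results/tpcds_${format}/${scaleFactor}/
def tpcds_results_path(fsdir: str, scale_factor: int, data_format: str = "parquet") -> str:
    return f"{fsdir}/shared/data/results/tpcds_{data_format}/{scale_factor}/"

print(tpcds_results_path("s3a://s3_bucket_name", 1))
```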
