This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Commit

Update the doc with the current version and update the tpcds guide (#809)
jerrychenhf authored Sep 2, 2022
1 parent b910d63 commit f8cc01c
Showing 5 changed files with 78 additions and 27 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -21,6 +21,6 @@ jobs:
- name: build wheel
run: |
bash ./build.sh
curl -X PUT --upload-file ./python/dist/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl http://23.95.96.95:8000/$GITHUB_ACTOR/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl
curl -X PUT --upload-file ./python/dist/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl http://23.95.96.95:8000/$GITHUB_ACTOR/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl
21 changes: 14 additions & 7 deletions README.md
@@ -40,19 +40,19 @@ Take AWS for example,

```
# if running CloudTik on aws
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl"
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl"
```

Replace `cloudtik[aws]` with `cloudtik[azure]` or `cloudtik[gcp]` if you want to create clusters on Azure or GCP.
Use `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.

You can install the latest CloudTik wheels via the following links. These daily releases do not go through the full release process.
The following table shows the installation links for the latest CloudTik wheels for all supported Python versions.

| Linux | Installation |
|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl" ` |
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl" ` |
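The wheel links in the table differ only in the CPython ABI tag embedded in the filename. As a side note, a small Python sketch (the base URL, version, and tags below simply mirror the table rows above) shows how the link for a given Python version is assembled:

```python
# Sketch: assemble the CloudTik wheel URL for a given Python version.
# The base URL, version, and ABI tags simply mirror the table above.
BASE = "https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik"
ABI_TAGS = {"3.7": "cp37-cp37m", "3.8": "cp38-cp38", "3.9": "cp39-cp39"}

def wheel_url(python_version: str, cloudtik_version: str = "0.9.1") -> str:
    tag = ABI_TAGS[python_version]
    return f"{BASE}/cloudtik-{cloudtik_version}-{tag}-manylinux2014_x86_64.whl"

print(wheel_url("3.8"))
```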


### 3. Authentication to Cloud Providers API
@@ -88,6 +88,7 @@ variable as described in the [Setting the environment variable](https://cloud.go
on your working machine.

### 4. Creating a Workspace for Clusters.
Once you have authenticated with your cloud provider, you can start creating a Workspace.

CloudTik uses **Workspace** concept to easily manage shared Cloud resources such as VPC network resources,
identity and role resources, firewall or security groups, and cloud storage resources.
@@ -113,7 +114,7 @@ provider:
- 0.0.0.0/0
```
*NOTE:* `0.0.0.0/0` in `allowed_ssh_sources` will allow any IP addresses to connect to your cluster as long as it has the cluster private key.
For more security, make sure to change from `0.0.0.0/0` to restricted CIDR ranges for your case.
For better security, change `0.0.0.0/0` to CIDR ranges restricted to your own networks.
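To see why `0.0.0.0/0` is wide open, here is a quick check with Python's standard `ipaddress` module (the IP addresses below are arbitrary documentation examples, not values from CloudTik):

```python
import ipaddress

# Sketch: 0.0.0.0/0 matches every IPv4 address, while a restricted
# CIDR range only admits addresses inside that range.
def ip_allowed(ip: str, cidr: str) -> bool:
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)

print(ip_allowed("203.0.113.9", "0.0.0.0/0"))        # any address is allowed
print(ip_allowed("203.0.113.9", "198.51.100.0/24"))  # outside the range
```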

Use the following command to create and provision a Workspace:

Expand All @@ -123,7 +124,12 @@ cloudtik workspace create /path/to/your-workspace-config.yaml

Check `example/cluster` folder for more Workspace configuration file examples.

### 5. Starting a cluster with default runtimes
If you encounter problems creating a Workspace, a common cause is that your current cloud login account
doesn't have sufficient privileges to create some resources such as VPCs, storage, and public IP addresses.
Make sure your current account has sufficient privileges; an admin or owner role gives the best chance of
having all of them.

### 5. Starting a cluster with Spark runtime

Now you can start a cluster running Spark by default:

@@ -173,6 +179,7 @@ auth:
```

The cluster key will be created automatically for AWS and GCP if not specified.
The generated private key file can be found in the `.ssh` folder of your home directory.
For Azure, you need to generate an RSA key pair manually (use `ssh-keygen -t rsa -b 4096` to generate a new SSH key pair)
and configure the public and private keys as follows:

19 changes: 9 additions & 10 deletions docs/source/GettingStarted/quick-start.md
@@ -27,20 +27,13 @@ Take AWS for example,

```
# if running CloudTik on aws
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl"
pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl"
```

Replace `cloudtik[aws]` with `cloudtik[azure]` or `cloudtik[gcp]` if you want to create clusters on Azure or GCP.
Use `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.

You can install the latest CloudTik wheels via the following links. These daily releases do not go through the full release process.

| Linux | Installation |
|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl" ` |

Please refer to [User Guide: Installation](../UserGuide/installation.md) for the package links for other Python versions.

### 3. Authentication to Cloud Providers API

@@ -75,6 +68,7 @@ variable as described in the [Setting the environment variable](https://cloud.go
on your working machine.

### 4. Creating a Workspace for Clusters.
Once you have authenticated with your cloud provider, you can start creating a Workspace.

CloudTik uses **Workspace** concept to easily manage shared Cloud resources such as VPC network resources,
identity and role resources, firewall or security groups, and cloud storage resources.
@@ -100,7 +94,7 @@ provider:
- 0.0.0.0/0
```
*NOTE:* `0.0.0.0/0` in `allowed_ssh_sources` will allow any IP addresses to connect to your cluster as long as it has the cluster private key.
For more security, make sure to change from `0.0.0.0/0` to restricted CIDR ranges for your case.
For better security, change `0.0.0.0/0` to CIDR ranges restricted to your own networks.

Use the following command to create and provision a Workspace:

@@ -110,6 +104,11 @@ cloudtik workspace create /path/to/your-workspace-config.yaml

Check `example/cluster` folder for more Workspace configuration file examples.

If you encounter problems creating a Workspace, a common cause is that your current cloud login account
doesn't have sufficient privileges to create some resources such as VPCs, storage, and public IP addresses.
Make sure your current account has sufficient privileges; an admin or owner role gives the best chance of
having all of them.

### 5. Starting a cluster with default runtimes

Now you can start a cluster running Spark by default:
13 changes: 6 additions & 7 deletions docs/source/UserGuide/installation.md
@@ -19,17 +19,16 @@ conda create -n cloudtik -y python=3.7
conda activate cloudtik
```

## Installing CloudTik from Daily Releases

You can install the latest CloudTik wheels via the following links. These daily releases do not go through the full release process.
To install these wheels, use the following `pip` command and wheels on different Cloud providers:
## Installing CloudTik

The following table shows the installation links for the latest CloudTik wheels for all supported Python versions.
To install, use the `pip` command with the wheel matching your Python version and cloud provider:

| Linux | Installation |
|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.0-cp37-cp37m-manylinux2014_x86_64.whl" ` |
| Python 3.9 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp39-cp39-manylinux2014_x86_64.whl" ` |
| Python 3.8 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp38-cp38-manylinux2014_x86_64.whl" ` |
| Python 3.7 | `pip install -U "cloudtik[aws] @ https://d30257nes7d4fq.cloudfront.net/downloads/cloudtik/cloudtik-0.9.1-cp37-cp37m-manylinux2014_x86_64.whl" ` |

Replace `cloudtik[aws]` with `cloudtik[azure]` or `cloudtik[gcp]` if you want to create clusters on Azure or GCP.
Use `cloudtik[all]` if you want to manage clusters with all supported Cloud providers.
50 changes: 48 additions & 2 deletions tools/benchmarks/spark/README.md
@@ -1,16 +1,47 @@
# Run TPC-DS performance benchmark for Spark on Cloudtik cluster

## 1. Create a new Cloudtik cluster
## 1. Create a new Cloudtik cluster with TPC-DS toolkit
To generate data and run the TPC-DS benchmark on a CloudTik cluster, some tools must be installed in advance.
You have several options for doing this.

### Option 1: Use a CloudTik Spark runtime image with TPC-DS toolkit installed (Recommended)
In your cluster config, under the `docker` key, configure the Spark runtime image with the TPC-DS toolkit installed.

```buildoutcfg
docker:
image: "cloudtik/spark-runtime-tpcds:nightly"
```

This method is preferred because the toolkit comes precompiled, so installing it does not increase cluster start time.

### Option 2: Use bootstrap commands to compile and install the TPC-DS toolkit
We provide an installation script to simplify the installation of these dependencies.
You only need to add the following `bootstrap_commands` to the cluster configuration file when you start the cluster.
```buildoutcfg
bootstrap_commands:
    - wget -P ~/ https://raw.githubusercontent.com/oap-project/cloudtik/main/tools/benchmarks/spark/scripts/bootstrap-benchmark.sh &&
bash ~/bootstrap-benchmark.sh --workload=tpcds
```
Please note that compiling the toolkit usually takes a long time, which makes the cluster ready time much longer than usual.

### Option 3: Use exec commands to compile and install the TPC-DS toolkit on all nodes
If your cluster has already started, you can run the compile-and-install command on all nodes to achieve the same result.
```buildoutcfg
cloudtik exec your-cluster-config.yaml "wget -P ~/ https://raw.githubusercontent.com/oap-project/cloudtik/main/tools/benchmarks/spark/scripts/bootstrap-benchmark.sh && bash ~/bootstrap-benchmark.sh --workload=tpcds" --all-nodes
```

Please note that compiling the toolkit usually takes a long time.
You may need to run the command with the `--tmux` option for background execution
to avoid terminal disconnection in the middle, though you will then need to check for completion yourself.

## 2. Generate data
Use `cloudtik status your-cluster-config.yaml` to check that all workers are in the ready (up-to-date) status.
If the workers are not ready, any job you submit will stay pending for lack of workers.

We provide the data generation Scala script **[tpcds-datagen.scala](./scripts/tpcds-datagen.scala)** for generating data at different scales.
Execute the following command to submit and run the datagen script on the cluster:
@@ -19,6 +50,12 @@ cloudtik submit your-cluster-config.yaml $CLOUTIK_HOME/tools/benchmarks/spark/sc
```
Replace the cluster configuration file, the paths, and the `spark.driver.scaleFactor` and `spark.driver.fsdir` values in the above command to match your case.

The above command submits and runs the job in the foreground, which may take a long time.
You may need to run the command with the `--tmux` option for background execution
to avoid terminal disconnection in the middle, though you then won't see the command result directly.
Please refer to [CloudTik Submitting Jobs](https://cloudtik.readthedocs.io/en/latest/UserGuide/AdvancedTasks/submitting-jobs.html) for
details on running jobs in the background.

## 3. Run TPC-DS power test

We provide the power test Scala script **[tpcds-power-test.scala](./scripts/tpcds-power-test.scala)** for running the TPC-DS power test on a CloudTik cluster.
@@ -27,3 +64,12 @@ Execute the following command to submit and run the power test script on the clu
cloudtik submit your-cluster-config.yaml $CLOUTIK_HOME/tools/benchmarks/spark/scripts/tpcds-power-test.scala --conf spark.driver.scaleFactor=1 --conf spark.driver.fsdir="s3a://s3_bucket_name" --conf spark.sql.shuffle.partitions="\$[\$(cloudtik head info --worker-cpus)*2]" --conf spark.driver.iterations=1 --jars \$HOME/runtime/benchmark-tools/spark-sql-perf/target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar
```
Replace the cluster configuration file, the paths, and the `spark.driver.scaleFactor`, `spark.driver.fsdir`, and `spark.driver.iterations` values in the above command to match your case.
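The `spark.sql.shuffle.partitions` expression in the command above evaluates to twice the cluster's worker CPU count. As a small sketch of that shell arithmetic (assuming `worker_cpus` holds the number reported by `cloudtik head info --worker-cpus`):

```python
# Sketch: the shell expression $[$(cloudtik head info --worker-cpus)*2]
# simply doubles the worker CPU count reported by the head node.
def shuffle_partitions(worker_cpus: int, multiplier: int = 2) -> int:
    return worker_cpus * multiplier

print(shuffle_partitions(32))  # 32 worker CPUs -> 64 shuffle partitions
```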

Just like data generation, you may need to run the command with the `--tmux` option for background execution.

When the test is done, you have two options to get the query time results:
1. The query time results are printed at the end of the job output.
2. The query time results are saved to the configured storage under the location pattern
   `${fsdir}/shared/data/results/tpcds_${format}/${scaleFactor}/`
   (replace `fsdir` and `scaleFactor` with the values used when submitting the job; if `format` is not specified, it defaults to `parquet`).
   You can retrieve the saved files with the `hadoop` command after attaching to the cluster.
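The results location above can also be reconstructed programmatically; here is a small sketch of the documented path pattern (the function and parameter names are illustrative, not part of CloudTik):

```python
# Sketch: build the TPC-DS results path following the documented pattern
# ${fsdir}/shared/data/results/tpcds_${format}/${scaleFactor}/
def tpcds_results_path(fsdir: str, scale_factor: int, data_format: str = "parquet") -> str:
    return f"{fsdir}/shared/data/results/tpcds_{data_format}/{scale_factor}/"

print(tpcds_results_path("s3a://s3_bucket_name", 1))
```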
