# Auto-scaling Experiment Design

## Environment Requirements

- A Kubernetes cluster running version 1.6.x.
- PaddleCloud installed from the latest develop branch.
- At least 4 Kubernetes nodes, each with at least 2 GPU cards.
- A dataset prepared as multiple files in RecordIO format (see the sketch below).
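
For the RecordIO requirement, the conversion could look like the sketch below. It assumes the `paddle.v2.dataset.common.convert` helper shown in the PaddleCloud demos; the output path, the shard count, and the exact meaning of the third argument are assumptions for illustration, so check the demo's own preparation script before relying on it.

```python
# Sketch: shard the MNIST training set (used by the recognize_digits demo)
# into multiple RecordIO files. The convert() signature is assumed from the
# PaddleCloud demos; paths and shard count are arbitrary illustrative values.
import paddle.v2.dataset as dataset

dataset.common.convert("./dataset",            # output directory for shards
                       dataset.mnist.train(),  # reader yielding (image, label)
                       10,                     # shard count (assumed meaning)
                       "mnist-train")          # output file name prefix
```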

## Experiment Metrics

- Computing resource utilization (requests / total) for the cluster (see the sketch below).
- The total number of Pods across all training jobs.
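
The first metric could be collected with the Kubernetes Python client roughly as follows. This is a minimal sketch, assuming a reachable kubeconfig and CPU quantities expressed as whole cores or millicores; it is not part of the experiment tooling itself.

```python
# Minimal sketch: cluster CPU utilization = sum(requests) / sum(allocatable).
# Assumes a kubeconfig is available; only plain ("2") and millicore ("500m")
# CPU quantity formats are handled.
from kubernetes import client, config

def cpu_to_cores(quantity):
    """Convert a Kubernetes CPU quantity string to a float number of cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000.0
    return float(quantity)

def cluster_cpu_utilization():
    config.load_kube_config()  # or config.load_incluster_config()
    v1 = client.CoreV1Api()

    total = sum(cpu_to_cores(node.status.allocatable["cpu"])
                for node in v1.list_node().items)

    requested = 0.0
    for pod in v1.list_pod_for_all_namespaces().items:
        if pod.status.phase != "Running":  # count only scheduled, running Pods
            continue
        for container in pod.spec.containers:
            requests = container.resources.requests or {}
            if "cpu" in requests:
                requested += cpu_to_cores(requests["cpu"])
    return requested / total

print("CPU utilization (requests / total): %.1f%%"
      % (100 * cluster_cpu_utilization()))
```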

## Before Starting The Experiment

- All the demos in [book](https://github.com/PaddlePaddle/book) should be tested.
- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the experiment.
- The cluster has 240 CPU cores and 80 GPU cards in total.

## Test Cases

### Comparing the auto-scaling training job and the general training job

- Submit the general training job:
  1. Submit a job (job-A) which requests 100 trainer Pods (1 CPU core per Pod), 2 pservers and 1 master.
  1. Submit another job (job-B) which requests 200 trainer Pods (1 CPU core per Pod), 2 pservers and 1 master.
  1. Job-B will stay in the pending state until job-A finishes, because the two jobs' trainers alone request 300 CPU cores, more than the cluster's 240.
- Submit the auto-scaling training job:
  1. Submit a job (job-A) which requests 100 trainer Pods (1 CPU core per Pod, min-instances is 50, max-instances is 500), 2 pservers and 1 master. Job-A will immediately be scaled up to use the maximum free resources (up to 500 trainer Pods).
  1. Submit another job (job-B) which requests 200 trainer Pods (1 CPU core per Pod, min-instances is 50, max-instances is 200), 2 pservers and 1 master.
  1. Job-A will be scaled down so that job-A and job-B run in the cluster at the same time, together using the maximum free resources (a possible fair-sharing policy is sketched after the example table below).

- Experiment metrics:
  1. Compare the **CPU utilization** of the auto-scaling training job and the general training job.
  1. Compare the **training time** for each job.
  1. Compare the **average waiting time** for each job.

- Experiment result example:

metrics | auto-scaling training job | general training job
-- | -- | --
training time | 6h | 8h
average waiting time | 0h | 2h
CPU utilization | 100% | 60%
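
The scale-down behavior in the auto-scaling case could be driven by a policy like the fair-sharing sketch below. This is purely illustrative: the actual TrainingJob controller algorithm is not specified in this document, the `fair_share` helper is made up, and pserver/master cores are ignored for simplicity.

```python
# Illustrative fair-sharing sketch (NOT the actual TrainingJob controller):
# give every job its min-instances first, then split the remaining CPU cores
# proportionally to each job's requested trainer count, capped at max-instances.
# Assumes 1 CPU core per trainer and ignores pserver/master cores.

def fair_share(jobs, total_cores):
    """jobs: name -> (requested_trainers, min_instances, max_instances)."""
    plan = {name: spec[1] for name, spec in jobs.items()}   # start at min
    free = total_cores - sum(plan.values())
    total_requested = sum(spec[0] for spec in jobs.values())
    for name, (requested, _min, _max) in jobs.items():
        extra = int(free * requested / total_requested)     # proportional share
        plan[name] = min(plan[name] + extra, _max)
    return plan

jobs = {
    "job-A": (100, 50, 500),  # (requested trainers, min-instances, max-instances)
    "job-B": (200, 50, 200),
}
print(fair_share(jobs, total_cores=240))
# -> {'job-A': 96, 'job-B': 143}: both jobs keep at least their minimum of 50,
#    and the remaining 140 cores are split roughly 1:2 between A and B.
```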

### Hybrid Deployment with Online Serving and Offline Training Job

In a production cluster, online serving workloads (such as an Nginx cluster), dataset serving workloads (such as MySQL) and offline training jobs are usually deployed together. In this experiment we deploy Nginx Pods to simulate such a production environment.

- Deploy Nginx Pods in the cluster and configure an HPA (Horizontal Pod Autoscaler) on the Nginx Deployment (see the sketch at the end of this section).
- Submit a training job which requests 100 trainer Pods (2 pservers, 1 master, min-instance=2, max-instance=100); the trainers will immediately be scaled up to use the maximum free resources in the cluster.
- Increase the QPS on the Nginx serving; the HPA will scale up the Nginx Pod count, and the TrainingJob controller will scale down the training job accordingly.
- Experiment metrics:
  1. CPU utilization (requests / total) for the cluster.
  1. Trainer Pod count.
- Experiment result example:

metrics | QPS(10k) | QPS(100k) | QPS(500k)
-- | -- | -- | --
Trainer Pods | 100 | 80 | 50
Nginx Pods | 80 | 100 | 150
CPU utilization | 100% | 100% | 100%
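
The HPA in the first step above could be created with the Kubernetes Python client roughly as shown below. The Deployment name, namespace, scale-target API version, and the replica/CPU thresholds are assumptions chosen to mirror the example table, not values fixed by this design.

```python
# Sketch: attach an autoscaling/v1 HPA to the Nginx Deployment.
# The deployment name ("nginx"), namespace, and the replica/CPU targets
# are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="nginx"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="extensions/v1beta1",  # Deployment API group in k8s 1.6
            kind="Deployment",
            name="nginx"),
        min_replicas=80,                        # mirrors the example table above
        max_replicas=150,
        target_cpu_utilization_percentage=80))  # assumed CPU threshold
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```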
