auto-scale experiment plan #419

Merged (4 commits), Oct 24, 2017
62 changes: 62 additions & 0 deletions doc/autoscale_experiment.md
@@ -0,0 +1,62 @@
# Auto-scaling Experiment Design

## Environment Requirements

- A Kubernetes cluster running version 1.6.x.
- PaddleCloud installed from the latest develop branch.
- At least 4 Kubernetes nodes, each with at least 2 GPU cards.
- The dataset prepared as multiple files in RecordIO format.
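
A quick way to sanity-check these requirements from the command line is sketched below; it assumes `kubectl` is already configured against the target cluster and that the nodes advertise GPUs under the `alpha.kubernetes.io/nvidia-gpu` resource name used by Kubernetes 1.6.

```bash
# Kubernetes server version should report 1.6.x
kubectl version --short

# the experiment needs at least 4 schedulable nodes
kubectl get nodes --no-headers | wc -l

# each node should list at least 2 GPU cards in its capacity
# (Kubernetes 1.6 exposes GPUs as alpha.kubernetes.io/nvidia-gpu)
kubectl describe nodes | grep -i "nvidia-gpu"
```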

## Experiment Metrics
Collaborator:
Not sure whether "Metric" is the right name here. How should the dimensions being compared be represented?

Collaborator (Author):
Or would it be clearer to list a table here?

Collaborator (Author):
A table has been added under each test case; for the actual experiment results we can add more sampling points and present them as charts.

- Computing resource utilization (requests / total) for the cluster.
- The total number of Pods across all training jobs.
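
As a minimal sketch, both metrics can be sampled with `kubectl`; the per-node CPU request breakdown uses the same command suggested in the review below, and the `paddle-job` label selector is an assumption about how PaddleCloud labels trainer Pods.

```bash
# per-node CPU requests vs. capacity (requests / total)
kubectl describe nodes | grep -A 2 -e "^\s*CPU Requests"

# total number of Pods across all training jobs; the "paddle-job" label
# key is an assumed PaddleCloud label and may need adjusting
kubectl get pods --all-namespaces -l paddle-job --no-headers | wc -l
```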

## Before Starting The Experiment

- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo (a hedged submission sketch appears at the end of this section).
Collaborator:
I think all the demos in the book should be tested!

Collaborator (Author):
Done.

- We have 240 CPU cores and 80 GPU cards in total.
Collaborator:
There is no need to state how many resources we currently have. We could list a table and fill in the final experiment environment there.

Collaborator (Author):
We do not have to write the real resource numbers; they are only used here for computing the experiment data.
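
For reference, the demo jobs in the test cases below could be submitted with the PaddleCloud client roughly as follows. This is only a hedged sketch based on the recognize_digits demo; the flag names (`-jobname`, `-cpu`, `-parallelism`, `-pservers`, `-pscpu`, `-psmemory`, `-entry`) and their values should be checked against the PaddleCloud client documentation before use.

```bash
# submit the recognize_digits demo as a training job (flags are illustrative
# and should be verified against the PaddleCloud client docs)
paddlecloud submit -jobname mnist-demo \
  -cpu 1 \
  -parallelism 100 \
  -pservers 2 \
  -pscpu 1 \
  -psmemory 1Gi \
  -entry "python train.py" \
  demo/recognize_digits
```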


## Test Cases

### Comparing the auto-scaling training job and the general training job

- Submit the general training job
1. Submit a job (job-A) that requests 100 trainer Pods (1 CPU core per Pod), 2 pservers and 1 master.
1. Submit another job (job-B) that requests 200 trainer Pods (1 CPU core per Pod), 2 pservers and 1 master.
1. job-B will stay in the pending state until job-A finishes, because there are not enough CPU cores for both requests (300 trainer cores requested against 240 cores in total).
- Submit the auto-scaling training job
1. Submit a job (job-A) that requests 100 trainer Pods (1 CPU core per Pod, min-instances 50, max-instances 500), 2 pservers and 1 master. job-A will immediately be scaled up to use the maximum free resources (up to 500 trainer Pods).
1. Submit another job (job-B) that requests 200 trainer Pods (1 CPU core per Pod, min-instances 50, max-instances 200), 2 pservers and 1 master.
1. job-A will be scaled down so that job-A and job-B run in the cluster at the same time, together using the maximum free resources.

- Experiment metrics (a hedged observation sketch follows the example result table below)
1. Compare the **CPU utilization** of the auto-scaling training job and the general training job.
Collaborator (@helinwang, Oct 20, 2017):
Maybe we can add cluster-wide overall CPU / GPU utilization:

```
$ kubectl describe nodes | grep -A 2 -e "^\\s*CPU Requests"
  CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ------------	----------	---------------	-------------
  3865m (96%)	3600m (90%)	2760Mi (47%)	2770Mi (47%)
```

Collaborator (Author):
Agree on CPU utilization, but maybe there is no difference between the CPU and GPU resources as far as the auto-scaling feature is concerned? How about we only use CPU as the computing resource?

Collaborator (Author):
Or we could follow the third scenario mentioned in @typhoonzero #419 (comment) and test mixed CPU and GPU scheduling, but that probably falls outside the scope of the auto-scaling feature.

Collaborator:
How about measuring both CPU and GPU utilization during the test (it should only take a one-line command each) and deciding at the end whether to use the GPU numbers?

Collaborator (Author):
Collecting the GPU utilization data may be a bit more involved (it requires scanning the Pods, or reading the data from InfluxDB; the Kubernetes API cannot return it directly), but we can measure both.
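
For illustration, one hedged way to "scan the Pods" for cluster-wide GPU requests is sketched below; it assumes `jq` is available and that GPUs are requested under the Kubernetes 1.6 resource name `alpha.kubernetes.io/nvidia-gpu`.

```bash
# sum the GPU requests declared by every container in the cluster
kubectl get pods --all-namespaces -o json \
  | jq '[.items[].spec.containers[].resources.requests["alpha.kubernetes.io/nvidia-gpu"] // "0" | tonumber] | add'
```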

1. Compare the **training time** for each job.
1. Compare the **average waiting time** for each job.

- Experiment result example:

metrics | auto-scaling training job | general training job
-- | -- | --
training time | 6h | 8h
average waiting time | 0 | 2h
CPU utilization | 100% | 60%
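
The sketch below is one hedged way to observe the waiting time and the scaling behaviour described above; the `paddle-job=job-b` selector and the job names are placeholders that depend on how the jobs were actually submitted.

```bash
# watch job-B's trainer Pods: while job-A holds the CPU quota they should stay
# Pending, and the time until they turn Running gives the waiting time
kubectl get pods -l paddle-job=job-b -o wide -w

# sample the per-node CPU requests once a minute to plot CPU utilization over time
while true; do
  date
  kubectl describe nodes | grep -A 2 -e "^\s*CPU Requests"
  sleep 60
done
```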

### Hybrid Deployment with Online Serving and Offline Training Job

In a general-purpose cluster we usually deploy online serving workloads such as an Nginx cluster, data serving such as MySQL, and some offline training jobs. We will deploy some Nginx Pods to simulate this production environment.

- Deploy Nginx Pods in the cluster and configure an HPA on the Nginx Deployment (see the sketch after the example result table).
- Submit a training job that requests 100 trainer Pods (2 pservers, 1 master, min-instance=2, max-instance=100); the trainers will be scaled up immediately to use the maximum free resources in the cluster.
- Increase the QPS of the Nginx service; the Nginx Pod count will be scaled up by the HPA, and the training job will be scaled down by the TrainingJob controller.
- Experiment metrics
1. CPU utilization for the cluster (requests / total).
1. Trainer Pod count.
- Experiment result example

metrics | QPS = 10k | QPS = 100k | QPS = 500k
-- | -- | -- | --
Trainer Pods | 100 | 80 | 50
Nginx Pods | 80 | 100 | 150
CPU utilization | 100% | 100% | 100%
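
A hedged sketch of how this scenario could be driven is shown below; the Deployment name `nginx`, the HPA thresholds, and the use of ApacheBench (`ab`) as the load generator are illustrative assumptions, not part of the plan itself.

```bash
# scale the Nginx Deployment on CPU usage with a Horizontal Pod Autoscaler
kubectl autoscale deployment nginx --cpu-percent=50 --min=80 --max=150

# raise the QPS against the Nginx service; any load generator works,
# ApacheBench is used here only as an example
ab -n 1000000 -c 200 http://<nginx-service-ip>/

# watch the Nginx Pods scale up while the trainer Pods are scaled down
kubectl get hpa nginx -w
kubectl get pods -l paddle-job --no-headers | wc -l
```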