auto-scale experiment plan #419
# Auto-scaling Experiment Design

## Environment Requirements

- Kubernetes cluster with version 1.6.x installed.
- PaddleCloud installed from the latest develop branch.
- At least 4 Kubernetes nodes; each node should have at least 2 GPU cards.
- Dataset prepared as multiple files in RecordIO format.
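Since the last requirement asks for the dataset to be sharded into multiple RecordIO files, the sketch below shows the idea with a minimal length-prefixed record layout. This is an illustration only: the real RecordIO format used by Paddle adds chunking, checksums, and compression, and the file-naming scheme here is an assumption.

```python
import struct

def write_records(path, records):
    """Write byte records to one shard, each prefixed with a 4-byte length.

    Minimal length-prefixed layout for illustration; NOT Paddle's exact
    RecordIO binary format.
    """
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack("<I", len(rec)))
            f.write(rec)

def read_records(path):
    """Read back all records from one shard."""
    out = []
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if not head:
                break
            (n,) = struct.unpack("<I", head)
            out.append(f.read(n))
    return out

def shard_dataset(records, num_shards, prefix="train"):
    """Round-robin records into num_shards files named like train-00000."""
    shards = [[] for _ in range(num_shards)]
    for i, rec in enumerate(records):
        shards[i % num_shards].append(rec)
    paths = []
    for i, part in enumerate(shards):
        path = "%s-%05d" % (prefix, i)
        write_records(path, part)
        paths.append(path)
    return paths
```

Sharding into multiple files lets each trainer Pod read a disjoint subset of the data, which is what makes scaling the trainer count up and down cheap.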
## Experiment Metrics

- Computing resource utilization (requests / total) of the cluster.
- Total number of Pods across all training jobs.
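The utilization metric is plain arithmetic over numbers that, in practice, come from the Kubernetes API (allocatable CPU per node and CPU requests per Pod). A small sketch, where the 4-node, 60-cores-each layout is an assumption chosen to match the 240-core cluster used in this experiment:

```python
def cluster_utilization(pod_requests, node_capacities):
    """Computing-resource utilization = sum of Pod CPU requests / total
    allocatable CPU cores. Inputs are plain numbers; fetching them from
    the Kubernetes API is out of scope for this sketch."""
    total = sum(node_capacities)
    if total == 0:
        raise ValueError("cluster has no allocatable CPU")
    return sum(pod_requests) / total

# Hypothetical layout: 4 nodes x 60 cores = 240 cores, 144 one-core trainers.
print(cluster_utilization([1.0] * 144, [60] * 4))  # -> 0.6
```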
## Before Starting the Experiment

- All the demos in [book](https://github.com/PaddlePaddle/book) should be tested.
- We will use [recognize_digits](https://github.com/PaddlePaddle/cloud/tree/develop/demo/recognize_digits) as the training job for the demo.
- We have 240 CPU cores and 80 GPU cards in total.
## Test Cases

### Comparing the auto-scaling training job and the general training job

- Submit the general training jobs:
  1. Submit a job (job-A) that requests 100 trainer Pods (1 CPU core per Pod), 2 pservers, and 1 master.
  1. Submit another job (job-B) that requests 200 trainer Pods (1 CPU core per Pod), 2 pservers, and 1 master.
  1. Job-B will stay in pending status until job-A finishes, because there are not enough CPU cores to satisfy its request.
- Submit the auto-scaling training jobs:
  1. Submit a job (job-A) that requests 100 trainer Pods (1 CPU core per Pod, min-instance=50, max-instance=500), 2 pservers, and 1 master. Job-A will immediately be scaled up to use the maximum free resources (at most 500 trainer Pods).
  1. Submit another job (job-B) that requests 200 trainer Pods (1 CPU core per Pod, min-instance=50, max-instance=200), 2 pservers, and 1 master.
  1. Job-A will be scaled down so that job-A and job-B run in the cluster at the same time, together using the maximum free resources.
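The expected behaviour in the auto-scaling steps above can be sketched as a toy model. This is an assumption-laden illustration, not the real TrainingJob controller: every job first receives its min-instance trainers, then the remaining CPU cores are handed out one trainer at a time (1 core per trainer) until each job reaches its max-instance or the cluster is full; pserver and master overhead is ignored.

```python
# Hypothetical job descriptions mirroring the steps above.
job_a = {"name": "job-A", "min_instance": 50, "max_instance": 500}
job_b = {"name": "job-B", "min_instance": 50, "max_instance": 200}

def autoscale(jobs, capacity):
    """Distribute `capacity` CPU cores among jobs: min-instance first,
    then round-robin growth up to each job's max-instance."""
    counts = {j["name"]: j["min_instance"] for j in jobs}
    free = capacity - sum(counts.values())
    assert free >= 0, "cluster cannot satisfy every job's min-instance"
    progressed = True
    while free > 0 and progressed:
        progressed = False
        for j in jobs:
            if free > 0 and counts[j["name"]] < j["max_instance"]:
                counts[j["name"]] += 1
                free -= 1
                progressed = True
    return counts

print(autoscale([job_a], 240))         # {'job-A': 240}: grows to fill the cluster
print(autoscale([job_a, job_b], 240))  # {'job-A': 120, 'job-B': 120}: job-A shrank
```

Note that with only 240 cores, job-A never reaches its max-instance of 500; the cap that actually binds is the free capacity of the cluster.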
- Experiment metrics:
  1. Compare the **CPU utilization** of the auto-scaling training job and the general training job.

     > Review comments:
     > - Maybe we can also add cluster-wide overall CPU / GPU utilization.
     > - Agree on CPU utilization, but maybe there is no difference between CPU and GPU resources as far as the auto-scaling feature is concerned? How about we only use CPU as the computing resource?
     > - Or we could test the mixed CPU/GPU scheduling scenario, the third scenario @typhoonzero mentioned in #419 (comment), though that may fall outside the scope of the auto-scaling feature.
     > - How about measuring both CPU and GPU utilization during the test (it should only take one command each), and deciding at the end whether to use the GPU numbers?
     > - Collecting GPU utilization data may be slightly more complex (it requires scanning the Pods or reading the data from influxDB; it cannot be fetched directly from the Kubernetes API), but we can measure both.

  1. Compare the **training time** of each job.
  1. Compare the **average waiting time** of each job.
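For the waiting-time metric, a minimal sketch of the intended arithmetic: waiting time is the gap between when a job is submitted and when its first trainer Pod starts running, averaged over all jobs. The field names `submit_hour` and `start_hour` are hypothetical, chosen for illustration.

```python
def average_waiting_time(jobs):
    """Average waiting time in hours: submission until the first trainer
    Pod starts running, averaged over all jobs."""
    waits = [j["start_hour"] - j["submit_hour"] for j in jobs]
    return sum(waits) / len(waits)

# General (non-scaling) scheduling: job-B pends until job-A finishes.
print(average_waiting_time([
    {"submit_hour": 0, "start_hour": 0},   # job-A starts immediately
    {"submit_hour": 0, "start_hour": 4},   # job-B waits 4 hours
]))  # -> 2.0
```

Under auto-scaling, both jobs start immediately (at reduced size), so this metric is expected to drop toward 0.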
- Experiment result example:

  metrics | auto-scaling training job | general training job
  -- | -- | --
  training time | 6h | 8h
  average waiting time | 0 | 2h
  CPU utilization | 100% | 60%
### Hybrid Deployment with Online Serving and Offline Training Jobs

In a general cluster, online serving (such as an Nginx cluster), dataset serving (such as MySQL), and offline training jobs are deployed side by side. We will deploy some Nginx Pods to simulate this production environment.
- Deploy Nginx Pods in the cluster and configure an HPA on the Nginx Deployment.
- Submit a training job that requests 100 trainer Pods (2 pservers, 1 master, min-instance=2, max-instance=100); the trainers will immediately be scaled up to use the maximum free resources in the cluster.
- Increase the QPS of the Nginx serving; the Nginx Pod count will be scaled up by the HPA, and the training job will be scaled down by the TrainingJob controller.
- Experiment metrics:
  1. CPU utilization of the cluster (requests / total).
  1. Trainer Pod count.
- Experiment result example:
metrics | QPS(10k) | QPS(100k) | QPS(500k)
-- | -- | -- | --
Trainer Pods | 100 | 80 | 50
Nginx Pods | 80 | 100 | 150
CPU utilization | 100% | 100% | 100%
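The interplay between the HPA and the TrainingJob controller in this test case amounts to the training job yielding whatever the serving workload claims, within its min/max bounds. A toy model, assuming 1 CPU core per Pod of either kind and a hypothetical free capacity of 180 cores (the real controller reacts to actual requests and allocatable resources):

```python
def trainer_pods(nginx_pods, capacity, min_instance=2, max_instance=100):
    """Trainer Pods the TrainingJob controller can keep once the HPA has
    claimed `nginx_pods` of the shared capacity, clamped to the job's
    min-instance / max-instance bounds."""
    free = capacity - nginx_pods
    return max(min_instance, min(max_instance, free))

# As Nginx scales up with rising QPS, the training job shrinks.
for nginx in (80, 100, 150):
    print(nginx, trainer_pods(nginx, capacity=180))
```

At the low end the clamp keeps at least min-instance trainers alive even when the cluster is saturated, which models why CPU utilization stays near 100% across all QPS levels in the table above.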
> Review comments:
> - Not sure whether "Metric" is the right term here. How should we represent the dimensions being compared?
> - Or would a table be clearer here?
> - A table has been added under each test case; for the actual experiment results we can add sampling points and present them as charts.