Autoscaling cluster experiment design #413

Closed

typhoonzero opened this issue Oct 19, 2017 · 4 comments
@typhoonzero
Collaborator

typhoonzero commented Oct 19, 2017

Environment requirements:

  • A cluster with Kubernetes 1.6.x installed, preferably with paddlecloud installed as well.
  • At least 4 nodes; each node should have multiple GPUs.
  • The dataset converted to RecordIO format and split into multiple files (see the sketch below).
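A minimal sketch of the dataset preparation step, assuming the paddle.v2 dataset API available at the time (in particular, that paddle.v2.dataset.common.convert(output_path, reader, line_count, name_prefix) shards a reader into multiple RecordIO files of roughly line_count records each):

```python
# Sketch: convert the MNIST training set into sharded RecordIO files so that
# the master can dispatch the shards as tasks to an elastic number of trainers.
# Assumes paddle.v2.dataset.common.convert with the signature
# (output_path, reader, line_count, name_prefix).
import paddle.v2.dataset.mnist as mnist
import paddle.v2.dataset.common as common

common.convert("./mnist_recordio", mnist.train(), 8192, "mnist_train")
```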

Test cases:

  1. Simple:
    • start an mnist training job with 2 pservers, 2 trainers, and 1 master; trainers can scale between 2~100 pods. Trainers should immediately scale up to use the maximum free resources in the cluster.
    • start another deployment simulating online cluster load. Trainers should scale down if not enough resources are left.
  2. Simple with GPU:
    • same as case 1, but the mnist job requests GPU resources.
  3. Many jobs:
    • start 5~10 mnist training jobs with different parallelism settings, each scaling between 2~100 pods. The jobs should scale equally and fairly.
    • add one load to the cluster at a time and observe the down-scaling of the jobs (see the monitoring sketch after this list).
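To record the scaling behavior for cases 1~3, a small poller like the one below could log the number of running trainer pods over time. This is only a sketch, assuming the kubernetes Python client and a hypothetical label paddle-job=mnist on the trainer pods (the actual label depends on how paddlecloud submits the job):

```python
# Sketch: sample the number of running trainer pods every 10 seconds.
# Add one "online load" deployment between samples and watch the count drop.
import time
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

def running_trainers(namespace="default", selector="paddle-job=mnist"):
    # "paddle-job=mnist" is a placeholder label selector for the trainer pods.
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    return sum(1 for p in pods if p.status.phase == "Running")

while True:
    print("%d\t%d" % (int(time.time()), running_trainers()))
    time.sleep(10)
```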
@putcn

putcn commented Oct 19, 2017

👍
Can we also publish a white paper describing this experiment along with the PR article?

@typhoonzero
Collaborator Author

@putcn Sure. Putting this experiment in the PR article is also fine.

@Yancey1989
Collaborator

Yancey1989 commented Oct 19, 2017

Some additional ideas:

  • Computing resources:
    • CPU: 120 cores
    • GPU: 8 GPUs
  • Experiment metrics:
    • Compute utilization; maybe we can use (requests / total) as the target (a rough sketch follows this list).
    • Training time for each job.
  • Additional test case:
    • Inference serving with 5 pods in the cluster.
    • Start a training job with 100 trainers (2 pservers + 1 master).
    • Start another training job with 20 trainers (2 pservers + 1 master).
    • Compare the experiment metrics of the two jobs.
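A rough sketch of the (requests / total) CPU utilization metric suggested above, using the kubernetes Python client; CPU quantities are assumed to be expressed either as whole cores ("4") or millicores ("2500m"):

```python
# Sketch: cluster CPU utilization = sum of container CPU requests of running
# pods / sum of allocatable CPU over all nodes.
from kubernetes import client, config

def cpu_to_millicores(q):
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

config.load_kube_config()
v1 = client.CoreV1Api()

total = sum(cpu_to_millicores(n.status.allocatable["cpu"])
            for n in v1.list_node().items)

requested = 0
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.status.phase != "Running":
        continue
    for c in pod.spec.containers:
        reqs = c.resources.requests if c.resources and c.resources.requests else {}
        if "cpu" in reqs:
            requested += cpu_to_millicores(reqs["cpu"])

print("cluster CPU utilization: %.2f%%" % (100.0 * requested / total))
```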

@helinwang
Collaborator

helinwang commented Oct 19, 2017

Some additional ideas:

We need to test the cluster running Pods other than PaddlePaddle Pods (e.g., nginx, databases), and show that the training job is scaled down when the QPS of the nginx server increases (we will actually measure the CPU limit, but we can plot the QPS). This use case is a very good fit for training that only requires CPU and memory resources - share the CPU and memory with the general-purpose Pods.
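For that test, something as simple as the sketch below could drive QPS against the nginx service so its CPU usage rises and the training job gets squeezed; the service URL is a placeholder and the thread count would be tuned during the experiment:

```python
# Sketch: a crude QPS generator against the nginx service used as "online load".
import threading
import time
import requests

NGINX_URL = "http://nginx.default.svc.cluster.local"  # placeholder service URL

def worker(stop):
    while not stop.is_set():
        try:
            requests.get(NGINX_URL, timeout=1)
        except requests.RequestException:
            pass

stop = threading.Event()
threads = [threading.Thread(target=worker, args=(stop,)) for _ in range(20)]
for t in threads:
    t.start()

time.sleep(300)  # keep the load up for five minutes, then stop
stop.set()
for t in threads:
    t.join()
```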
