Autoscaling cluster experiment design #413

Closed

typhoonzero opened this issue Oct 19, 2017 · 4 comments
@typhoonzero
Collaborator

typhoonzero commented Oct 19, 2017

Environment requirements:

  • A cluster with Kubernetes 1.6.x installed, preferably with paddlecloud installed as well.
  • At least 4 nodes; each node should have multiple GPUs.
  • The dataset converted to RecordIO format and split into multiple files (see the sketch below).
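A minimal sketch of the dataset preparation step, assuming the paddle.v2 dataset API available at the time (in particular, that paddle.v2.dataset.common.convert(output_path, reader, line_count, name_prefix) shards a reader into multiple RecordIO files of roughly line_count records each):

```python
# Sketch: convert the MNIST training set into sharded RecordIO files so that
# the master can dispatch the shards as tasks to an elastic number of trainers.
# Assumes paddle.v2.dataset.common.convert with the signature
# (output_path, reader, line_count, name_prefix).
import paddle.v2.dataset.mnist as mnist
import paddle.v2.dataset.common as common

common.convert("./mnist_recordio", mnist.train(), 8192, "mnist_train")
```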

Test cases:

  1. Simple:
    • start an mnist training job with 2 pservers, 2 trainers, and 1 master; trainers can scale between 2~100 pods. Trainers should immediately scale up to use the maximum free resources in the cluster.
    • start another deployment simulating online cluster load. Trainers should scale down if not enough resources are left.
  2. Simple with GPU:
    • same as case 1, but the mnist job requests GPU resources.
  3. Many jobs:
    • start 5~10 mnist training jobs with different parallelism settings, each scaling between 2~100 pods. The jobs should scale equally and fairly.
    • add one load to the cluster at a time and observe the down-scaling of the jobs (see the monitoring sketch after this list).
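To record the scaling behavior for cases 1~3, a small poller like the one below could log the number of running trainer pods over time. This is only a sketch, assuming the kubernetes Python client and a hypothetical label paddle-job=mnist on the trainer pods (the actual label depends on how paddlecloud submits the job):

```python
# Sketch: sample the number of running trainer pods every 10 seconds.
# Add one "online load" deployment between samples and watch the count drop.
import time
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

def running_trainers(namespace="default", selector="paddle-job=mnist"):
    # "paddle-job=mnist" is a placeholder label selector for the trainer pods.
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    return sum(1 for p in pods if p.status.phase == "Running")

while True:
    print("%d\t%d" % (int(time.time()), running_trainers()))
    time.sleep(10)
```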
@putcn

putcn commented Oct 19, 2017

👍
Can we also publish a white paper describing this experiment along with the PR article?

@typhoonzero
Collaborator Author

@putcn Sure. Putting this experiment in the PR article is also fine.

@Yancey1989
Collaborator

Yancey1989 commented Oct 19, 2017

Some additional ideas:

  • Computing resources:
    • CPU: 120 cores
    • GPU: 8 GPUs
  • Experiment metrics:
    • Compute utilization; maybe we can use (requests / total) as the target (a rough sketch follows this list).
    • Training time for each job.
  • Additional test case:
    • Inference serving with 5 pods in the cluster.
    • Start a training job with 100 trainers (2 pservers + 1 master).
    • Start another training job with 20 trainers (2 pservers + 1 master).
    • Compare the experiment metrics of the two jobs.
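A rough sketch of the (requests / total) CPU utilization metric suggested above, using the kubernetes Python client; CPU quantities are assumed to be expressed either as whole cores ("4") or millicores ("2500m"):

```python
# Sketch: cluster CPU utilization = sum of container CPU requests of running
# pods / sum of allocatable CPU over all nodes.
from kubernetes import client, config

def cpu_to_millicores(q):
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

config.load_kube_config()
v1 = client.CoreV1Api()

total = sum(cpu_to_millicores(n.status.allocatable["cpu"])
            for n in v1.list_node().items)

requested = 0
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.status.phase != "Running":
        continue
    for c in pod.spec.containers:
        reqs = c.resources.requests if c.resources and c.resources.requests else {}
        if "cpu" in reqs:
            requested += cpu_to_millicores(reqs["cpu"])

print("cluster CPU utilization: %.2f%%" % (100.0 * requested / total))
```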

@helinwang
Collaborator

helinwang commented Oct 19, 2017

Some additional ideas:

We need to test the cluster running Pods other than PaddlePaddle Pods (e.g., nginx, databases), and show that the training job is scaled down when the QPS of the nginx server increases (we will actually measure the CPU limit, but we can plot the QPS). This use case is a very good fit for training that only requires CPU and memory resources - share the CPU and memory with the general-purpose Pods.
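For that test, something as simple as the sketch below could drive QPS against the nginx service so its CPU usage rises and the training job gets squeezed; the service URL is a placeholder and the thread count would be tuned during the experiment:

```python
# Sketch: a crude QPS generator against the nginx service used as "online load".
import threading
import time
import requests

NGINX_URL = "http://nginx.default.svc.cluster.local"  # placeholder service URL

def worker(stop):
    while not stop.is_set():
        try:
            requests.get(NGINX_URL, timeout=1)
        except requests.RequestException:
            pass

stop = threading.Event()
threads = [threading.Thread(target=worker, args=(stop,)) for _ in range(20)]
for t in threads:
    t.start()

time.sleep(300)  # keep the load up for five minutes, then stop
stop.set()
for t in threads:
    t.join()
```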
