Fluid distributed training benchmark #7410

Merged (3 commits, Jan 12, 2018)

Changes from 1 commit
54 changes: 54 additions & 0 deletions benchmark/cluster/README.md
# Cluster Training Benchmark

## Setup

- Platform
  - Kubernetes: v1.6.2
  - Linux Kernel: v3.10.0

- Resource
  - CPU: 10 Cores per Pod
  - Memory: 5GB per Pod

- Docker Image

  We use different base Docker images to run the benchmark on Kubernetes (a Pod spec sketch follows at the end of this section):
  - PaddlePaddle v2: paddlepaddle/paddle:latest
Contributor:
We should use a static tag, so that when the latest tag updates, this benchmark can still be reproduced.

Contributor Author:
Sure, but we don't have a static tag for Fluid distributed training; how about using a commit ID?

  - PaddlePaddle Fluid: paddlepaddle/paddle:latest
  - TensorFlow: tensorflow/tensorflow:latest

- Model
  A digit recognition model and the MNIST dataset are used in this benchmark.
Contributor (@helinwang, Jan 11, 2018):
I think this model is too small. Maybe vgg-16 (probably around 500MB) is closer to real usage.

Contributor Author:
Done.
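
The per-Pod resources and base images above can be expressed as a Kubernetes Pod spec. Below is a minimal sketch (not part of this PR), assuming the official `kubernetes` Python client and an accessible kubeconfig; the Pod/container names and the `cluster-benchmark` label are hypothetical.

```python
# Minimal sketch: declare one trainer Pod with the resources listed above.
# Assumes the official `kubernetes` Python client; names/labels are hypothetical.
from kubernetes import client, config

def make_trainer_pod(name="trainer-0", image="paddlepaddle/paddle:latest"):
    """Build a Pod matching the per-Pod resources above: 10 CPU cores, 5GB memory."""
    resources = client.V1ResourceRequirements(
        limits={"cpu": "10", "memory": "5Gi"},    # roughly the 5GB per Pod listed above
        requests={"cpu": "10", "memory": "5Gi"},
    )
    container = client.V1Container(name="trainer", image=image, resources=resources)
    return client.V1Pod(
        api_version="v1",
        kind="Pod",
        metadata=client.V1ObjectMeta(name=name, labels={"app": "cluster-benchmark"}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

if __name__ == "__main__":
    config.load_kube_config()                     # use the local kubeconfig
    api = client.CoreV1Api()
    api.create_namespaced_pod(namespace="default", body=make_trainer_pod())
```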


## Compare the Performance

- Variable
  - Batch Size of training data.
  - PServer count of the training job.

- Invariant
  - The number of trainers.
Contributor:
What is the trainer count we plan to try?

Contributor Author:
Done.
And @typhoonzero reminded me that we need to measure parallel efficiency by increasing the trainer count.

  - The resources of the trainer/pserver Pods.

- Metrics
  - We use `batch/sec` to measure the training performance (see the measurement sketch after this list).
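
For reference, here is a minimal sketch of how `batch/sec` could be measured around a generic training loop; `train_one_batch` and `reader` are hypothetical placeholders rather than APIs of any of the frameworks above.

```python
import time

def measure_batches_per_sec(train_one_batch, reader, warmup=10, measured=100):
    """Time `measured` batches after `warmup` batches and return throughput.

    `train_one_batch(batch)` and `reader` (an iterable of batches) are
    hypothetical placeholders for the framework-specific training step
    and data pipeline.
    """
    it = iter(reader)
    for _ in range(warmup):            # skip start-up cost (graph building, caches, ...)
        train_one_batch(next(it))
    start = time.time()
    for _ in range(measured):
        train_one_batch(next(it))
    elapsed = time.time() - start
    return measured / elapsed          # batches per second
```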

### BatchSize

| BatchSize | 64 | 128 | 256 | 512 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |

### PServer Count

| PServer Count | 10 | 20 | 40 | 80 |
| -- | -- | -- | -- | -- |
| PaddlePaddle Fluid | - | - | - | - |
| PaddlePaddle v2 | - | - | - | - |
| TensorFlow | - | - | - | - |
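
The tables above could be filled in by a small sweep driver along the lines of the sketch below; `run_benchmark(framework, batch_size, pserver_count)` is a hypothetical helper that launches one training job on the cluster and returns the measured `batch/sec`.

```python
# Sketch of a sweep driver that prints Markdown rows for the tables above.
# `run_benchmark` is a hypothetical helper, not an existing API.
BATCH_SIZES = [64, 128, 256, 512]
PSERVER_COUNTS = [10, 20, 40, 80]
FRAMEWORKS = ["PaddlePaddle Fluid", "PaddlePaddle v2", "TensorFlow"]

def markdown_row(label, values):
    """Format one row in the same layout as the tables above."""
    return "| " + " | ".join([label] + [f"{v:.2f}" for v in values]) + " |"

def sweep_batch_sizes(run_benchmark, pserver_count=10):
    """Measure batch/sec for each framework over all batch sizes."""
    for fw in FRAMEWORKS:
        results = [run_benchmark(fw, bs, pserver_count) for bs in BATCH_SIZES]
        print(markdown_row(fw, results))
```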

## Reproduce the Benchmark

TODO