Fluid distributed training benchmark #7410
Conversation
Should we put this doc into the design folder or a separate repo?
benchmark/cluster/README.md
- Docker Image
  We use different base Docker Image to run the benchmark on Kubernetes:
  - PaddlePaddle v2: paddlepaddle/paddle:latest
We should use a static tag, so that when the `latest` tag is updated, this benchmark can still be reproduced.
Sure, but we don't have a static tag for Fluid distributed training. How about a commit ID?
benchmark/cluster/README.md
  - TensorFlow: tensorflow/tensorflow:latest

- Model
  A digits recognize model and MNIST dataset is used in this benchmark.
I think this model is too small. Maybe VGG-16 (probably around 500 MB) is closer to the real usage.
Done.
benchmark/cluster/README.md
  - PServer count of the training job.

- Invariant
  - The number of trainers.
What is the trainer count we plan to try?
Done.
And @typhoonzero reminded me that we need to measure the parallel efficiency by increasing the trainer count.
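As a sketch of the parallel-efficiency measurement mentioned above: one common definition is speedup over `n` trainers divided by `n`. The function name and the timings below are hypothetical placeholders for illustration, not measured benchmark results.

```python
def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    """Parallel efficiency = speedup / trainer count.

    t1: wall-clock time (e.g. seconds per pass) with 1 trainer.
    tn: wall-clock time with n trainers.
    n:  number of trainers.
    """
    speedup = t1 / tn
    return speedup / n

# Hypothetical timings: 100 s/pass with 1 trainer, 15 s/pass with 8 trainers.
eff = parallel_efficiency(100.0, 15.0, 8)  # speedup ~6.67, efficiency ~0.83
```

An efficiency close to 1.0 means near-linear scaling; reporting it for each trainer count makes the scaling behavior of v2, Fluid, and TensorFlow directly comparable.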
From @typhoonzero:
Maybe not. I saw that https://github.com/dzhwinter/benchmark is working on the Fluid benchmark, and I learned from @dzhwinter that it will be merged into the Paddle repo this week.
benchmark/cluster/README.md
- Docker Image
  We use different base Docker Image to run the benchmark on Kubernetes:
  - PaddlePaddle v2: paddlepaddle/paddle:[commit-id]
v2 should use the 0.10.0 tag, and Fluid should use a commit ID.
Done. Since 0.10.0 does not support v2 distributed training, we use 0.11.0.
LGTM++
Fixed #7409