
Fluid distributed training benchmark #7410

Merged
merged 3 commits on Jan 12, 2018

Conversation

Yancey1989 (Contributor)

Fixed #7409

@Yancey1989 Yancey1989 changed the title add cluster training bencharmk design Fluid distributed training benchmark Jan 10, 2018
typhoonzero (Contributor) left a comment:

Should we put this doc into design or a separate repo?

- Docker Image

We use different base Docker images to run the benchmark on Kubernetes:
- PaddlePaddle v2: paddlepaddle/paddle:latest
Contributor:

We should use a static tag, so that when the latest tag updates, this benchmark can still be reproduced.

Yancey1989 (Contributor Author):

Sure, but we don't have a static tag for Fluid distributed training; how about a commit ID?

- TensorFlow: tensorflow/tensorflow:latest
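The tag-pinning suggestion in this thread could look like the following Kubernetes pod spec fragment. This is a hypothetical sketch, not from the benchmark repo: the pod/container names and the `db8cb2a` commit-id tag are placeholders for illustration only.

```yaml
# Hypothetical sketch: pin the benchmark image to an immutable tag so runs
# stay reproducible even after "latest" moves. "db8cb2a" is a placeholder
# commit-id tag, not a real published tag.
apiVersion: v1
kind: Pod
metadata:
  name: fluid-trainer        # illustrative name
spec:
  containers:
  - name: trainer
    # Avoid ":latest"; use a release tag (e.g. 0.11.0) or a commit-id tag.
    image: paddlepaddle/paddle:db8cb2a
```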

- Model
A digit-recognition model and the MNIST dataset are used in this benchmark.
helinwang (Contributor), Jan 11, 2018:

I think this model is too small. Maybe vgg-16 (probably around 500MB) is closer to real usage.

Yancey1989 (Contributor Author):

Done.
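As a rough sanity check on the "around 500MB" figure mentioned above, VGG-16 has on the order of 138 million parameters, which at 4 bytes per float32 weight works out to roughly half a gigabyte. A quick back-of-the-envelope calculation (the parameter count is an approximation, not taken from the benchmark code):

```python
# Rough sanity check for the "vgg-16 is probably around 500MB" estimate.
# VGG-16 has roughly 138 million parameters; at 4 bytes each (float32),
# the weights alone come to roughly half a gigabyte.
PARAMS = 138_000_000   # approximate VGG-16 parameter count
BYTES_PER_PARAM = 4    # float32

size_mb = PARAMS * BYTES_PER_PARAM / (1024 ** 2)
print(f"approx. VGG-16 weight size: {size_mb:.0f} MB")
```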

- PServer count of the training job.

- Invariant
- The number of trainers.
Contributor:

What is the trainer count we plan to try?

Yancey1989 (Contributor Author):

Done.
And @typhoonzero reminded me that we need to measure parallel efficiency by increasing the trainer count.
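Parallel efficiency as discussed here is usually computed as speedup over a single trainer, divided by the trainer count. A minimal sketch (the function name and the example throughput numbers are illustrative, not from the benchmark):

```python
def parallel_efficiency(throughput_1, throughput_n, n_trainers):
    """Parallel efficiency: speedup relative to 1 trainer, over trainer count.

    throughput_1: samples/sec with a single trainer
    throughput_n: samples/sec with n_trainers trainers
    Returns a value in (0, 1]; 1.0 means perfect linear scaling.
    """
    speedup = throughput_n / throughput_1
    return speedup / n_trainers

# e.g. 1 trainer at 100 img/s, 4 trainers at 320 img/s -> 0.8 (80% efficiency)
print(parallel_efficiency(100.0, 320.0, 4))
```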

Yancey1989 (Contributor Author):

From @typhoonzero

Should we put this doc into design or a separate repo?

Maybe not; I saw that https://github.com/dzhwinter/benchmark is working on the Fluid benchmark, and I learned from @dzhwinter that it will be merged into the Paddle repo this week.

- Docker Image

We use different base Docker images to run the benchmark on Kubernetes:
- PaddlePaddle v2: paddlepaddle/paddle:[commit-id]
Contributor:

v2 should use the 0.10.0 tag, and Fluid should use a commit ID.

Yancey1989 (Contributor Author), Jan 11, 2018:

Done. Since 0.10.0 does not support v2 distributed training, we use 0.11.0.

typhoonzero (Contributor) left a comment:

LGTM++

@Yancey1989 Yancey1989 merged commit 5dbd537 into PaddlePaddle:develop Jan 12, 2018
@Yancey1989 Yancey1989 deleted the cluster_benchmark_design branch January 12, 2018 03:25