Fluid distributed training benchmark #7410
Conversation
Should we put this doc into the design folder or a separate repo?
benchmark/cluster/README.md
- Docker Image
  We use different base Docker Image to run the benchmark on Kubernetes:
  - PaddlePaddle v2: paddlepaddle/paddle:latest
We should use a static tag, so that when the `latest` tag is updated, this benchmark can still be reproduced.
Sure, but we don't have a static tag for Fluid distributed training. How about a commit ID?
benchmark/cluster/README.md
  - TensorFlow: tensorflow/tensorflow:latest

- Model
  A digits recognize model and MNIST dataset is used in this benchmark.
I think this model is too small. Maybe VGG-16 (probably around 500 MB) is closer to the real usage.
Done.
benchmark/cluster/README.md
  - PServer count of the training job.

- Invariant
  - The number of trainers.
What is the trainer count we plan to try?
Done.
And @typhoonzero reminded me that we need to measure the parallel efficiency by increasing the trainer count.
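As a sketch of the parallel-efficiency measurement mentioned above: one common definition is speedup over `n` trainers divided by `n`. The function name and the timings below are hypothetical placeholders for illustration, not measured benchmark results.

```python
def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    """Parallel efficiency = speedup / trainer count.

    t1: wall-clock time (e.g. seconds per pass) with 1 trainer.
    tn: wall-clock time with n trainers.
    n:  number of trainers.
    """
    speedup = t1 / tn
    return speedup / n

# Hypothetical timings: 100 s/pass with 1 trainer, 15 s/pass with 8 trainers.
eff = parallel_efficiency(100.0, 15.0, 8)  # speedup ~6.67, efficiency ~0.83
```

An efficiency close to 1.0 means near-linear scaling; reporting it for each trainer count makes the scaling behavior of v2, Fluid, and TensorFlow directly comparable.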
From @typhoonzero:
Maybe not. I saw that https://github.com/dzhwinter/benchmark is working on the Fluid benchmark, and I learned from @dzhwinter that it will be merged into the Paddle repo this week.
benchmark/cluster/README.md
- Docker Image
  We use different base Docker Image to run the benchmark on Kubernetes:
  - PaddlePaddle v2: paddlepaddle/paddle:[commit-id]
v2 should use the 0.10.0 tag, and Fluid should use a commit ID.
Done. Since 0.10.0 does not support v2 distributed training, we use 0.11.0.
LGTM++
Fixed #7409