From 4183a124a4cc683127c2e613287aad8ec6bde13e Mon Sep 17 00:00:00 2001
From: Yancey1989
Date: Wed, 10 Jan 2018 19:47:48 +0800
Subject: [PATCH 1/3] add cluster training bencharmk design

---
 benchmark/cluster/README.md | 54 +++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 benchmark/cluster/README.md

diff --git a/benchmark/cluster/README.md b/benchmark/cluster/README.md
new file mode 100644
index 0000000000000..d2c68b6ada80a
--- /dev/null
+++ b/benchmark/cluster/README.md
@@ -0,0 +1,54 @@
+# Cluster Training Benchmark
+
+## Setup
+
+- Platform
+  - Kubernetes: v1.6.2
+  - Linux Kernel: v3.10.0
+
+- Resource
+  - CPU: 10 Cores per Pod
+  - Memory: 5GB per Pod
+
+- Docker Image
+
+  We use different base Docker Image to run the benchmark on Kubernetes:
+  - PaddlePaddle v2: paddlepaddle/paddle:latest
+  - PaddlePaddle Fluid: paddlepaddle/paddle:latest
+  - TensorFlow: tensorflow/tensorflow:latest
+
+- Model
+  A digits recognize model and MNIST dataset is used in this benchmark.
+
+## Compare the Performance
+
+- Variable
+  - Batch Size of training data.
+  - PServer count of the training job.
+
+- Invariant
+  - The number of trainers.
+  - The resource of trainer/pserver Pod.
+
+- Metrics
+  - We use `batch/sec` to measure the training performance.
+
+### BatchSize
+
+| BatchSize | 64 | 128 | 256 | 512 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - |
+| TensorFlow | - | - | - | - |
+
+### PServer Count
+
+| PServer Count | 10 | 20 | 40 | 80 |
+| -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - |
+| TensorFlow | - | - | - | - |
+
+## Reproduce the benchmark
+
+TODO

From 97e480aa10118cc46d138c3956a8bda647424588 Mon Sep 17 00:00:00 2001
From: Yancey1989
Date: Thu, 11 Jan 2018 17:30:05 +0800
Subject: [PATCH 2/3] update by comment

---
 benchmark/cluster/README.md | 48 +++++++++++++++++++++++++++----------
 1 file changed, 36 insertions(+), 12 deletions(-)

diff --git a/benchmark/cluster/README.md b/benchmark/cluster/README.md
index d2c68b6ada80a..674e04df85b79 100644
--- a/benchmark/cluster/README.md
+++ b/benchmark/cluster/README.md
@@ -13,42 +13,66 @@
 - Docker Image
 
   We use different base Docker Image to run the benchmark on Kubernetes:
-  - PaddlePaddle v2: paddlepaddle/paddle:latest
-  - PaddlePaddle Fluid: paddlepaddle/paddle:latest
-  - TensorFlow: tensorflow/tensorflow:latest
+  - PaddlePaddle v2: paddlepaddle/paddle:[commit-id]
+  - PaddlePaddle Fluid: paddlepaddle/paddle:0.10.0
+  - TensorFlow: tensorflow/tensorflow:1.5.0-rc0
 
 - Model
-  A digits recognize model and MNIST dataset is used in this benchmark.
+  vgg16 is used in this benchmark.
 
-## Compare the Performance
+## Cases
 
 - Variable
   - Batch Size of training data.
   - PServer count of the training job.
+  - The number of trainers.
 
 - Invariant
-  - The number of trainers.
   - The resource of trainer/pserver Pod.
 
-- Metrics
-  - We use `batch/sec` to measure the training performance.
+### Measure the Performance for Different Batch Size
 
-### BatchSize
+- PServer Count: 40
+- Trainer Count: 100
+- Metrics: mini-batch / sec
 
-| BatchSize | 64 | 128 | 256 | 512 |
+| Batch Size | 32 | 64 | 128 | 256 |
 | -- | -- | -- | -- | -- |
 | PaddlePaddle Fluid | - | - | - | - |
 | PaddlePaddle v2 | - | - | - | - |
 | TensorFlow | - | - | - | - |
 
-### PServer Count
+### Measure the Performance for Different PServer Count
+
+- Trainer Count: 100
+- Batch Size: 64
+- Metrics: mini-batch / sec
 
-| PServer Count | 10 | 20 | 40 | 80 |
+| PServer Count | 10 | 20 | 40 | 60 |
 | -- | -- | -- | -- | -- |
 | PaddlePaddle Fluid | - | - | - | - |
 | PaddlePaddle v2 | - | - | - | - |
 | TensorFlow | - | - | - | - |
 
+### Measure Parallel Efficiency By Increasing Trainer Count
+
+- PServer Count: 20
+- Batch Size: 64
+- Metrics:
+
+$S = \frac{T_1}{T_N}$
+
+where $S$ is the ratio of $T_1$ to $T_N$, the training times with 1 and $N$ trainers.
+The parallel efficiency is:
+
+$E = \frac{S}{N}$
+
+| Trainer Count | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
+| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
+| PaddlePaddle Fluid | - | - | - | - | - | - | - | - | - | - | - |
+| PaddlePaddle v2 | - | - | - | - | - | - | - | - | - | - | - |
+| TensorFlow | - | - | - | - | - | - | - | - | - | - | - |
+
 ## Reproduce the benchmark
 
 TODO

From c86e744e9db36cccf89cb09922529e88e6e25fed Mon Sep 17 00:00:00 2001
From: Yancey1989
Date: Thu, 11 Jan 2018 18:53:10 +0800
Subject: [PATCH 3/3] update by comment

---
 benchmark/cluster/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/benchmark/cluster/README.md b/benchmark/cluster/README.md
index 674e04df85b79..b619613ea7a5b 100644
--- a/benchmark/cluster/README.md
+++ b/benchmark/cluster/README.md
@@ -13,8 +13,8 @@
 - Docker Image
 
   We use different base Docker Image to run the benchmark on Kubernetes:
-  - PaddlePaddle v2: paddlepaddle/paddle:[commit-id]
-  - PaddlePaddle Fluid: paddlepaddle/paddle:0.10.0
+  - PaddlePaddle v2: paddlepaddle/paddle:0.11.0
+  - PaddlePaddle Fluid: paddlepaddle/paddle:[commit-id]
   - TensorFlow: tensorflow/tensorflow:1.5.0-rc0
 
 - Model
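
Note on the parallel-efficiency metric introduced in PATCH 2/3: the README defines the speedup S as the ratio of the single-trainer training time T1 to the N-trainer training time TN, and the efficiency E as S divided by N. A minimal sketch of that arithmetic follows, in case it helps when the result tables are filled in. The function names and the timing values are hypothetical and are not part of the patch or of any PaddlePaddle or TensorFlow API.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup S = T1 / TN, where T1 and TN are training times in the same unit."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("training times must be positive")
    return t1 / tn


def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    """Parallel efficiency E = S / N for N trainers."""
    if n < 1:
        raise ValueError("trainer count must be at least 1")
    return speedup(t1, tn) / n


if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by trainer count; not measured data.
    timings = {1: 1000.0, 10: 125.0, 20: 80.0}
    t1 = timings[1]
    for n, tn in timings.items():
        s = speedup(t1, tn)
        e = parallel_efficiency(t1, tn, n)
        print(f"trainers={n:3d}  speedup={s:6.2f}  efficiency={e:5.2f}")
```

An efficiency of 1.0 would mean perfectly linear scaling; values below 1.0 show how much of the added trainer capacity is lost to communication and synchronization overhead.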