
ResNet-50 is slower on Volta since #8302 #9874

Closed
Caenorst opened this issue Feb 23, 2018 · 12 comments

@Caenorst
Contributor

Caenorst commented Feb 23, 2018

Description

I ran the minimum reproducible example with the setup below at two different versions (before and after #8302). Here are the results:
d03182f (before #8302):
- real data: 5644 samples / s
- synthetic data: 5971 samples / s
c3e3a83 (after #8302):
- real data: 5461 samples / s
- synthetic data: 5740 samples / s
Latest:
- real data: 5425 samples / s
- synthetic data: 5817 samples / s

@ptrendx @DickJC123 @mkolod

Environment info (Required)

CPUs: Intel Xeon E5-2698 v4 (x2)
GPUs: Nvidia V100 (x8)

Build info (Required if built from source)

Starting from the default config.mk (in make/config.mk), I added:

USE_CUDA=1
USE_CUDNN=1
CUDA_ARCH := -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70
USE_CUDA_PATH=/usr/local/cuda
USE_LIBJPEG_TURBO=1
USE_LIBJPEG_TURBO_PATH=/usr
USE_NCCL=1

Minimum reproducible example

python /mxnet/example/image-classification/train_imagenet.py --benchmark 0 --gpu 0,1,2,3,4,5,6,7 --batch-size 1024 --num-epochs 1 --data-train /data/imagenet/train-480-val-256-recordio/train.rec --data-train-idx /data/imagenet/train-480-val-256-recordio/train.idx --data-val /data/imagenet/train-480-val-256-recordio/val.rec --disp-batches 100 --network resnet-v1 --num-layers 50 --data-nthreads 40 --min-random-scale 0.533 --max-random-shear-ratio 0 --max-random-rotate-angle 0 --max-random-h 0 --max-random-l 0 --max-random-s 0 --dtype float16 --kv-store device
@lupesko
Contributor

lupesko commented Feb 26, 2018

@piiswrong @zheng-da - please take a look; this degradation may be related to your commit.

@rahul003
Member

Are the speeds that you mention averages? If so, averaged over how many batches?

@Caenorst
Contributor Author

It's averaged over 1200 batches; I'm ignoring the first 100 batches.
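
For reference, a minimal sketch of one way to compute such an average from the training log (it assumes the standard Speedometer output "Speed: N samples/sec" and --disp-batches 100; adjust the regex and skip count if your log differs):

# Sketch: average the per-interval throughput reported by train_imagenet.py,
# skipping the warm-up batches at the start of the run.
import re
import sys

SKIP_BATCHES = 100      # warm-up batches to ignore
DISP_INTERVAL = 100     # matches --disp-batches 100 in the command above

pattern = re.compile(r"Speed:\s*([\d.]+)\s*samples/sec")
speeds = []
with open(sys.argv[1]) as f:            # path to the captured training log
    for line in f:
        m = pattern.search(line)
        if m:
            speeds.append(float(m.group(1)))

# Each logged value covers DISP_INTERVAL batches; drop the warm-up interval(s).
steady = speeds[SKIP_BATCHES // DISP_INTERVAL:]
print("averaged over %d intervals: %.0f samples/s" % (len(steady), sum(steady) / len(steady)))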

@cjolivier01
Member

@zheng-da

@zheng-da
Contributor

zheng-da commented Mar 2, 2018

I think I know the potential cause of this problem. I'll fix it next week.

@zheng-da
Contributor

zheng-da commented Mar 10, 2018

I searched all commits in PR #8302 and I think I have found the commits that cause the perf issue. However, I have not been able to fix the problem yet. I created a branch that contains the commits: https://github.com/zheng-da/incubator-mxnet/tree/refactor_bn

Basically, the commits that refactor BatchNorm cause the issue.
zheng-da@338dbca
zheng-da@aa5e69e

@Caenorst could you help look into the issue? Thanks
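
To help isolate whether the BatchNorm refactor itself is responsible, one option is a forward-only microbenchmark of the operator run at both commits. A minimal sketch (the shape and iteration counts are assumptions, and it only times the imperative forward pass on a single GPU, so it will not capture backward or multi-GPU effects):

# Rough comparison of BatchNorm forward throughput; run once per commit.
import time
import mxnet as mx

ctx = mx.gpu(0)
N, C, H, W = 128, 256, 56, 56                        # ResNet-50-like activation shape (assumed)
x = mx.nd.random.uniform(shape=(N, C, H, W), ctx=ctx)
gamma = mx.nd.ones((C,), ctx=ctx)
beta = mx.nd.zeros((C,), ctx=ctx)
moving_mean = mx.nd.zeros((C,), ctx=ctx)
moving_var = mx.nd.ones((C,), ctx=ctx)

def run(iters):
    for _ in range(iters):
        mx.nd.BatchNorm(x, gamma, beta, moving_mean, moving_var, fix_gamma=False)
    mx.nd.waitall()                                   # block until the async GPU work finishes

run(50)                                               # warm-up
start = time.time()
run(500)
print("BatchNorm forward: %.3f ms/iter" % ((time.time() - start) / 500 * 1e3))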

@cjolivier01
Member

Is it known which part of the commit is the problem?
Are the performance characteristics of thread_local known for the supported platforms?

@zheng-da
Contributor

@Caenorst can you test it again? I measured the perf on a p3 instance. PR #10116 should improve the perf by about 3% for your test case.

@vandanavk
Contributor

@Caenorst did @zheng-da's PR improve the performance on your setup?

@vrakesh
Contributor

vrakesh commented Nov 27, 2018

@Caenorst Has the performance loss been resolved since @zheng-da's PR? If so, requesting to close the issue.

@kalyc
Contributor

kalyc commented Dec 10, 2018

@lanking520 requesting to close this issue due to lack of activity

@lanking520
Member

@Caenorst Please feel free to reopen this issue if you are still facing this failure. Closing it for now.
