SE-ResNeXt Optimization #8990

Closed
jacquesqiao opened this issue Mar 12, 2018 · 3 comments

jacquesqiao commented Mar 12, 2018

Background

project: https://github.com/PaddlePaddle/Paddle/projects/55
Profiling script:

Optimization methods and results

  1. Delete unused GPU memory during training.
  2. Remove program.clone in Executor (25% speedup). [Speed] speed up python executor in fluid #8729
  3. Initialize NCCL once instead of at every step (5%~6% speedup). [Speed] Avoid init_nccl for every steps. #8758
  4. Use constant folding at compile time to reduce the number of elementwise_mul op calls made by the optimizer at run time (5%~10% speedup); a sketch of this idea follows the list. Optimize optimizer learning rate #8873
  5. Optimize elementwise-related ops: use our own implementations instead of depending on Eigen (10x speedup for a single op). [Speed] Optimize elementwise_mul_op gradient functor #8811
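
Not from the issue itself, but a minimal Python sketch of the constant-folding idea in item 4 (the function names are illustrative, not Fluid APIs): instead of issuing one elementwise multiply per constant factor at run time, the constant factors are multiplied together once at compile time, so each step performs a single elementwise multiply.

```python
# Hypothetical sketch, not the Fluid implementation: fold a chain of constant
# scalar factors (learning rate, decay, loss scaling, ...) into one constant
# before the training loop, so only one elementwise multiply runs per step.
import numpy as np

def fold_constants(factors):
    """Collapse constant scalar factors into a single constant at 'compile time'."""
    folded = 1.0
    for f in factors:
        folded *= f
    return folded

def update_unfolded(param, grad, factors):
    # One elementwise multiply per factor: len(factors) kernel launches.
    for f in factors:
        grad = grad * f
    return param - grad

def update_folded(param, grad, folded):
    # A single elementwise multiply, then the subtraction.
    return param - grad * folded

param = np.random.rand(1024).astype(np.float32)
grad = np.random.rand(1024).astype(np.float32)
factors = [0.1, 0.99, 1.0 / 256]          # e.g. lr, lr decay, loss scale

folded = fold_constants(factors)          # done once, at compile time
np.testing.assert_allclose(
    update_unfolded(param, grad, factors),
    update_folded(param, grad, folded),
    rtol=1e-5,
)
```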

Status

  1. Multi-card training has not been fully tested.
  2. The acceleration ratio for multi-card training still needs to be profiled.

Plan

Give a full profile after all the optimizations are merged (@chengduoZH).

chengduoZH commented:

Many small op kernels, such as sgd_op, may also need to be optimized. For example, if the model has 1000 parameters, sgd_op will be called 1000 times per iteration, which is very time-consuming.

There are two strategies: one is to analyze the dependencies between operations and insert the sgd_ops into the backward pass; the other is to replace the individual sgd_ops with a single sgd_group_op.

Issue #8941 shows the result of the second strategy (using sgd_group_op).
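
This is not part of the original comment, just a small NumPy sketch of the grouping idea (sgd_group here is illustrative, not the actual sgd_group_op): the per-parameter dispatch overhead of many tiny sgd_op calls is replaced by one fused update over a flattened buffer.

```python
# Illustrative sketch only: compare one SGD update per parameter with a single
# grouped update over a flattened buffer (what an sgd_group-style op would do).
import numpy as np

def sgd_per_param(params, grads, lr):
    # One "op" per parameter: with 1000 parameters this is 1000 small kernel
    # launches, each paying its own dispatch overhead.
    return [p - lr * g for p, g in zip(params, grads)]

def sgd_group(params, grads, lr):
    # Hypothetical grouped update: flatten all parameters into one buffer,
    # apply a single elementwise update, then split back into the originals.
    sizes = [p.size for p in params]
    flat_p = np.concatenate([p.ravel() for p in params])
    flat_g = np.concatenate([g.ravel() for g in grads])
    flat_p -= lr * flat_g
    out, offset = [], 0
    for p, n in zip(params, sizes):
        out.append(flat_p[offset:offset + n].reshape(p.shape))
        offset += n
    return out

params = [np.random.rand(256).astype(np.float32) for _ in range(1000)]
grads = [np.random.rand(256).astype(np.float32) for _ in range(1000)]
a = sgd_per_param(params, grads, lr=0.01)
b = sgd_group(params, grads, lr=0.01)
np.testing.assert_allclose(a[0], b[0], rtol=1e-5)
```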

chengduoZH commented Mar 13, 2018

Config and Env:

  • Input: 3 x 224 x 224
  • batch_size: 25
  • CentOS 6.3, Tesla P40, single card.

The comparison results before optimization:

|                | Speed          | Memory   |
|----------------|----------------|----------|
| Fluid (before) | 1.95 sec/iter  | 18341 MB |
| PyTorch        | 1.154 sec/iter | 13359 MB |
| Fluid/PyTorch  | 1.68           | 1.3729   |

After optimizing the speed:

|                    | Speed          | Memory   |
|--------------------|----------------|----------|
| Fluid (opti_speed) | 1.45 sec/iter  | 17222 MB |
| PyTorch            | 1.154 sec/iter | 13359 MB |
| Fluid/PyTorch      | 1.256499133    | 1.289168351 |

After optimizing the memory usage:

|                  | Speed          | Memory   |
|------------------|----------------|----------|
| Fluid (opti_mem) | 1.93 sec/iter  | 14388 MB |
| PyTorch          | 1.154 sec/iter | 13359 MB |
| Fluid/PyTorch    | 1.672443674    | 1.077026724 |

QiJune commented Mar 13, 2018

Now, if we choose the release-memory policy, memory occupation is almost the same as PyTorch's.

However, the delete_var operator synchronizes the CUDA stream before releasing unused memory, which reduces computation performance.

We have to implement an AsyncExecutor to run operators in parallel; that will ultimately solve this problem.
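
Not from the comment itself, just a conceptual Python sketch of how a deferred release could avoid the blocking synchronization (all names here are hypothetical, not the actual delete_var or AsyncExecutor implementation): record a marker after the last op that uses a variable and free its memory lazily once that marker has completed, instead of synchronizing the stream on every free.

```python
# Conceptual sketch only: a deferred-free queue that releases memory without
# blocking the compute stream. Integer markers stand in for CUDA events.
from collections import deque

class DeferredFreeQueue:
    def __init__(self):
        self.pending = deque()   # (marker_id, var_name) pairs awaiting release

    def schedule_free(self, marker_id, var_name):
        # Called where a blocking delete would otherwise run; does not block.
        self.pending.append((marker_id, var_name))

    def poll(self, completed_marker_id, release_fn):
        # Called periodically (e.g. between iterations): free every variable
        # whose last-use marker has already completed on the device.
        while self.pending and self.pending[0][0] <= completed_marker_id:
            _, var_name = self.pending.popleft()
            release_fn(var_name)

# Usage: variable "fc_0.tmp" is last used by op 3; by the time op 5 has
# completed, its memory can be returned without any stream synchronization.
q = DeferredFreeQueue()
q.schedule_free(marker_id=3, var_name="fc_0.tmp")
q.poll(completed_marker_id=5, release_fn=lambda name: print("freed", name))
```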
