
Make the first device share data with the global scope in parallel_do_op. #9398

Merged: 1 commit into PaddlePaddle:develop from qingqing01:parallel_do_op, Mar 27, 2018

Conversation

qingqing01 (Contributor) commented on Mar 27, 2018:

Fix #9386

There is no need to copy data to the first device. Just make the first device share data with the global scope, since they are on the same device.
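A minimal sketch of the difference, assuming the fluid framework APIs of that era (Scope, LoDTensor, ShareDataWith, TensorCopy); the helper names CopyToDevice and ShareWithDevice are illustrative, not functions from the actual parallel_do_op code:

```cpp
// Illustrative sketch only; CopyToDevice/ShareWithDevice are hypothetical
// helper names, not code from parallel_do_op.cc.
#include <string>

#include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/framework/scope.h"
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/platform/place.h"

namespace f = paddle::framework;

// Old behavior: every device, including the first, received a fresh copy,
// so writes in the sub-scope never reached the global scope.
void CopyToDevice(const f::Scope &global, f::Scope *sub_scope,
                  const std::string &name,
                  const paddle::platform::Place &place) {
  auto &src = global.FindVar(name)->Get<f::LoDTensor>();
  auto *dst = sub_scope->Var(name)->GetMutable<f::LoDTensor>();
  f::TensorCopy(src, place, dst);  // separate buffer on the target place
}

// New behavior for the first device: share the buffer instead of copying,
// so an in-place update (e.g. BN's moving mean) is visible globally.
void ShareWithDevice(const f::Scope &global, f::Scope *sub_scope,
                     const std::string &name) {
  auto &src = global.FindVar(name)->Get<f::LoDTensor>();
  auto *dst = sub_scope->Var(name)->GetMutable<f::LoDTensor>();
  dst->ShareDataWith(src);  // same underlying memory, no copy
}
```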

panyx0718 (Contributor) left a comment:


Add some comments to explain the problem?

The global moving_mean and moving_variance are currently not correctly updated with the values calculated in the sub_scopes (unlike trainable parameters). Perhaps ParallelExecutor has a similar problem to solve.
@tonyyang-svail @reyoung

qingqing01 (Contributor, Author) commented on Mar 27, 2018:

> Add some comments to explain the problem?

In #9386, the moving mean/variance in BN are non-trainable parameters. Trainable parameters are updated in the backward pass and copied to each sub-scope in every mini-batch before the forward pass. Unlike trainable parameters, the moving means/variances are not updated in the backward pass, so parallel_do_op still copies the initialized parameters from the global scope.

This fix makes the first device share the parameter memory with the global scope. When the moving mean/variance on the first device are updated, they are also updated in the global scope.

For BN, however, only the moving mean/variance on the first device are saved (see the sketch below). Maybe we can merge them across multiple GPUs and multiple machines in the future.
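To make the resulting control flow concrete, here is a hedged sketch of the per-device dispatch. DistributeParameters, the variable names, and the loop structure are assumptions for illustration, not the merged parallel_do_op code; it reuses the hypothetical helpers from the earlier sketch.

```cpp
// Hypothetical glue (not the merged code): decide per device whether to
// share or copy each parameter, reusing ShareWithDevice/CopyToDevice above.
#include <vector>

void DistributeParameters(const f::Scope &global_scope,
                          const std::vector<f::Scope *> &sub_scopes,
                          const std::vector<paddle::platform::Place> &places,
                          const std::vector<std::string> &param_names) {
  for (size_t i = 0; i < places.size(); ++i) {
    for (const auto &param_name : param_names) {
      if (i == 0) {
        // First device: alias the global tensor so in-place updates to
        // non-trainable state (moving mean/variance) flow back.
        ShareWithDevice(global_scope, sub_scopes[i], param_name);
      } else {
        // Other devices: private copies; their BN statistics are not
        // merged back, hence the caveat above.
        CopyToDevice(global_scope, sub_scopes[i], param_name, places[i]);
      }
    }
  }
}
```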

qingqing01 merged commit 25317bd into PaddlePaddle:develop on Mar 27, 2018.
qingqing01 deleted the parallel_do_op branch on November 14, 2019.