
The moving mean and variance are not updated in BatchNorm when using ParallelDo Op. #9386

Closed
qingqing01 opened this issue Mar 26, 2018 · 0 comments

qingqing01 (Contributor) commented Mar 26, 2018

  • Background and problem:
    Several tasks in the models repo (image classification, MobileNet-SSD detection, and OCR recognition) hit the same problem: when training a single-machine multi-GPU model with the ParallelDo Op, the cost on the train set converges, but the evaluation results on the test set are completely wrong.

  • Cause:
    In single-machine multi-GPU training, the moving mean/variance inside BatchNorm, that is, the parameters created at https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/layers/nn.py#L1517, are never updated. The test pass therefore keeps using the moving mean/var from the initial random initialization (or from the loaded pretrained model). A minimal reproduction sketch is given at the end of this issue.

  • How to debug:
    Print the batch_norm_xx.w_1/2 parameters both without ParallelDo and with ParallelDo, and check whether the moving mean/var change between passes. For example, in my MobileNet-SSD task the printing code is as follows (a before/after snapshot variant is sketched right after this snippet):

    import numpy as np
    import paddle.fluid as fluid

    def test(pass_id):
        map_eval.reset(exe)
        test_map = None
        for _, data in enumerate(test_reader()):
            # Fetch the per-batch mAP, the accumulated mAP and the loss.
            m1, test_map, loss_v = exe.run(test_program,
                                           feed=feeder.feed(data),
                                           fetch_list=[map, accum_map, loss])
        print("Test {0}, map {1}, loss {2}".format(pass_id, test_map[0], loss_v[0]))
        # Read the moving mean of one BatchNorm layer out of the global scope
        # and print its first 20 elements; if the bug is present, these values
        # never change from one pass to the next.
        t = fluid.global_scope().find_var('batch_norm_34.w_1').get_tensor()
        t = np.array(t).astype(np.float32).flatten()
        print('batch_norm_34.w_1', t[0:20])
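
To make the comparison more direct, the same scope lookup can be wrapped into a before/after snapshot around a single training iteration. This is a minimal sketch built from the calls in the snippet above; exe, train_program, feeder, and train_data are assumed to exist in the surrounding training script, and batch_norm_34 is the layer name from the MobileNet-SSD model, so it will differ in other networks.

    import numpy as np
    import paddle.fluid as fluid

    def snapshot(var_name):
        # Read the current value of a variable out of the global scope.
        t = fluid.global_scope().find_var(var_name).get_tensor()
        return np.array(t).astype(np.float32).flatten()

    # Moving mean (w_1) and moving variance (w_2) of one BatchNorm layer.
    before_mean = snapshot('batch_norm_34.w_1')
    before_var = snapshot('batch_norm_34.w_2')

    # One training iteration; train_program, feeder and train_data are
    # assumed to be set up by the surrounding training script.
    exe.run(train_program, feed=feeder.feed(train_data))

    after_mean = snapshot('batch_norm_34.w_1')
    after_var = snapshot('batch_norm_34.w_2')

    # With ParallelDo both deltas stay at 0.0, i.e. the moving statistics
    # never move; without ParallelDo they change after every iteration.
    print('moving mean delta:    ', np.abs(after_mean - before_mean).max())
    print('moving variance delta:', np.abs(after_var - before_var).max())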
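
For completeness, below is a rough sketch of the kind of setup that triggers the bug: a tiny network with one batch_norm layer replicated across devices via ParallelDo. It follows the ParallelDo pattern from the Fluid unit tests of that era; it is an illustrative reconstruction, not code from this issue, and the layer name batch_norm_0 as well as exact signatures are assumptions that may differ between Fluid versions.

    import paddle.fluid as fluid

    img = fluid.layers.data(name='img', shape=[3, 32, 32], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    # Replicate the sub-network across the visible devices.
    places = fluid.layers.get_places(device_count=2)
    pd = fluid.layers.ParallelDo(places)
    with pd.do():
        img_ = pd.read_input(img)
        label_ = pd.read_input(label)
        conv = fluid.layers.conv2d(input=img_, num_filters=16, filter_size=3)
        # The moving mean/variance of this layer are expected to live in
        # batch_norm_0.w_1 / batch_norm_0.w_2 (the name is an assumption).
        bn = fluid.layers.batch_norm(input=conv)
        pred = fluid.layers.fc(input=bn, size=10, act='softmax')
        loss = fluid.layers.cross_entropy(input=pred, label=label_)
        pd.write_output(fluid.layers.mean(x=loss))
    avg_loss = fluid.layers.mean(x=pd())

    fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_loss)

Training this program and applying the snapshot check above to batch_norm_0.w_1/w_2 should show the loss decreasing while the moving statistics stay frozen, which matches the symptom described under Cause.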