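Background and problem: The image classification, MobileNet-SSD detection, and OCR recognition tasks in models all hit the same problem: when a single-machine multi-GPU model is trained with the ParallelDo Op, the cost on the train set converges, but the evaluation results on the test set are completely wrong.
Cause: In single-machine multi-GPU training, the moving mean/variance in BatchNorm, i.e. the parameters defined at https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/layers/nn.py#L1517, are not updated. As a result, the test set is evaluated with the moving mean/var values from the initial random initialization (or from the loaded initial model).
How to debug: Print the batch_norm_xx.w_1/2 parameters once without ParallelDo and once with ParallelDo, and check whether the moving mean/var change. For example, in my MobileNet-SSD task, the printing code is as follows: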
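To illustrate why stale moving statistics break evaluation while training still converges, here is a minimal NumPy sketch. It assumes the usual exponential-moving-average update for BatchNorm statistics with a momentum of 0.9; the function name `update_moving_stats` and the synthetic data are hypothetical and are not taken from the Paddle source.

```python
import numpy as np


def update_moving_stats(moving_mean, moving_var, batch, momentum=0.9):
    """One exponential-moving-average step (assumed form of the BatchNorm update)."""
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return new_mean, new_var


rng = np.random.RandomState(0)

# Initial values, standing in for the freshly initialized parameters.
init_mean, init_var = np.zeros(4), np.ones(4)
moving_mean, moving_var = init_mean.copy(), init_var.copy()

# Training: each step sees activations with mean 3 and standard deviation 2.
for _ in range(200):
    batch = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
    moving_mean, moving_var = update_moving_stats(moving_mean, moving_var, batch)

# Evaluation: normalize a test batch with stale vs. updated statistics.
eps = 1e-5
test_batch = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
stale = (test_batch - init_mean) / np.sqrt(init_var + eps)
fresh = (test_batch - moving_mean) / np.sqrt(moving_var + eps)

print("stale-stats output mean:", stale.mean())
print("fresh-stats output mean:", fresh.mean())
```

During training, BatchNorm normalizes with the statistics of the current batch, so the loss can converge even if the moving statistics are never written back. At test time the moving statistics are used instead, so if they were never updated the normalized activations are shifted and scaled incorrectly.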
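```python
import numpy as np
import paddle.fluid as fluid


# exe, test_program, feeder, test_reader, map_eval, map, accum_map and loss
# are defined elsewhere in the training script.
def test(pass_id):
    map_eval.reset(exe)
    test_map = None
    for _, data in enumerate(test_reader()):
        m1, test_map, loss_v = exe.run(test_program,
                                       feed=feeder.feed(data),
                                       fetch_list=[map, accum_map, loss])
    print("Test {0}, map {1}, loss {2}".format(pass_id, test_map[0], loss_v[0]))
    # Dump the first 20 values of one BatchNorm moving statistic.
    t = fluid.global_scope().find_var('batch_norm_34.w_1').get_tensor()
    t = np.array(t).astype(np.float32).flatten()
    print('batch_norm_34.w_1', t[0:20])
```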
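Comparing printed values by eye across passes is error-prone, so the check can also be automated. The sketch below is one possible way to do it, assuming snapshots of the parameter are taken as NumPy arrays with the same `find_var(...).get_tensor()` call used in the printing code; the helper `param_changed` and the synthetic arrays are hypothetical.

```python
import numpy as np


def param_changed(before, after, atol=1e-6):
    """Return True if any element differs by more than atol between snapshots."""
    before = np.asarray(before, dtype=np.float32).ravel()
    after = np.asarray(after, dtype=np.float32).ravel()
    if before.shape != after.shape:
        raise ValueError("snapshots have different sizes")
    return bool(np.max(np.abs(after - before)) > atol)


# Synthetic stand-ins for two snapshots of batch_norm_34.w_1.
snapshot_pass0 = np.zeros(8, dtype=np.float32)
snapshot_stuck = snapshot_pass0.copy()   # parameter never updated
snapshot_moved = snapshot_pass0 + 0.05   # parameter updated during training

print("stuck:", param_changed(snapshot_pass0, snapshot_stuck))
print("moved:", param_changed(snapshot_pass0, snapshot_moved))
```

In the real script, one snapshot would be taken before training and another after a pass. If the helper reports no change only when ParallelDo is used, the moving statistics are not being updated in that mode.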