Error occurs when setting the same prefix for two SyncBatchNorm #18466

acphile opened this issue Jun 2, 2020 · 4 comments

acphile (Contributor) commented Jun 2, 2020

import mxnet as mx
from mxnet.gluon.utils import split_and_load
ctx_list = [mx.cpu(0)]
x = mx.nd.random.uniform(shape=(5, 5), ctx=mx.cpu(0))
nch = x.shape[1] if x.ndim > 1 else 1
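# First SyncBatchNorm: prefix "bn_", in_channels = 5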
bn2 = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=nch, num_devices=1, prefix="bn_")
bn2.initialize(ctx=ctx_list)
print(bn2.collect_params())
inputs2 = split_and_load(x, ctx_list, batch_axis=0)
for xi in inputs2:
    xi.attach_grad()
with mx.autograd.record():
    output2 = [bn2(xi) for xi in inputs2]
    loss2 = [(output ** 2).sum() for output in output2]
    mx.autograd.backward(loss2)
output2 = mx.nd.concat(*[output.as_in_context(x.context)
                         for output in output2], dim=0)
output2.asnumpy()

x = mx.nd.random.uniform(shape=(5, 10, 20), ctx=mx.cpu(0))
nch = x.shape[1] if x.ndim > 1 else 1
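# Second SyncBatchNorm: reuses the same prefix "bn_", but with in_channels = 10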
bn2 = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=nch, num_devices=1, prefix="bn_")
bn2.initialize(ctx=ctx_list)
print(bn2.collect_params())
inputs2 = split_and_load(x, ctx_list, batch_axis=0)
for xi in inputs2:
    xi.attach_grad()
with mx.autograd.record():
    output2 = [bn2(xi) for xi in inputs2]
    loss2 = [(output ** 2).sum() for output in output2]
    mx.autograd.backward(loss2)

output2 = mx.nd.concat(*[output.as_in_context(x.context)
                         for output in output2], dim=0)
output2.asnumpy()

Running the second block raises the following error:

Traceback (most recent call last):
  File "bn.py", line 36, in <module>
    output2.asnumpy()
  File "/home/ubuntu/raw/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2566, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/raw/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mshadow/./tensor_cpu-inl.h", line 153
Copy: Check failed: _dst.shape_ == _src.shape_ ((5,) vs. (10,)) : shape mismatch:(5,) vs (10,)

I suspect this happens because the two SyncBatchNorm layers end up with the same names (prefix) for their parameters: the second layer has 10 channels but collides with state keyed under the same name from the first layer, which has 5 channels, hence the (5,) vs (10,) shape mismatch.

acphile added the Bug label on Jun 2, 2020
eric-haibin-lin (Member) commented:
Looking at the implementation, it seems that it uses the name (prefix) to decide which tensors to sync: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/contrib/nn/basic_layers.py#L235

It would be better to reflect this in the documentation.
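For context, the relevant call in SyncBatchNorm.hybrid_forward forwards the block's prefix to the backend as the sync key; roughly (a paraphrase of the linked line, not the verbatim source):

def hybrid_forward(self, F, x, gamma, beta, running_mean, running_var):
    # Paraphrase: the block's prefix travels to the backend as the sync `key`,
    # so two layers that share a prefix end up sharing one synchronization group.
    return F.contrib.SyncBatchNorm(x, gamma, beta, running_mean, running_var,
                                   name='fwd', ndev=self._num_devices,
                                   key=self.prefix)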

leezu (Contributor) commented Jun 2, 2020

I don't think this is a documentation issue. There is a bug in the backend implementation: it uses global state where it mustn't use global state at all.

zhanghang1989 (Contributor) commented:
Don't use the same prefix for two BatchNorm layers.
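A minimal sketch of that workaround (the prefixes below are illustrative): give each layer its own prefix so their parameter names, and the backend sync key, don't collide.

# Each layer gets a distinct prefix, so the backend keys no longer clash.
bn_a = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=5, num_devices=1, prefix="bn_a_")
bn_b = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=10, num_devices=1, prefix="bn_b_")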

leezu (Contributor) commented Jun 3, 2020

Thanks @zhanghang1989. Unfortunately, that's not an acceptable workaround for this use case.
