Error occurs when setting the same prefix for two SyncBatchNorm #18466

acphile opened this issue Jun 2, 2020 · 4 comments

acphile (Contributor) commented Jun 2, 2020

import mxnet as mx
from mxnet.gluon.utils import split_and_load
ctx_list = [mx.cpu(0)]
x = mx.nd.random.uniform(shape=(5, 5), ctx=mx.cpu(0))
nch = x.shape[1] if x.ndim > 1 else 1
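# First SyncBatchNorm: prefix "bn_", in_channels = 5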
bn2 = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=nch, num_devices=1, prefix="bn_")
bn2.initialize(ctx=ctx_list)
print(bn2.collect_params())
inputs2 = split_and_load(x, ctx_list, batch_axis=0)
for xi in inputs2:
    xi.attach_grad()
with mx.autograd.record():
    output2 = [bn2(xi) for xi in inputs2]
    loss2 = [(output ** 2).sum() for output in output2]
    mx.autograd.backward(loss2)
output2 = mx.nd.concat(*[output.as_in_context(x.context)
                         for output in output2], dim=0)
output2.asnumpy()

x = mx.nd.random.uniform(shape=(5, 10, 20), ctx=mx.cpu(0))
nch = x.shape[1] if x.ndim > 1 else 1
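# Second SyncBatchNorm: reuses the same prefix "bn_", but with in_channels = 10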
bn2 = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=nch, num_devices=1, prefix="bn_")
bn2.initialize(ctx=ctx_list)
print(bn2.collect_params())
inputs2 = split_and_load(x, ctx_list, batch_axis=0)
for xi in inputs2:
    xi.attach_grad()
with mx.autograd.record():
    output2 = [bn2(xi) for xi in inputs2]
    loss2 = [(output ** 2).sum() for output in output2]
    mx.autograd.backward(loss2)

output2 = mx.nd.concat(*[output.as_in_context(x.context)
                         for output in output2], dim=0)
output2.asnumpy()

Running the second block raises the following error:

Traceback (most recent call last):
  File "bn.py", line 36, in <module>
    output2.asnumpy()
  File "/home/ubuntu/raw/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2566, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/raw/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mshadow/./tensor_cpu-inl.h", line 153
Copy: Check failed: _dst.shape_ == _src.shape_ ((5,) vs. (10,)) : shape mismatch:(5,) vs (10,)

I suspect this happens because the two SyncBatchNorm layers end up with the same names (prefix) for their parameters: the second layer has 10 channels but collides with state keyed under the same name from the first layer, which has 5 channels, hence the (5,) vs (10,) shape mismatch.

acphile added the Bug label on Jun 2, 2020
eric-haibin-lin (Member) commented:
Looking at the implementation, it seems that it uses the name (prefix) to decide which tensors to sync: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/contrib/nn/basic_layers.py#L235

It would be better to reflect this in the documentation.
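For context, the relevant call in SyncBatchNorm.hybrid_forward forwards the block's prefix to the backend as the sync key; roughly (a paraphrase of the linked line, not the verbatim source):

def hybrid_forward(self, F, x, gamma, beta, running_mean, running_var):
    # Paraphrase: the block's prefix travels to the backend as the sync `key`,
    # so two layers that share a prefix end up sharing one synchronization group.
    return F.contrib.SyncBatchNorm(x, gamma, beta, running_mean, running_var,
                                   name='fwd', ndev=self._num_devices,
                                   key=self.prefix)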

leezu (Contributor) commented Jun 2, 2020

I don't think this is a documentation issue. There is a bug in the backend implementation: it uses global state where it mustn't use global state at all.

zhanghang1989 (Contributor) commented:
Don't use the same prefix for two BatchNorm layers.
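A minimal sketch of that workaround (the prefixes below are illustrative): give each layer its own prefix so their parameter names, and the backend sync key, don't collide.

# Each layer gets a distinct prefix, so the backend keys no longer clash.
bn_a = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=5, num_devices=1, prefix="bn_a_")
bn_b = mx.gluon.contrib.nn.SyncBatchNorm(in_channels=10, num_devices=1, prefix="bn_b_")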

leezu (Contributor) commented Jun 3, 2020

Thanks @zhanghang1989. Unfortunately, that's not an acceptable workaround for this use case.
