This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

fixing batch_norm and layer_norm for large tensor nightly test #17805

Merged: 1 commit merged into apache:master on Mar 16, 2020

Conversation

access2rohit (Contributor) commented on Mar 10, 2020

Description

Enables large tensor support for the following ops:

  1. batch_norm
  2. layer_norm

Fixes the nightly large tensor failure. A stricter input size check was recently added to layer_norm in PR #17683. batch_norm does not have that check yet, so it is not failing at the moment, but its shape assignment is also incorrect, as shown in the GDB logs below.
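For reference, a minimal repro sketch of what the failing nightly test exercises (my own illustration, not code from this PR; it assumes MXNet is built with large tensor support, e.g. USE_INT64_TENSOR_SIZE=1, and a machine with enough memory for 4.3-billion-element vectors):

from mxnet import nd

LARGE = 4300000000  # > 2**32 - 1, matching the 4300000000-element shape in the GDB logs below

data = nd.ones((LARGE,))
gamma = nd.ones((LARGE,))
beta = nd.zeros((LARGE,))
mean = nd.zeros((LARGE,))
var = nd.ones((LARGE,))

# Shape inference for both ops reads the channel count out of the input shape;
# before this fix it was stored in a 32-bit int and silently truncated.
bn_out = nd.BatchNorm(data, gamma, beta, mean, var, axis=0)
ln_out = nd.LayerNorm(data, gamma, beta, axis=0)
assert bn_out.shape == data.shape
assert ln_out.shape == data.shape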

Please look at the lines marked by arrows in the GDB logs below.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Proof Of Correctness

batch_norm()

Before changes:

333	  const int channelCount = dshape[channelAxis];  <========
(gdb) info local
param = @0x555555cdb770: {<dmlc::Parameter<mxnet::op::BatchNormParam>> = {<No data fields>}, eps = 0.0010000000474974513, momentum = 0.899999976, fix_gamma = true, use_global_stats = false, output_mean_var = false, axis = 0,
  cudnn_off = false, min_calib_range = {is_none = true, val = {__data = "\000\000\000", __align = {<No data fields>}}}, max_calib_range = {is_none = true, val = {__data = "UU\000", __align = {<No data fields>}}}}
dshape = @0x5555572290a0: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 1, 4300000000, 0}, data_heap_ = 0x0}, <No data fields>}
channelAxis = 0
channelCount = 21845 <--------
(gdb) p dshape[channelAxis]
$1 = (long &) @0x5555572290a8: 4300000000 <--------
(gdb) n
335	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) p channelCount
$2 = 5032704
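
(Why 5032704: the dimension 4300000000 is larger than 2^32 - 1 = 4294967295, so on this platform assigning it to a 32-bit int keeps only the low 32 bits, and 4300000000 - 4294967296 = 5032704, which is exactly the garbage channelCount printed above.)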

After Changes:

Thread 1 "python3" hit Breakpoint 1, mxnet::op::BatchNormShape (attrs=..., in_shape=0x555556579d98, out_shape=0x555556579db0) at src/operator/nn/batch_norm.cc:333
333	  const index_t channelCount = dshape[channelAxis]; <========
(gdb) n
335	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) info local
param = @0x555555cdb770: {<dmlc::Parameter<mxnet::op::BatchNormParam>> = {<No data fields>}, eps = 0.0010000000474974513, momentum = 0.899999976, fix_gamma = true, use_global_stats = false, output_mean_var = false, axis = 0,
  cudnn_off = false, min_calib_range = {is_none = true, val = {__data = "\000\000\000", __align = {<No data fields>}}}, max_calib_range = {is_none = true, val = {__data = "UU\000", __align = {<No data fields>}}}}
dshape = @0x5555572290a0: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 1, 4300000000, 0}, data_heap_ = 0x0}, <No data fields>}
channelAxis = 0
channelCount = 4300000000 <--------
(gdb) p dshape[channelAxis]
$1 = (long &) @0x5555572290a8: 4300000000  <--------

layer_norm()

Before changes:

Thread 1 "python3" hit Breakpoint 1, mxnet::op::LayerNormShape (attrs=..., in_shape=0x555556579dc8, out_shape=0x555556579de0) at src/operator/nn/layer_norm.cc:50
50	  const int channelCount = dshape[axis]; <========
(gdb) n
52	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) p channelCount
$3 = 5032704   <--------
(gdb) p dshape[0]
$4 = (long &) @0x555556c21f58: 4300000000 <--------
(gdb) info local
param = @0x7fffffff9418: {<dmlc::Parameter<mxnet::op::LayerNormParam>> = {<No data fields>}, axis = 0, eps = 9.99999975e-06, output_mean_var = false}
dshape = @0x555556c21f50: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 0, 0, 0}, data_heap_ = 0x0}, <No data fields>}
axis = 0
channelCount = 5032704
moments_shape = {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = -29512, num_heap_allocated_ = 32767, data_stack_ = {140737488326480, 140737488325376, 93825019642720, 140737488325376},
    data_heap_ = 0x7fff936c4de7
     <std::_Rb_tree<dmlc::parameter::FieldAccessEntry*, dmlc::parameter::FieldAccessEntry*, std::_Identity<dmlc::parameter::FieldAccessEntry*>, std::less<dmlc::parameter::FieldAccessEntry*>, std::allocator<dmlc::parameter::FieldAccessEntry*> >::_Alloc_node::operator()<dmlc::parameter::FieldAccessEntry* const&>(dmlc::parameter::FieldAccessEntry* const&) const+49>}, <No data fields>}

After Changes:

Thread 1 "python3" hit Breakpoint 2, mxnet::op::LayerNormShape (attrs=..., in_shape=0x555556578ff8, out_shape=0x555556579010) at src/operator/nn/layer_norm.cc:50
50	  const index_t channelCount = dshape[axis]; <========
(gdb) n
52	  if (!mxnet::ndim_is_known(dshape)) {
(gdb) info local
param = @0x7fffffff9438: {<dmlc::Parameter<mxnet::op::LayerNormParam>> = {<No data fields>}, axis = 0, eps = 9.99999975e-06, output_mean_var = false}
dshape = @0x5555565bc420: {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = 1, num_heap_allocated_ = 0, data_stack_ = {4300000000, 6878235116697514089, 32088647312828786, 0},
    data_heap_ = 0x0}, <No data fields>}
axis = 0
channelCount = 4300000000 <--------
moments_shape = {<mxnet::Tuple<long>> = {static kStackCache = <optimized out>, ndim_ = -29480, num_heap_allocated_ = 32767, data_stack_ = {140737488326512, 140737488325408, 93825021150800, 140737488325408},
    data_heap_ = 0x7fff936c4de7
     <std::_Rb_tree<dmlc::parameter::FieldAccessEntry*, dmlc::parameter::FieldAccessEntry*, std::_Identity<dmlc::parameter::FieldAccessEntry*>, std::less<dmlc::parameter::FieldAccessEntry*>, std::allocator<dmlc::parameter::FieldAccessEntry*> >::_Alloc_node::operator()<dmlc::parameter::FieldAccessEntry* const&>(dmlc::parameter::FieldAccessEntry* const&) const+49>}, <No data fields>}
(gdb) p dshape[axis]
$1 = (long &) @0x5555565bc428: 4300000000 <--------

Testing

$ MXNET_TEST_COUNT=1 nosetests --logging-level=DEBUG --verbose -s tests/nightly/test_large_vector.py:test_nn
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
test_large_vector.test_nn ... [18:14:51] src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[18:21:14] src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
ok

----------------------------------------------------------------------
Ran 1 test in 1017.457s

OK

access2rohit (Contributor, Author) commented on Mar 10, 2020

@apeforest @ChaiBapchya I don't know enough about layer_norm() or batch_norm() to add suitable shape checks to the tests. I have provided GDB outputs after fixing the code. Can you suggest proper shape tests that can be added to test_large_vector and test_large_array?

access2rohit changed the title from "fixing batch_norm and layer_norm for large tensors" to "fixing batch_norm and layer_norm for large tensor nightly test" on Mar 10, 2020
access2rohit (Contributor, Author) commented:
@mxnet-label-bot add [pr-awaiting-review]

lanking520 added the pr-awaiting-review (PR is waiting for code review) label on Mar 10, 2020
ChaiBapchya (Contributor) commented:

  1. How is the addition of SHAPE_ASSIGN_CHECK to layer_norm causing this failure?
     Layer norm/batch norm were passing before, and some change caused them to start failing, right? What's the root cause?

  2. Also, it turns out batch norm already has a shape check in test_large_array.py:
     https://github.com/apache/incubator-mxnet/blob/afb8742e6e1e987833b39c487dc892b5537196a1/tests/nightly/test_large_array.py#L327

Layer norm doesn't have such a check in test_large_array.py. Maybe you could add that.

Fundamentally, for both batch norm and layer norm, since the operation just performs normalization over the layer/batch, the input shape should equal the output shape (see the sketch below).
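
A sketch of the kind of check this suggests, in the style of the nightly large tensor tests (my illustration, not code from this PR; LARGE_X here is just an assumed placeholder for the large-dimension constant those tests define):

from mxnet import nd

LARGE_X = 4300000000  # assumed placeholder; the nightly tests define their own constant

def check_layer_norm_shape():
    data = nd.ones((LARGE_X,))
    gamma = nd.ones((LARGE_X,))
    beta = nd.zeros((LARGE_X,))
    out = nd.LayerNorm(data, gamma, beta, axis=0)
    # normalization does not change the data layout, so output shape must equal input shape
    assert out.shape == data.shape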

access2rohit (Contributor, Author) commented on Mar 10, 2020

@ChaiBapchya

1. How is the addition of SHAPE_ASSIGN_CHECK to layer_norm causing this failure?
   Layer norm/batch norm were passing before, and some change caused them to start failing, right? What's the root cause?

It was already incorrect when the check was added; see my GDB logs.

2. Also, it turns out batch norm already has a shape check in test_large_array.py:
   https://github.com/apache/incubator-mxnet/blob/afb8742e6e1e987833b39c487dc892b5537196a1/tests/nightly/test_large_array.py#L327

It's still incorrect.

Layer norm doesn't have such a check in test_large_array.py. Maybe you could add that.

Actually, it's better to add the check introduced in PR #17683. Currently I don't have cycles to work on this, so I have asked @sxjscience to see if he can add it, since I will be occupied for the next 2 weeks.

apeforest (Contributor) commented on Mar 10, 2020

It's very unlikely the number of channels will be greater than 2^31, so this should not cause a problem in practice. @sxjscience please confirm.

@access2rohit I don't fully understand the GDB outputs in your description. They seem to stop at different places; what do you want us to see?

access2rohit (Contributor, Author) commented:
@mxnet-label-bot update [pr-awaiting-merge]

lanking520 added the pr-awaiting-merge (Review and CI is complete. Ready to Merge) label and removed the pr-awaiting-review (PR is waiting for code review) label on Mar 16, 2020
apeforest merged commit 66b21b5 into apache:master on Mar 16, 2020
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
ChaiBapchya added a commit to ChaiBapchya/mxnet that referenced this pull request May 8, 2020
TaoLv pushed a commit that referenced this pull request May 11, 2020
Co-authored-by: Rohit Kumar Srivastava <[email protected]>

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
ChaiBapchya pushed a commit to ChaiBapchya/mxnet that referenced this pull request Sep 20, 2020
ChaiBapchya (Contributor) commented:

This needs to be cherry-picked into v1.x
Doing it now

samskalicky pushed a commit that referenced this pull request Sep 24, 2020
* fixing batch_norm and layer_norm for large tensors (#17805)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* Fix nightly large_vector test caused by incorrect with_seed path (#18178)

* add back the missing environment function

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
ChaiBapchya added a commit to ChaiBapchya/mxnet that referenced this pull request Sep 24, 2020
* fixing batch_norm and layer_norm for large tensors (apache#17805)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* Fix nightly large_vector test caused by incorrect with_seed path (apache#18178)

* add back the missing environment function

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
samskalicky pushed a commit that referenced this pull request Sep 24, 2020
* fixing batch_norm and layer_norm for large tensors (#17805)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* Fix nightly large_vector test caused by incorrect with_seed path (#18178)

* add back the missing environment function

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>