
Fused Op causes MXNetError #16747

Closed
leezu opened this issue Nov 7, 2019 · 12 comments · Fixed by #16781

@leezu
Contributor

leezu commented Nov 7, 2019

Description

After #15167 was merged, GluonNLP CI broke.

Error Message

[2019-11-06T06:44:48.223Z] mxnet.base.MXNetError: Error in operator Embedding_Dropout_Embedding_Dropout__FusedOp__contrib_arange_like__FusedOp_broadcast_lesser__FusedOp_broadcast_mul__FusedOp_broadcast_mul_expand_dims_broadcast_axis_Embedding__FusedOp_broadcast_add_Dropout_amp_cast_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__Fused
Op_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected_Reshape_transpose_Reshape__contrib_div_sqrt_dim_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot__FusedOp_where_zeros_like__FusedOpHelper_amp_cast_softmax__FusedOp_Dropout_amp_cast_amp_cast_FullyConnected_Reshape_transpose__FusedOp_batch_dot_Reshape_transpose__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_amp_cast_amp_cast_amp_cast_FullyConnected__FusedOp_amp_cast_amp_cast_FullyConnected_Dropout__FusedOp_amp_cast_amp_cast_LayerNorm_SequenceMask__FusedOp_amp_cast_amp_cast_FullyConnected_Activation_Dropout_amp_cast_amp_cast_amp_cast_FullyConnected__backward_FullyConnected__backward_amp_cast__backward_Dropout__backward_Activation__backward_FullyConnected__FusedOp__FusedOpHelper__backward_slice__backward_SequenceMask__backward_LayerNorm__FusedOp__backward_Dropout__backward_FullyConnected__FusedOp__backward_FullyConnected__FusedOp__backward_LayerNorm__FusedOp__backward_Dropout__backward_FullyConnected__FusedOpHelper__backward_amp_cast__FusedOpHelper__backward_reshape_transpose__backward_reshape_batch_dot_batch_dot__FusedOp__backward_Dropout__
FusedOpHelper__backward_mul__FusedOpHelper__backward_amp_multicast__backward_amp_multicast: _Map_base::at

To Reproduce

git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp; pytest --color=yes -s scripts -k 'test_finetune_train[float16-WNLI-bert_12_768_12-2]'
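For readers without a GluonNLP checkout, a minimal sketch of the code path involved: the pointwise fusion pass runs when a hybridized Gluon network executes forward/backward on a GPU. The network, shapes, and dtype below are illustrative only and do not reproduce the BERT failure above.

```python
# Minimal sketch (not the failing GluonNLP test): exercise the GPU pointwise
# fusion pass by running forward/backward through a hybridized network.
import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu(0)  # fusion only applies to GPU graphs

net = gluon.nn.HybridSequential()
net.add(gluon.nn.Dense(128, activation='relu'),
        gluon.nn.Dropout(0.1),
        gluon.nn.Dense(2))
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True)  # builds the cached graph in which ops get fused

x = mx.nd.random.uniform(shape=(8, 32), ctx=ctx)
with autograd.record():
    loss = net(x).sum()
loss.backward()
mx.nd.waitall()  # flush the async engine so errors like the one above surface here
```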

Environment

https://pypi.org/project/mxnet-cu100/1.6.0b20191102/

@leezu leezu added the Bug label Nov 7, 2019
@leezu
Contributor Author

leezu commented Nov 7, 2019

@ptrendx

@sxjscience
Member

I suggest turning fused_op off by default in the 1.6.0 release and announcing it as an experimental feature, or reverting the PR. @szha @eric-haibin-lin @junrushao1994 @DickJC123 @wkcn @reminisce @haojin2 @TaoLv @marcoabreu What do you think?
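For reference, fusion can be toggled per process; a minimal sketch, assuming the MXNET_USE_FUSION environment variable added by #15167 is the switch for this pass (safest to set it before importing mxnet):

```python
# Sketch of a per-process workaround, assuming MXNET_USE_FUSION
# (added by #15167) controls the pointwise fusion pass.
# Equivalent from the shell: export MXNET_USE_FUSION=0
import os
os.environ['MXNET_USE_FUSION'] = '0'  # set before the hybridized graph is built

import mxnet as mx  # the rest of the script runs with fusion disabled
```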

@sxjscience
Member

@zhreshold

@wkcn
Member

wkcn commented Nov 7, 2019

I agree with turning fused_op off by default until it is stable.
The reason is that users won't be able to use the 1.6.0 release if it is not compatible with their code.

@TaoLv
Member

TaoLv commented Nov 7, 2019

+1

@ptrendx
Member

ptrendx commented Nov 7, 2019

Isn't right now exactly the period for finding and fixing those integration bugs for the 1.6 release? I will definitely look into this issue and fix it; I'm not sure why you propose to turn the feature off by default.

@sxjscience
Member

@ptrendx I think we are already in code freeze, and the simplest fix is to turn it off by default. We could easily turn it back on in 1.6.1 once we have confirmed that it has no impact on any of the training scripts (there are plenty of them, and some may take time to run).

@ptrendx ptrendx self-assigned this Nov 7, 2019
@ptrendx
Member

ptrendx commented Nov 7, 2019

Ok, I sent a clarification email to dev@, as you are not actually the first person to reach out to me with this misunderstanding of code freeze. Code freeze is a period in which bugs are found and fixed in order to polish the release and provide the best experience for the end users.

I am treating the fusion bugs with the highest priority and will do my best to fix them. If I fail to address all the issues before it is time to cut the RC, then I agree it should be turned off by default and marked experimental.

@leezu
Contributor Author

leezu commented Nov 7, 2019

I agree with @ptrendx: we should try to fix the bugs and ship the feature if time allows.

@sxjscience
Member

I received the clarification email about the meaning of code freeze, and I agree with @ptrendx that we should try to fix it in the coming days and consider turning it off by default if we fail to do so. BTW, what's the expected date for the 1.6 RC?

@ptrendx ptrendx mentioned this issue Nov 12, 2019
@ptrendx
Member

ptrendx commented Nov 12, 2019

I created a PR with a fix. @leezu, could you validate it?

@leezu
Contributor Author

leezu commented Nov 13, 2019

@ptrendx thanks for the fix. Just confirmed it works.
