train_imagenet.py fails with float16 + alexnet #16239

johnbroughton2017 · 2019-09-22T21:58:45Z

johnbroughton2017
Sep 22, 2019

Description

I hit the following error when training AlexNet with fp16. It is very easy to reproduce using the command below. I have seen a few other people report this bug, but I haven't been able to find a solution. If you replace alexnet with resnet-v1, it works fine.

I was following the official tutorial here:
https://mxnet.incubator.apache.org/api/faq/float16

I think the root of the error is this line in metric.py:
pred_label.asnumpy()

Environment

I am using MXNet 1.5.0 (the latest release) with CUDA 10.1, and the Symbolic API.

Error Message:

  File "/home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/metric.py", line 350, in update_dict
    metric.update_dict(labels, preds)
  File "/home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/metric.py", line 133, in update_dict
    self.update(label, pred)
  File "/home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/metric.py", line 501, in update
    pred_label = pred_label.asnumpy().astype('int32')
  File "/home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [15:40:12] include/mxnet/././tensor_blob.h:236: Check failed: mshadow::DataType<DType>::kFlag == type_flag_: TBlob.get_with_shape: data type do not match specified type.Expected: 2 v.s. given 0
Stack trace:
  [bt] (0) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x4a3b8b) [0x7f613fbf0b8b]
  [bt] (1) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x304366d) [0x7f614279066d]
  [bt] (2) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x3553e98) [0x7f6142ca0e98]
  [bt] (3) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2675018) [0x7f6141dc2018]
  [bt] (4) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x267abf5) [0x7f6141dc7bf5]
  [bt] (5) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x265a8a1) [0x7f6141da78a1]
  [bt] (6) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x265ddb0) [0x7f6141daadb0]
  [bt] (7) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x265e046) [0x7f6141dab046]
  [bt] (8) /home/xinghua/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2659004) [0x7f6141da6004]
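
For readers decoding the message above: the numbers in "Expected: 2 v.s. given 0" are mshadow type flags. The sketch below maps them to dtype names, assuming the standard mshadow TypeFlag numbering (0 = float32, 2 = float16, and so on); the helper name decode_dtype_mismatch is hypothetical and is not part of MXNet.

```python
import re

# Assumed mshadow TypeFlag numbering (mshadow/base.h).
MSHADOW_TYPE_FLAGS = {
    0: "float32",
    1: "float64",
    2: "float16",
    3: "uint8",
    4: "int32",
    5: "int8",
    6: "int64",
}

# Matches the tail of the TBlob check message, e.g. "Expected: 2 v.s. given 0".
_MISMATCH = re.compile(r"Expected:\s*(\d+)\s*v\.s\.\s*given\s*(\d+)")


def decode_dtype_mismatch(message):
    """Return (expected, given) dtype names from the error text, or None."""
    match = _MISMATCH.search(message)
    if match is None:
        return None
    expected, given = (int(group) for group in match.groups())
    return (
        MSHADOW_TYPE_FLAGS.get(expected, "unknown(%d)" % expected),
        MSHADOW_TYPE_FLAGS.get(given, "unknown(%d)" % given),
    )


msg = ("TBlob.get_with_shape: data type do not match specified type."
       "Expected: 2 v.s. given 0")
print(decode_dtype_mismatch(msg))  # ('float16', 'float32')
```

Read this way, the failed check is a float16/float32 mismatch. MXNet executes operators asynchronously, so an error raised inside an operator can surface at a synchronization point such as asnumpy() rather than at the line that caused it.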

Minimum reproducible example

python train_imagenet.py --network alexnet --benchmark 1 --gpus 0 --batch-size 64 --dtype float16
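
To make it easier to compare runs, here is a minimal sketch that builds the command above as an argument list so that one flag can be varied at a time. The helper build_repro_command is hypothetical, and the float32 control run is an assumption about what would be useful to compare; only the flags shown in the command above are used. Other networks may need additional flags that are not shown here.

```python
def build_repro_command(network="alexnet", dtype="float16", gpus="0", batch_size=64):
    """Build the argv list for the benchmark run reported in this issue."""
    return [
        "python", "train_imagenet.py",
        "--network", network,
        "--benchmark", "1",
        "--gpus", str(gpus),
        "--batch-size", str(batch_size),
        "--dtype", dtype,
    ]


# The failing configuration from this report.
failing = build_repro_command()
# Hypothetical control run: same network, float32 instead of float16.
control = build_repro_command(dtype="float32")

print(" ".join(failing))
print(" ".join(control))
```

Either list can be passed to subprocess.call from the example/image-classification directory.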

mxnet-label-bot · 2019-09-22T21:58:49Z

mxnet-label-bot
Sep 22, 2019

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended label(s): Bug

lanking520 · 2019-09-24T20:25:25Z

lanking520
Sep 24, 2019
Collaborator

@zhreshold @hetong007 Could you please take a look? It looks like a GluonCV problem.

zhreshold · 2019-09-24T22:36:00Z

zhreshold
Sep 24, 2019
Collaborator

@johnbroughton2017 Double checking that you are using the symbolic example here: https://github.com/apache/incubator-mxnet/tree/master/example/image-classification

I just tried it using mxnet-cu100 1.5.0 and it's working fine.

There are a couple of reasons it might fail:

Are you using mxnet built with cuda?
Are you using a GPU that supports fp16? GPUs older than Volta arch can't support fp16.

johnbroughton2017 · 2019-09-25T02:38:56Z

johnbroughton2017
Sep 25, 2019
Author

@zhreshold Thanks a lot for getting back to me.

The GPU I am using is GeForce RTX 2080
The mxnet version I am using is mxnet-cu101mkl
I just tried mxnet-cu101 which gave the same error.

Could this be caused by a difference between CUDA 10.0 and CUDA 10.1?
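
On the GPU question, a small sketch of the "Volta or newer" check, taking that threshold at face value. It relies on NVIDIA compute capabilities: Pascal is 6.x, Volta is 7.0, and Turing (which includes the RTX 2080) is 7.5. The function name is hypothetical.

```python
# Compute capability of the first Volta GPUs.
VOLTA = (7, 0)


def is_volta_or_newer(major, minor):
    """Return True if the compute capability is at least Volta's 7.0."""
    return (major, minor) >= VOLTA


RTX_2080 = (7, 5)   # Turing
GTX_1080 = (6, 1)   # Pascal, shown for contrast

print(is_volta_or_newer(*RTX_2080))  # True
print(is_volta_or_newer(*GTX_1080))  # False
```

By this check the RTX 2080 is not ruled out, which suggests the hardware is not the cause.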

zhreshold · 2019-09-25T21:22:32Z

zhreshold
Sep 25, 2019
Collaborator

Yep, I can reproduce the error using CUDA 10.1.
@szha do you know who's familiar with it?

johnbroughton2017 · 2019-09-26T23:43:11Z

johnbroughton2017
Sep 26, 2019
Author

@zhreshold @szha
Thanks for looking into this. Seems like a real issue after upgrading to CUDA 10.1.

I wonder whether this is related to mx.symbol.LRN. I removed the LRN layers from AlexNet and the error disappeared. FYI.

samskalicky · 2019-10-07T19:02:34Z

samskalicky
Oct 7, 2019
Collaborator

@zachgk assign [@szha ]
