train_imagenet.py fails with float16 + alexnet #16239
Description
I found the following error when training an AlexNet using fp16. It is very easy to reproduce using the command below. I have seen a few people report this bug, but I haven't been able to find a solution.
If you replace
I was following the official tutorial here:
I think the root of the error is this line in metric.py:
Environment
I am using the latest MXNet 1.5.0 and CUDA 10.1. I am using the symbolic APIs.
Error Message:
Minimum reproducible example
Replies: 7 comments
Hey, this is the MXNet Label Bot.
@zhreshold @hetong007 Could you please take a look? It looks like a GluonCV problem.
@johnbroughton2017 Double-checking that you are using the symbolic example here: https://github.com/apache/incubator-mxnet/tree/master/example/image-classification. I just tried it using mxnet-cu100 1.5.0 and it's working fine. There are a couple of reasons it might fail
@zhreshold Thanks a lot for getting back to me. The GPU I am using is a GeForce RTX 2080. Could this be a difference between CUDA 10.0 and CUDA 10.1?
Yep, I can reproduce the error using CUDA 10.1.
@zhreshold @szha
Thanks for looking into this. Seems like a real issue after upgrading to CUDA 10.1.
I am wondering if this is related to mx.symbol.LRN. I tried to remove LRNs in the AlexNet and the error disappears. FYI.
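For anyone who wants to try the same experiment, here is a minimal sketch of an AlexNet-style first stage where the LRN layer can be switched off. The function name `alexnet_stage1` and its `use_lrn` and `sym` parameters are hypothetical, and the layer hyperparameters are assumed to mirror the AlexNet symbol in the image-classification example rather than copied from it. Dropping LRN changes the network, so this only helps narrow down the failure and is not a fix.

```python
def alexnet_stage1(data, use_lrn=True, sym=None):
    """Build conv -> relu -> (optional LRN) -> max-pool.

    `sym` defaults to `mxnet.sym`; it is a parameter only so the wiring
    can be checked without MXNet installed (hypothetical convenience).
    """
    if sym is None:
        # Imported lazily; requires an MXNet 1.x install.
        import mxnet as mx
        sym = mx.sym

    conv1 = sym.Convolution(data=data, kernel=(11, 11), stride=(4, 4),
                            num_filter=96, name="conv1")
    relu1 = sym.Activation(data=conv1, act_type="relu")

    out = relu1
    if use_lrn:
        # Assumed hyperparameters; adjust to match your own symbol file.
        out = sym.LRN(data=relu1, alpha=0.0001, beta=0.75, knorm=2, nsize=5)

    # With use_lrn=False the pooling layer consumes the ReLU output directly.
    return sym.Pooling(data=out, pool_type="max", kernel=(3, 3), stride=(2, 2))

# Example (requires MXNet):
#   import mxnet as mx
#   stage = alexnet_stage1(mx.sym.Variable("data"), use_lrn=False)
```

Training once with `use_lrn=True` and once with `use_lrn=False` under float16 should show whether the error follows the LRN layer on a given CUDA version.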