topk regression in v1.5 #15703
Comments
Hey, this is the MXNet Label Bot.
@leezu Thank you for reporting this.
I did a traversal through the nightly builds and the commits between the 1.4.0 and 1.5.0 releases, and found that the regression is most likely caused by #14570. @apeforest could you help to double-check?
@leezu Please propose whether we need to fix this issue in the 1.5.1 patch release. Thanks.
I thought that feature would not be enabled if the compilation flag isn't turned on. What causes this for builds without that flag?
I'm not sure. I just traced the issue to this commit by bisecting. The compilation flag was not set when I built MXNet from source.
@TaoLv The issue is not blocking me, but I find it concerning that #14570, which should have had no side effects when the compilation flag is unset, introduces this issue. It should be fixed in a 1.5.x point release.
Sorry, I just saw this. Looking into it now.
Interestingly, I found that turning on the USE_INT64_TENSOR_SIZE flag (meaning int64_t is used instead of int32_t as the index_t type) resolves the OOM issue. Still root-causing it.
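For context, the flag essentially selects which integer type mshadow uses for index_t. A minimal sketch of the selection, assuming the USE_INT64_TENSOR_SIZE build flag maps to the MSHADOW_INT64_TENSOR_SIZE preprocessor macro:

```cpp
#include <cstdint>

// Sketch of how mshadow picks the index type; the USE_INT64_TENSOR_SIZE build
// flag is assumed to define MSHADOW_INT64_TENSOR_SIZE=1 for the compiler.
#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;   // large-tensor builds: sizes up to 2^63 - 1
#else
typedef int32_t index_t;   // v1.5 default: size computations cap at 2^31 - 1
#endif
```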
Further narrowed it down to the topk operator. One implementation of TopKImpl did not allocate the correct amount of GPU memory. Working on a PR now.
Root cause found: it is due to this line: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L434. mshadow::Shape is constructed using index_t, which by default is int32_t in MXNet 1.5. In this case the workspace size is 3184736511, which exceeds 2^31 and hence causes an integer overflow.
Workaround: turn on the USE_INT64_TENSOR_SIZE compiler flag.
Possible Fix:
Lin
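To make the overflow concrete, a self-contained sketch of the arithmetic (the constant is the workspace size quoted above; the wrapped value assumes a typical two's-complement target):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Workspace size reported above; it does not fit in a signed 32-bit integer.
  int64_t required = 3184736511;                       // > 2^31 - 1 = 2147483647
  int32_t truncated = static_cast<int32_t>(required);  // what a 32-bit index_t ends up holding
  // On two's-complement targets this wraps around 2^32:
  // 3184736511 - 4294967296 = -1110230785, so the workspace
  // request becomes garbage instead of ~3 GB.
  std::printf("required=%lld truncated=%d\n",
              static_cast<long long>(required), truncated);
  return 0;
}
```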
@leezu Based on the analysis above, this is not really a memory usage regression but a bug due to integer overflow. The memory space required by the topk operator in your script is 2729810175, which exceeds 2^31 (the maximum of int32_t). It did not overflow in MXNet 1.4 because int64_t was used by default as the type for index_t. Therefore, this is another case where large integer support is needed in MXNet. Given that we plan to turn on the USE_INT64_TENSOR_SIZE flag by default in MXNet 1.6, would you be able to use the workaround of turning on the compiler flag manually and building MXNet from source? Please let me know if this solution is acceptable before the MXNet 1.6 release. Thanks!
Thank you for diving deep to find the root cause! I'm not blocked by this fix having to wait for MXNet 1.6, but we may want to ask @TaoLv as release manager. If the fix has to wait for 1.6, the regression should be documented, for example in the "known issues" section of the release notes: https://github.com/apache/incubator-mxnet/releases. What do you mean by "partially fixes" in #15948? Could that fix in principle be cherry-picked for 1.5.x?
@apeforest Thank you for the analysis. What's the blocker to getting this issue fixed on the v1.5.x branch?
@TaoLv This is not an issue (bug per se) but a limitation of the int32_t data type we use in MXNet. As I pointed out at https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L434, the workspace is created using a 1D mshadow::Shape object, whose length is bounded by the maximum value of index_t.
@leezu #15948 is a partial fix because it only fixed the memory misalignment, not the OOM caused by the int overflow. To really fix this issue, we need to support int64_t in MXNet by default.
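Until then, a defensive check along these lines (a hypothetical sketch, not the actual patch in #15948) would at least turn the silent wrap-around into a clear error:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: compute sizes in 64 bits and fail loudly if the result
// no longer fits in a 32-bit index_t, instead of letting it wrap around.
inline int32_t CheckedIndex(int64_t n) {
  if (n > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error(
        "workspace size exceeds 32-bit index_t; rebuild with USE_INT64_TENSOR_SIZE=1");
  }
  return static_cast<int32_t>(n);
}
```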
@apeforest It seems int32_t was not used for the topk operator prior to v1.5. Thus, since v1.5 changed the default for topk, it created this regression. This needs to be documented, given that most users will not be aware of the work on the USE_INT64_TENSOR_SIZE flag and why it introduced int32_t for topk.
@leezu int64_t was used by default in v1.4. However, we identified performance regressions (#14496, #14790) due to the change of data type and therefore introduced a compiler flag to switch the default data type back to int32_t in v1.5. We are working on performance optimization with int64_t and plan to turn it back on by default in v1.6.
Description
https://github.com/dmlc/gluon-nlp/blob/v0.7.1/scripts/word_embeddings/evaluate_pretrained.py stopped working with MXNet v1.5 due to out-of-memory errors. While the script runs fine within 16 GB of GPU memory (p3.2xlarge) with MXNet v1.4, it runs out of GPU memory even with 32 GB of GPU memory (p3dn.24xlarge) with MXNet v1.5.
Environment info (Required)
Minimum reproducible example
Alternatively:
python3 ./scripts/word_embeddings/evaluate_pretrained.py --embedding-name fasttext --embedding-source wiki.simple --gpu 0 --similarity-datasets --eval-batch-size 1024