batch_dot operator crash #20301

Closed
matteosal opened this issue May 24, 2021 · 33 comments

@matteosal
Contributor

batch_dot seems completely broken.

import mxnet as mx

sym = mx.sym.batch_dot(mx.sym.Variable('in1'), mx.sym.Variable('in2'))

ex = sym._bind(
	mx.cpu(), 
	{'in1': mx.nd.ones((2, 3, 4)), 'in2': mx.nd.ones((2, 4, 5))}
)
ex.forward()

Running this script produces:

[14:24:17] /home/matteo/Git/mxnet-build/Build/Linux-x86-64/MKL/mxnet/src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault
Stack trace:
Stack trace:
Segmentation fault (core dumped)

Maybe the problem is in my build. I'm building master from source with these settings (linking to MKL 2019.4):

 `# GENERAL FLAGS` \
 -DCMAKE_INSTALL_PREFIX=$output_dir \
 -DCMAKE_BUILD_TYPE=Release \
 -DCMAKE_SKIP_BUILD_RPATH=On \
 -DUSE_OPENCV=OFF \
 -DUSE_F16C=Off `# float16 support`\
 -DUSE_INT64_TENSOR_SIZE=OFF \
 -DCMAKE_C_FLAGS_RELEASE="-DNDEBUG" \
 -DCMAKE_CXX_FLAGS_RELEASE="-DNDEBUG" \
 `# MATH BACKENDS` \
 -DBLAS=MKL \
 -DUSE_LAPACK=OFF \
 -DUSE_ONEDNN=OFF \
 -DBLA_VENDOR="Intel10_64ilp" \
 -DBLA_STATIC=OFF \
 -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF \
 -DMKL_INCLUDE_DIR=$mkl_dir \
 -DBLAS_LIBRARIES="$mkl_dir/libmkl_def.so;$mkl_dir/libmkl_intel_ilp64.so;$mkl_dir/libmkl_core.so;$mkl_dir/libmkl_intel_thread.so;$mkl_dir/libiomp5.so" \
 `# OPENMP` \
 -DUSE_OPENMP=ON \
 -DOpenMP_C_FLAGS="-I$mkl_dir" \
 -DOpenMP_C_LIB_NAMES="libiomp5" \
 -DOpenMP_CXX_FLAGS="-I$mkl_dir" \
 -DOpenMP_CXX_LIB_NAMES="libiomp5" \
 -DOpenMP_libiomp5_LIBRARY="$mkl_dir/libiomp5.so" \
 `# CUDA` \
 -DUSE_CUDA=OFF \
@anko-intel
Contributor

@bartekkuncer please take a look

@matteosal
Contributor Author

Any news on this?

@bartekkuncer
Contributor

bartekkuncer commented Jun 24, 2021

Hello @matteosal, sorry for the late response. I tried to reproduce the issue on the master branch but had no success. Which branch are you working on?

@matteosal
Contributor Author

I have updated to the latest master and still see the crash. Are you using the same build settings I reported?

@bartekkuncer
Contributor

@matteosal almost. The only difference is that I used a newer version of MKL. I will try with yours.

@bartekkuncer
Contributor

bartekkuncer commented Jun 29, 2021

@matteosal I tried to reproduce your bug using your exact build config but was unable to. I tried your version of MKL as well as a newer and an older one - all worked without any problems :(

Please try running your test with the MKL_VERBOSE flag set, e.g. MKL_VERBOSE=1 python test.py, and paste the output in a comment. Please also tell me exactly which OS you are using, where/how you got your MKL library, and which compiler you are using.

@matteosal
Contributor Author

matteosal commented Jun 29, 2021

Setting MKL_VERBOSE=1 doesn't change anything.

I have also tried building with MKL 2020.1 and got the same crash, plus a symbol lookup issue:

[16:43:33] /home/matteo/Git/mxnet-build/Build/Linux-x86-64/MKL/mxnet/src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault

Fatal Error: Segmentation fault
INTEL MKL ERROR: /home/matteo/Git/mxnet-build/Build/Linux-x86-64/MKL/mkl/libmkl_avx2.so: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.
Stack trace:
Segmentation fault (core dumped)

This makes me think that I'm doing something wrong for this one.

But since MKL_VERBOSE=1 doesn't change anything and the symbol lookup messages show up after the segfaults, maybe the problem is hit before MKL is even reached?
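One cheap way to check where the fault fires on the Python side is the stdlib faulthandler module (note that MXNet registers its own signal handler on import, which may take precedence); a minimal sketch:

import faulthandler
faulthandler.enable()  # dump the Python-level traceback on SIGSEGV

import mxnet as mx

sym = mx.sym.batch_dot(mx.sym.Variable('in1'), mx.sym.Variable('in2'))
ex = sym._bind(
    mx.cpu(),
    {'in1': mx.nd.ones((2, 3, 4)), 'in2': mx.nd.ones((2, 4, 5))}
)
ex.forward()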

@bartekkuncer
Contributor

What OS and compiler are you using?

@matteosal
Contributor Author

Linux/gcc

@bgawrych
Contributor

@matteosal Have you tried using

export LD_PRELOAD="$mkl_dir/libmkl_def.so:$mkl_dir/libmkl_intel_ilp64.so:$mkl_dir/libmkl_core.so:$mkl_dir/libmkl_intel_thread.so:$mkl_dir/libiomp5.so"

?
(Entries in LD_PRELOAD are colon-separated; use the proper $mkl_dir path.)

Also, can you give the exact OS version?

@matteosal
Contributor Author

@bgawrych I rebuilt everything from scratch with both mkl 2019.4 and 2020.1 and now I see the exact same behaviour (segfault messages without symbol lookup errors)

Full versions are Ubuntu 20.04 + gcc 9.3.0

@bgawrych
Contributor

bgawrych commented Jul 1, 2021

@matteosal Are you sure your $mkl_dir path is the proper one? In my environment I have
-DMKL_INCLUDE_DIR=/home/bg/miniconda3/envs/mxnet/include
-DBLAS_LIBRARIES="/home/bg/miniconda3/envs/mxnet/lib/libmkl_def.so;
    /home/bg/miniconda3/envs/mxnet/lib/libmkl_intel_ilp64.so;
    /home/bg/miniconda3/envs/mxnet/lib/libmkl_core.so;
    /home/bg/miniconda3/envs/mxnet/lib/libmkl_intel_thread.so;
    /home/bg/miniconda3/envs/mxnet/lib/libiomp5.so"

Notice that MKL_INCLUDE_DIR points to a different path than BLAS_LIBRARIES (include vs lib)

@matteosal
Contributor Author

matteosal commented Jul 1, 2021

@matteosal Are you sure your $mkl_dir path is the proper one? In my environment I have
-DMKL_INCLUDE_DIR=/home/bg/miniconda3/envs/mxnet/include
-DBLAS_LIBRARIES="/home/bg/miniconda3/envs/mxnet/lib/libmkl_def.so;
    /home/bg/miniconda3/envs/mxnet/lib/libmkl_intel_ilp64.so;
    /home/bg/miniconda3/envs/mxnet/lib/libmkl_core.so;
    /home/bg/miniconda3/envs/mxnet/lib/libmkl_intel_thread.so;
    /home/bg/miniconda3/envs/mxnet/lib/libiomp5.so"

Notice that MKL_INCLUDE_DIR points to a different path than BLAS_LIBRARIES (include vs lib)

Yes, that's because for some reason our reference internal MKL checkout has a non-standard file layout: all includes and libraries are dumped into the same folder ($mkl_dir in my script).
If there were a problem with this it would have failed at build time, and we have been building and intensively using mxnet with this setup for years. Also, I don't see any symbol lookup errors anymore, so I'd rule out that something is wrong with the MKL path.

@bgawrych can you suggest another operator which uses MKL primitives that I can try out? It should give us more information

@matteosal
Contributor Author

I have tried to set -DUSE_ONEDNN=ON and the example doesn't crash anymore.

Also I've realized that without -DUSE_ONEDNN=ON the symbol lookup error is inconsistent and independent of any setting: running the example multiple times sometimes prints the message, sometimes not (but always crashes with the segfault).

@bgawrych
Contributor

bgawrych commented Jul 1, 2021

@matteosal use MKL_VERBOSE=1 when you're running your reproduction -
I suspect that something is wrong with your MKL installation. Can you try using MKL from a conda env?
conda install mkl mkl-include mkl-service -c intel

INTEL MKL ERROR: /home/matteo/Git/mxnet-build/Build/Linux-x86-64/MKL/mkl/libmkl_avx2.so: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.
Stack trace:
Segmentation fault (core dumped)

Another operator which uses MKL is LayerNorm, but MXNet must be built with the MXNET_USE_MKL_LAYERNORM=1 flag

@bgawrych
Contributor

bgawrych commented Jul 1, 2021

I have tried to set -DUSE_ONEDNN=ON and the example doesn't crash anymore.

Also I've realized that without -DUSE_ONEDNN=ON the symbol lookup error is inconsistent and independent of any setting: running the example multiple times sometimes prints the message, sometimes not (but always crashes with the segfault).

A few days ago @bartekkuncer added support for oneDNN batch_dot in MXNet, so with oneDNN enabled MKL is not used

@matteosal
Contributor Author

@matteosal use MKL_VERBOSE=1 when you're running your reproduction -
I suspect that something is wrong with your MKL installation. Can you try using MKL from a conda env?
conda install mkl mkl-include mkl-service -c intel

I've rebuilt linking to this MKL (which should be version 2021.2) and still get the same crash.

I have also tried building with -DUSE_MKL_LAYERNORM=ON (and my version of MKL) and this script doesn't crash:

import mxnet as mx

sym = mx.sym.LayerNorm(mx.sym.Variable('data'), mx.sym.Variable('gamma'), mx.sym.Variable('beta'))

ex = sym._bind(
    mx.cpu(), 
    {'data': mx.nd.ones((2, 3)), 'gamma': mx.nd.ones((3,)), 'beta': mx.nd.ones((3,))}
)
ex.forward()

print('done')

@szha
Member

szha commented Jul 1, 2021

Did both of you update submodules? git submodule update --init --recursive

@matteosal
Contributor Author

Did both of you update submodules? git submodule update --init --recursive

Yes, just rebuilt again after this to double check

@szha
Member

szha commented Jul 1, 2021

Might be a good idea for both of you to run this script to report the instruction sets supported too.

@matteosal
Contributor Author

Might be a good idea for both of you to run this script to report the instruction sets supported too.

Here is the result:

----------Python Info----------
Version      : 3.8.10
Compiler     : GCC 9.4.0
Build        : ('default', 'Jun  2 2021 10:49:15')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 20.0.2
Directory    : /usr/lib/python3/dist-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /home/matteo/Git/mxnet/python/mxnet
Commit hash file "/home/matteo/Git/mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
Library      : ['/home/matteo/Git/mxnet/python/mxnet/../../lib/libmxnet.so']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ TENSORRT
✖ CUTENSOR
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✖ CPU_SSE4_1
✖ CPU_SSE4_2
✖ CPU_SSE4A
✖ CPU_AVX
✖ CPU_AVX2
✔ OPENMP
✖ SSE
✖ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✔ BLAS_MKL
✖ BLAS_APPLE
✖ LAPACK
✖ ONEDNN
✖ OPENCV
✖ DIST_KVSTORE
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform     : Linux-5.8.0-59-generic-x86_64-with-glibc2.29
system       : Linux
node         : pajarulo
release      : 5.8.0-59-generic
version      : #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           158
Model name:                      Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Stepping:                        13
CPU MHz:                         2574.861
CPU max MHz:                     5000,0000
CPU min MHz:                     800,0000
BogoMIPS:                        4800.00
Virtualization:                  VT-x
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        2 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Mitigation; TSX disabled
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0008 sec, LOAD: 0.5831 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1331 sec, LOAD: 0.4816 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1131)>, DNS finished in 0.08226609230041504 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0319 sec, LOAD: 0.8614 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0160 sec, LOAD: 0.5705 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.11925387382507324 sec.
----------Environment----------

@bgawrych
Contributor

bgawrych commented Jul 2, 2021

@matteosal I've got a reproduction and will try to figure out the root cause - is using oneDNN a sufficient workaround for now?

@bgawrych
Contributor

bgawrych commented Jul 2, 2021

@matteosal These were my reproduction steps:

conda create -n dotmkl3 python=3.7.7
conda activate dotmkl3
conda install ninja cmake
conda install mkl mkl-include mkl-service -c intel
cd build
cmake -GNinja \
 `# GENERAL FLAGS` \
 -DCMAKE_BUILD_TYPE=Release \
 -DCMAKE_SKIP_BUILD_RPATH=On \
 -DUSE_OPENCV=OFF \
 -DUSE_F16C=Off `# float16 support`\
 -DUSE_INT64_TENSOR_SIZE=OFF \
 -DCMAKE_C_FLAGS_RELEASE="-DNDEBUG" \
 -DCMAKE_CXX_FLAGS_RELEASE="-DNDEBUG" \
 `# MATH BACKENDS` \
 -DBLAS=MKL \
 -DUSE_LAPACK=OFF \
 -DUSE_ONEDNN=OFF \
 -DBLA_VENDOR="Intel10_64ilp" \
 -DBLA_STATIC=OFF \
 -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF \
 -DMKL_INCLUDE_DIR=/home/bg/anaconda3/envs/dotmkl3/include \
 -DBLAS_LIBRARIES="/home/bg/anaconda3/envs/dotmkl3/lib/libmkl_intel_ilp64.so;/home/bg/anaconda3/envs/dotmkl3/lib/libmkl_core.so;/home/bg/anaconda3/envs/dotmkl3/lib/libmkl_intel_thread.so;/home/bg/anaconda3/envs/dotmkl3/lib/libiomp5.so" \
 -DUSE_OPENMP=ON \
 -DOpenMP_C_FLAGS="-I/home/bg/anaconda3/envs/dotmkl3/include" \
 -DOpenMP_C_LIB_NAMES="libiomp5" \
 -DOpenMP_CXX_FLAGS="-I/home/bg/anaconda3/envs/dotmkl3/include" \
 -DOpenMP_CXX_LIB_NAMES="libiomp5" \
 -DOpenMP_libiomp5_LIBRARY="/home/bg/anaconda3/envs/dotmkl3/lib/libiomp5.so" \
 `# CUDA` \
 -DUSE_CUDA=OFF ..

There is an issue with BLAS_LIBRARIES - after changing it to
-DBLAS_LIBRARIES="/home/bg/anaconda3/envs/dotmkl3/lib/libmkl_rt.so" everything seems to be fine

[screenshot: MKL_VERBOSE output of the run]

@matteosal
Contributor Author

matteosal commented Jul 8, 2021

Linking to libmkl_rt without setting special MKL environment variables is equivalent to linking to the LP64 version of MKL: https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/linking-your-application-with-the-intel-oneapi-math-kernel-library/linking-quick-start/using-the-single-dynamic-library.html
This is confirmed by your MKL_VERBOSE output, which reports lp64. So this is a no-go for me, as I have to use ILP64.

But this also tells us that the problem is likely an integer-size mismatch.

One way to confirm this is to run the example with the library linked to libmkl_rt while specifying MKL_INTERFACE_LAYER=ILP64 (https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/linking-your-application-with-the-intel-oneapi-math-kernel-library/linking-in-detail/dynamically-selecting-the-interface-and-threading-layer.html). I expect it to crash in this case.
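A minimal sketch of that test, assuming the build links against libmkl_rt (MKL reads MKL_INTERFACE_LAYER when it is first loaded, so it must be set before importing mxnet):

import os

# Select the ILP64 interface of libmkl_rt at runtime; this has to happen
# before the library is loaded, i.e. before mxnet is imported.
os.environ['MKL_INTERFACE_LAYER'] = 'ILP64'

import mxnet as mx

sym = mx.sym.batch_dot(mx.sym.Variable('in1'), mx.sym.Variable('in2'))
ex = sym._bind(
    mx.cpu(),
    {'in1': mx.nd.ones((2, 3, 4)), 'in2': mx.nd.ones((2, 4, 5))}
)
ex.forward()  # expected to crash if the integer-size mismatch theory holds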
Another test is to go back to passing all the BLAS_LIBRARIES explicitly, but specifying the LP64 version there and in BLA_VENDOR. This setup should not crash.

I'm trying to test these setups myself but I'm having other kinds of unrelated problems blocking me right now.

@matteosal
Contributor Author

Another test is to go back to passing all the BLAS_LIBRARIES explicitly, but specifying the LP64 version there and in BLA_VENDOR. This setup should not crash.

I just managed to verify this

@bgawrych
Contributor

bgawrych commented Jul 9, 2021

@matteosal Then why are you disabling it in cmake (USE_INT64_TENSOR_SIZE)?

cmake -GNinja \
 `# GENERAL FLAGS` \
 -DCMAKE_BUILD_TYPE=Release \
 -DCMAKE_SKIP_BUILD_RPATH=On \
 -DUSE_OPENCV=OFF \
 -DUSE_F16C=Off `# float16 support`\
 -DUSE_INT64_TENSOR_SIZE=ON \
 -DCMAKE_C_FLAGS_RELEASE="-DNDEBUG" \
 -DCMAKE_CXX_FLAGS_RELEASE="-DNDEBUG" \
 `# MATH BACKENDS` \
 -DBLAS=MKL \
 -DUSE_LAPACK=OFF \
 -DUSE_ONEDNN=OFF \
 -DBLA_STATIC=OFF \
 -DMKL_USE_SINGLE_DYNAMIC_LIBRARY=OFF \
 -DMKL_INCLUDE_DIR=/usr/local/include \
 -DBLAS_LIBRARIES="/usr/local/lib/libmkl_rt.so" \
 -DUSE_OPENMP=ON \
 -DOpenMP_C_FLAGS="-I/usr/local/include" \
 -DOpenMP_C_LIB_NAMES="libiomp5" \
 -DOpenMP_CXX_FLAGS="-I/usr/local/include" \
 -DOpenMP_CXX_LIB_NAMES="libiomp5" \
 -DOpenMP_libiomp5_LIBRARY="/usr/local/lib/libiomp5.so" \
 `# CUDA` \
 -DUSE_CUDA=OFF ..

This one works for me with large tensor support - tested with tests/nightly/test_large_array.py::test_nn (modified, as it doesn't currently work on master)

@matteosal
Contributor Author

matteosal commented Jul 10, 2021

@matteosal Then why are you disabling it in cmake (USE_INT64_TENSOR_SIZE)?

I was keeping USE_INT64_TENSOR_SIZE disabled because it was causing this problem: #19841
But that was on branch 1.x; we are now switching to 2.0 and I've just verified that the problem is not there. So I'd say that ILP64 + USE_INT64_TENSOR_SIZE=ON satisfies all my needs 😃

Anyway, are USE_INT64_TENSOR_SIZE and ILP64 MKL actually expected to go together? If they are not, there should be an explicit consistency check in the cmake script that fails with an appropriate message. Building with OpenBLAS triggers such a check, but building with MKL doesn't. I should also say that building with -DUSE_INT64_TENSOR_SIZE=OFF + ILP64 MKL didn't cause the batch_dot crash in mxnet 1.x (1.6 in particular). Another question: does the CI test a build linking to ILP64 MKL?
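As an aside, one can check at runtime which features a given libmxnet build has enabled (the same list the diagnose output above prints); a minimal sketch using mxnet.runtime:

from mxnet.runtime import Features

features = Features()
# True only if the library was built with -DUSE_INT64_TENSOR_SIZE=ON
print(features.is_enabled('INT64_TENSOR_SIZE'))
print(features.is_enabled('BLAS_MKL'))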

@bgawrych
Contributor

I see only OpenBLAS + USE_INT64_TENSOR_SIZE in the test scripts - MKL is probably not tested in this configuration - can you confirm @leezu @szha ?
I will add a similar check for MKL as for OpenBLAS :)

@szha
Member

szha commented Jul 13, 2021

I'm not aware of an explicit check for that combo

@chinakook
Contributor

chinakook commented Aug 1, 2021

I have this problem on CPU too. In the Anaconda base environment both batch_dot and dot run, but in a newly created user conda environment they crash. MXNet compiled with MKL and MXNet compiled with OpenBLAS both crash in this case.

@bartekkuncer
Contributor

I have this problem on CPU too. In the Anaconda base environment both batch_dot and dot run, but in a newly created user conda environment they crash. MXNet compiled with MKL and MXNet compiled with OpenBLAS both crash in this case.

@chinakook What version of mxnet are you using? Can you provide us with output logs with the MKLDNN_VERBOSE and MKL_VERBOSE flags set to 1? (e.g. MKLDNN_VERBOSE=1 MKL_VERBOSE=1 python script_name.py)

@chinakook
Contributor

@bartekkuncer I've tested your flags, and it's not related to MKL. In my case, Anaconda can run dot without crashing, but Miniconda cannot run it even if I copy all files from the Anaconda base env to the Miniconda user-created env.
My MXNet is a custom 2.0 version on Windows. My test script:

import mxnet as mx

def test_dot():
    ctx = mx.cpu(0)
    # a smaller ndarray can run without error
    a = mx.nd.random.uniform(shape=(536, 771, 3), ctx=ctx)
    b = mx.nd.random.uniform(shape=(3, 3), ctx=ctx)
    c = mx.nd.dot(a, b)
    mx.nd.waitall()
    print(c.shape)

if __name__ == '__main__':
    test_dot()

@chinakook
Contributor

@bartekkuncer I used the OpenBLAS offered officially by mxnet, and it's fixed. Thanks.
