Test failure for Ubuntu 20.04 ppc64le architecture #161
Comments
Looks like this UT is not runnable on any host other than x86. We will fix this in the release. Also, we would be grateful if you could fix it and contribute the fix to the Knowhere repo. |
/assign @Presburger |
Well, I tested this on ubuntu:20.04 (aarch64) as well as almalinux:8 (aarch64), and the tests run successfully on that architecture. I would like to contribute a PR to make this work on ppc64le, but currently I have no idea why this fails :/ That's why I'm asking if you guys have an idea :) However, I will create a PR to add basic ppc64le support, since the build process does work. |
@mgiessing Hi. There is a large amount of architecture-specific intrinsic code inside 'knowhere'. Currently, supporting PPC is not cost-effective for us. Thank you. |
This is totally understandable and I don't expect you to implement ppc64le-specific vector code etc. :) My initial question was more about whether you know why this test might fail, because I could not make any sense of it and the only intrinsic code I've seen/found was As I said, I will create a PR to add basic ppc64le support, and if I get more time I will implement VSX for accelerated SIMD operations. Thanks! |
Thanks for your contribution! I will hold this until:
|
It is because when we init knowhere, we dynamically set the SIMD level based on the CPU instruction set; you could refer to |
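A minimal sketch of that kind of init-time dispatch, assuming a single function pointer per kernel and a scalar reference implementation as the portable fallback. The names `DistanceFn`, `inner_product_fn`, `inner_product_ref`, and `setup_hook` are illustrative only and are not Knowhere's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Signature shared by all variants of the kernel (hypothetical name).
using DistanceFn = float (*)(const float*, const float*, std::size_t);

// Scalar reference kernel: portable, uses no intrinsics.
float inner_product_ref(const float* x, const float* y, std::size_t d) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Globals that are bound once at init time (hypothetical names).
DistanceFn inner_product_fn = nullptr;
std::string simd_type;

// Hypothetical init hook: pick the kernel for the current CPU.
void setup_hook() {
    // Bind the portable kernel first so every architecture, including
    // ppc64le, ends up with a valid pointer.
    inner_product_fn = inner_product_ref;
    simd_type = "GENERIC";
#if defined(__x86_64__)
    // An x86 build would query CPU features here and rebind
    // inner_product_fn to a matching intrinsic kernel.
    // That part is omitted in this sketch.
#endif
    assert(inner_product_fn != nullptr);
}
```

Calling `setup_hook()` once before any distance computation is the assumption here; an architecture with no dedicated branch then runs on the scalar path.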
@mgiessing Could you please rebase your #162 on top of #163 which got merged? Thanks |
@alexanderguzhva I just rebased my PR :-) @chasingegg Thanks a lot for your hint, after adding a ppc64le section (using +
+#if defined(__powerpc64__)
+ fvec_inner_product = fvec_inner_product_ref;
+ fvec_L2sqr = fvec_L2sqr_ref;
+ fvec_L1 = fvec_L1_ref;
+ fvec_Linf = fvec_Linf_ref;
+
+ fvec_norm_L2sqr = fvec_norm_L2sqr_ref;
+ fvec_L2sqr_ny = fvec_L2sqr_ny_ref;
+ fvec_inner_products_ny = fvec_inner_products_ny_ref;
+ fvec_madd = fvec_madd_ref;
+ fvec_madd_and_argmin = fvec_madd_and_argmin_ref;
+
+ simd_type = "GENERIC";
+ support_pq_fast_scan = false;
+#endif
}
However, it now fails during the binary mmap search test. I'll try to spend some time this evening/week debugging that further: [...]
I1025 08:34:52.465216 195767 factory.cc:20] [KNOWHERE][Create][knowhere_tests] create knowhere index BIN_IVF_FLAT with version 1
{"dim":8,"enable_mmap":true,"k":5,"metric_type":"SUPERSTRUCTURE","nlist":16,"nprobe":8}
terminate called after throwing an instance of 'faiss::FaissException'
what(): Error in virtual void faiss::IndexBinaryIVF::train(faiss::IndexBinary::idx_t, const uint8_t*) at /knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:300: IVF not to support Substructure and Superstructure.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
knowhere_tests is a Catch2 v3.3.1 host application.
Run with -? for options
-------------------------------------------------------------------------------
Search binary mmap
Test Search
-------------------------------------------------------------------------------
/knowhere/tests/ut/test_mmap.cc:372
...............................................................................
/knowhere/tests/ut/test_mmap.cc:373: FAILED:
{Unknown expression after the reported line}
due to a fatal error condition:
name := "BIN_IVF_FLAT"
cfg_json := "{"dim":8,"enable_mmap":true,"k":5,"metric_type":
"SUPERSTRUCTURE","nlist":16,"nprobe":8}"
SIGABRT - Abort (abnormal termination) signal
===============================================================================
test cases: 15 | 14 passed | 1 failed
assertions: 112368250 | 112368249 passed | 1 failed
Aborted (core dumped) Thanks for all your help so far! |
Hi @mgiessing , sorry for being late. I've tried reproducing this issue on QEMU ppc64 and it seems that it can be reproduced! So, basically, there's something wrong with the exception handling O_o (to my BIG Surprise), something non-trivial. I'll take a further look. Meanwhile, please feel free to rebase your PR on top of the master branch, including changes for the hook and I'll accept your change. Thanks. |
No problem - I appreciate your effort looking into this :) I also tried to debug a bit further with gdb, however I wasn't entirely sure if this had something to do with steps involved before throwing the exception (e.g. at 11: Btw. this backtrace was from RHEL8 with IBM advanced toolchain 15 (gcc 11.4.1), but the error is the same as on Ubuntu/gcc
(gdb) bt
#0 0x00007fff7ef94a7c in pthread_kill () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#1 0x00007fff7ef2ecdc in raise () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#2 0x00007fff7ef0c554 in abort () from /opt/at15.0/lib64/glibc-hwcaps/power9/libc.so.6
#3 0x00007fff7f2944a8 in __gnu_cxx::__verbose_terminate_handler() () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#4 0x00007fff7f28fb84 in ?? () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#5 0x00007fff7f28db78 in ?? () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#6 0x00007fff7f28eee8 in __gxx_personality_v0 () from /opt/at15.0/lib64/glibc-hwcaps/power9/libstdc++.so.6.0.29
#7 0x00007fff81b4e5e8 in _Unwind_Phase2 (context=0x7fffda4d5190, exception_object=0x30452ee0) at /root/.conan/data/libunwind/1.6.2/_/_/build/b68b207efa3d35074600f068c0f047030ce18960/src/src/unwind/unwind-internal.h:118
#8 _Unwind_Resume (exception_object=0x30452ee0) at /root/.conan/data/libunwind/1.6.2/_/_/build/b68b207efa3d35074600f068c0f047030ce18960/src/src/unwind/Resume.c:37
#9 0x00007fff8156e78c in faiss::IndexBinaryIVF::train (this=0x30344910, n=1000, x=0x30333db0 "%P`\022IN<<\017-\017\n\005.W!<\016GA\002\005aHT^\025") at /root/git/knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:323
#10 0x00007fff81538744 in knowhere::IvfIndexNode<faiss::IndexBinaryIVF>::Train (this=0x3042ec50, dataset=..., cfg=...) at /root/git/knowhere/src/index/ivf/ivf.cc:330
#11 0x00007fff81480600 in knowhere::IndexNode::Build (this=0x3042ec50, dataset=..., cfg=...) at /root/git/knowhere/include/knowhere/index_node.h:41
#12 0x00007fff8146a8a8 in knowhere::Index<knowhere::IndexNode>::Build (this=0x7fffda4d6c20, dataset=..., json=...) at /root/git/knowhere/src/common/index.cc:44
#13 0x000000001009863c in CATCH2_INTERNAL_TEST_26 () at /root/git/knowhere/tests/ut/test_mmap.cc:382
#14 0x00000000101baea4 in Catch::TestInvokerAsFunction::invoke (this=0x30320370) at src/catch2/internal/catch_test_case_registry_impl.cpp:149
#15 0x00000000101a9570 in Catch::TestCaseHandle::invoke (this=0x3033c070) at src/catch2/../catch2/catch_test_case_info.hpp:115
#16 0x00000000101a81e0 in Catch::RunContext::invokeActiveTestCase (this=0x7fffda4d75e0) at src/catch2/internal/catch_run_context.cpp:541
#17 0x00000000101a7e94 in Catch::RunContext::runCurrentTest (this=0x7fffda4d75e0, redirectedCout=..., redirectedCerr=...) at src/catch2/internal/catch_run_context.cpp:504
#18 0x00000000101a5fc4 in Catch::RunContext::runTest (this=0x7fffda4d75e0, testCase=...) at src/catch2/internal/catch_run_context.cpp:235
#19 0x0000000010124ad0 in Catch::(anonymous namespace)::TestGroup::execute (this=0x7fffda4d75d0) at src/catch2/catch_session.cpp:110
#20 0x0000000010126590 in Catch::Session::runInternal (this=0x7fffda4d7940) at src/catch2/catch_session.cpp:332
#21 0x0000000010125ef0 in Catch::Session::run (this=0x7fffda4d7940) at src/catch2/catch_session.cpp:263
#22 0x000000001011e66c in Catch::Session::run<char> (this=0x7fffda4d7940, argc=1, argv=0x7fffda4d7f08) at src/catch2/../catch2/catch_session.hpp:41
#23 0x000000001011e48c in main (argc=1, argv=0x7fffda4d7f08) at src/catch2/internal/catch_main.cpp:36
When I set a breakpoint on FAISS_THROW_MSG() in gdb and stepped through on both x86 and ppc64le, the Intel system handled the exception correctly (reaching https://github.com/zilliztech/knowhere/blob/v2.2.2/src/index/ivf/ivf.cc#L333), whereas Power raised SIGABRT (via libunwind / unwind-internal.h). Thanks! |
@mgiessing , yep, this is exactly what I see.
|
Nevertheless, it should not affect the knowhere correctness. I bet that it is related to using some wrong system libraries somewhere |
@mgiessing would you be able to try to compile using the most recent gcc or even clang-17 ? I see certain things on the internet related to libunwind issues |
@alexanderguzhva just rebased my PR Yeah, let me try to use newer gcc/clang |
@mgiessing meanwhile, I'm trying to rebuild a newer version of libunwind and use it. They have specific instructions for this https://github.com/libunwind/libunwind#building-for-powerpc64--linux |
@alexanderguzhva A few updates:
|
@mgiessing clang is doable, let me do that in qemu |
@mgiessing it takes forever in qemu to build it, so meanwhile you could try the following:
|
@alexanderguzhva I tried to build using clang17 as you indicated but still get errors related to boost:
I changed the conan profile to clang:
[settings]
os=Linux
os_build=Linux
arch=ppc64le
arch_build=ppc64le
compiler=clang
compiler.version=17
compiler.libcxx=libstdc++
build_type=Release
Also, I added the mentioned code parts [...]
version_cxx11_standard_json = self._min_compiler_version_default_cxx11
if version_cxx11_standard_json:
    if Version(self.settings.compiler.version) < version_cxx11_standard_json:
        self.options.without_fiber = True
        self.options.without_json = True
        self.options.without_nowide = True
        self.options.without_url = True
        self.options.without_wave = True
        self.options.without_locale = True
        self.options.without_math = True
        self.options.without_graph = True
else:
    self.options.without_fiber = True
    self.options.without_json = True
    self.options.without_nowide = True
    self.options.without_url = True
    self.options.without_wave = True
    self.options.without_wave = True
    self.options.without_locale = True
    self.options.without_math = True
    self.options.without_graph = True
[...] Any idea what's going wrong here? |
@mgiessing Yes,
Also, it seems that you may need to add something like |
That worked, the conan install succeeded - thanks! I'm not very experienced with clang, but the
Environment variables:
I tried to google for it, but this issue mostly seems to occur on macOS, not on Linux. |
@mgiessing for ubuntu the fix is |
I was able to install via rpmfind, however same error :/
$ rpm -qa | grep libomp
libomp-17.0.2-1.module_el8+721+8e6a0389.ppc64le
libomp-devel-17.0.2-1.module_el8+721+8e6a0389.ppc64le
I might give ubuntu a try tomorrow, however I assume there must be a way to make this run on rpm distros :) @alexanderguzhva You used ubuntu:20.04 or newer? |
@mgiessing both ubuntu 22.04 and 20.04 |
May I ask how you installed clang17 on Ubuntu on Power?
Option a) GitHub release
The GitHub releases are only built for RPM-based distros (RHEL): https://github.com/llvm/llvm-project/releases/tag/llvmorg-17.0.5 --> just powerpc64le RHEL8.8 (besides AIX)
Option b) Using llvm-toolchain
Going the official way using
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
./llvm.sh 17
doesn't work because there is no deb package for Power :/ See: // Ubuntu20.04 // Ubuntu22.04 |
@mgiessing as I'm running on qemu, which is very slow, I've decided to start from clang-15, which is available as a package. Otherwise, I would compile clang-17 from scratch, if needed. Please try clang-15, clang-14, or earlier versions; let's check whether the problem is the GCC compiler itself |
I've been able to build knowhere with (system) clang-10 on ubuntu:20.04 but faced the same error:
$ ./Release/tests/ut/knowhere_tests "Search binary mmap"
[...]
terminate called after throwing an instance of 'faiss::FaissException'
what(): Error in virtual void faiss::IndexBinaryIVF::train(faiss::IndexBinary::idx_t, const uint8_t *) at /knowhere/thirdparty/faiss/faiss/IndexBinaryIVF.cpp:301: IVF not to support Substructure and Superstructure.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
knowhere_tests is a Catch2 v3.3.1 host application.
Run with -? for options
-------------------------------------------------------------------------------
Search binary mmap
Test Search
-------------------------------------------------------------------------------
/knowhere/tests/ut/test_mmap.cc:372
...............................................................................
/knowhere/tests/ut/test_mmap.cc:376: FAILED:
{Unknown expression after the reported line}
due to a fatal error condition:
name := "BIN_IVF_FLAT"
cfg_json := "{"dim":8,"enable_mmap":true,"k":5,"metric_type":
"SUPERSTRUCTURE","nlist":16,"nprobe":8}"
SIGABRT - Abort (abnormal termination) signal
===============================================================================
test cases: 2 | 1 passed | 1 failed
assertions: 141 | 140 passed | 1 failed
Aborted (core dumped)
$ clang --version
clang version 10.0.0-4ubuntu1
Target: powerpc64le-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ env | grep clang
CXX=/usr/bin/clang++
CC=/usr/bin/clang
|
@mgiessing Well, then I'll need to check this case more carefully. At least, I think that you may use Milvus/Knowhere because it does not throw exceptions too often internally :) |
@alexanderguzhva Yeah, I am able to run milvus (v2.3.1) successfully and had no core dump so far :) |
@mgiessing I bet that it is not only libunwind. I've tried a simple standalone throw-catch program, which replicates what happens inside milvus, and I was unable to replicate the issue so far. |
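A minimal sketch of what such a standalone throw-catch check could look like, assuming the pattern seen in the backtrace above: a callee throws while a local object with a destructor is alive (so the unwinder has to run cleanup code and resume), and the caller converts the exception into a status value. `DemoException`, `train_like`, `build_like`, and `Status` are hypothetical stand-ins, not the actual faiss or Knowhere types:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's exception type.
struct DemoException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Throws while a local with a non-trivial destructor is alive, so the
// unwinder must run cleanup for `scratch` before reaching the handler.
void train_like(bool unsupported_metric) {
    std::vector<int> scratch(16, 0);
    if (unsupported_metric) {
        throw DemoException("unsupported metric type");
    }
    scratch[0] = 1;
}

// Hypothetical status codes returned instead of propagating exceptions.
enum class Status { success, inner_error };

// Catches the exception at the caller and reports it as a status.
Status build_like(bool unsupported_metric, std::string* msg) {
    try {
        train_like(unsupported_metric);
    } catch (const std::exception& e) {
        if (msg != nullptr) {
            *msg = e.what();
        }
        return Status::inner_error;
    }
    return Status::success;
}
```

On a toolchain with working unwinding, `build_like(true, &msg)` is expected to return `Status::inner_error` rather than terminate the process. As noted above, a standalone program like this did not reproduce the abort, so it is a baseline check rather than a reproducer.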
@mgiessing I am from the IBM Power porting team. If your sole purpose is to build Milvus, we have a port available here: I built knowhere (v2.2.2) as a part of milvus v2.3.3 on 22.04 Power, and ran the tests. Got this:
|
Hey @sumitd2, thanks for your comment - I appreciate your effort. From your code snippet I cannot see whether you were able to recreate that core dump (although the failed test case suggests so). Do you see any of these in your test?
[...]
SIGABRT - Abort (abnormal termination) signal
[...]
Aborted (core dumped)
Thank you! |
@mgiessing No, I did not see the core dump |
Also, can you please try libunwind/1.7.2 and "libunwind:shared": True in conanfile.py. I remember having seen the libunwind crash. You may also have to add gtest/1.14.0 |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with |
Hi, I want to build knowhere (as part of milvus) for ppc64le, and with minimal changes I'm able to build it successfully.
However, when I then run the tests, they fail with this SIGABRT message:
The only changes to the git tree are the following, to enable ppc64 builds:
I'm aware that there will be no SIMD acceleration and only scalar computation is used. I've also tested the exact same code on ubuntu:20.04-aarch64, and there the tests finish successfully.
Anyone know what could be the issue or how to properly debug this?
Thanks!
Information on system & build
OS: ubuntu:20.04
arch: ppc64le
gcc: 9.4.0 (ubuntu 20.04 build-essential default)
knowhere version: v2.2.1