Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variant_test fails on Fedora Rawhide #3749

Closed
junghans opened this issue Jun 5, 2020 · 47 comments
Closed

Variant_test fails on Fedora Rawhide #3749

junghans opened this issue Jun 5, 2020 · 47 comments

Comments

@junghans
Copy link
Member

junghans commented Jun 5, 2020

10/68 Test #10: Variant_test .....................***Failed    0.00 sec
Running 6 test cases...
unknown location(0): �[4;31;49mfatal error: in "pack_pair_test": std::out_of_range: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)�[0;39;49m
/builddir/build/BUILD/espresso/src/core/unit_tests/Variant_test.cpp(130): �[1;36;49mlast checkpoint�[0;39;49m
unknown location(0): �[4;31;49mfatal error: in "pack_map_test": std::out_of_range: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)�[0;39;49m
/builddir/build/BUILD/espresso/src/core/unit_tests/Variant_test.cpp(139): �[1;36;49mlast checkpoint�[0;39;49m
�[1;31;49m*** 2 failures are detected in the test module "Variant test"
�[0;39;49m

Build log here

@junghans
Copy link
Member Author

junghans commented Jun 5, 2020

Same on x86_64: build_x86_64.log.txt

@jngrad
Copy link
Member

jngrad commented Jun 5, 2020

Already fixed by #3725 😉

@junghans
Copy link
Member Author

junghans commented Jun 5, 2020

Ah and you even told me about it, let me test it quickly.

@junghans
Copy link
Member Author

junghans commented Jun 5, 2020

Hmm, now I am getting:

       Start   1: save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1
  1/146 Test   #1: save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 ......***Failed    1.03 sec
/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4143224 RUNNING AT buildvm-17.phx2.fedoraproject.org
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

for nearly all tests.

build_x86_64.log.txt

@junghans
Copy link
Member Author

junghans commented Jun 5, 2020

Reason I doing a rebuilt: https://bugzilla.redhat.com/show_bug.cgi?id=1843105

@jngrad
Copy link
Member

jngrad commented Jun 5, 2020

The F33FailsToInstall issue is quite unfortunate. There's a similar packaging issue with CentOS 8, when I tried to create a docker image, it failed due to missing python dependencies...

I've just created a Fedora 33 docker image and can reproduce the new error with OpenMPI. It's only happening for 2 or more MPI ranks. Here's my trace:

/usr/include/c++/10/bits/stl_vector.h:1123: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::front() [with _Tp = char; _Alloc = boost::mpi::allocator<char>; std::vector<_Tp, _Alloc>::reference = char&]: Assertion '__builtin_expect(!this->empty(), true)' failed.
[fc8380a4505c:08126] *** Process received signal ***
[fc8380a4505c:08126] Signal: Aborted (6)
[fc8380a4505c:08126] Signal code:  (-6)
[fc8380a4505c:08126] [ 0] /lib64/libc.so.6(+0x3cb70)[0x7efc20a8bb70]
[fc8380a4505c:08126] [ 1] /lib64/libc.so.6(gsignal+0x145)[0x7efc20a8bae5]
[fc8380a4505c:08126] [ 2] /lib64/libc.so.6(abort+0x116)[0x7efc20a74884]
[fc8380a4505c:08126] [ 3] /usr/lib64/openmpi/lib/libboost_mpi.so.1.73.0(boost::mpi::detail::packed_archive_send(boost::mpi::communicator const&, int, int, boost::mpi::packed_oarchive const&)+0xad)[0x7efc19b683bd]
[fc8380a4505c:08126] [ 4] /home/espresso/espresso/build/src/core/EspressoCore.so(+0x1dca7f)[0x7efc19d54a7f]
[fc8380a4505c:08126] [ 5] /home/espresso/espresso/build/src/core/EspressoCore.so(void Utils::Mpi::gather_buffer<ErrorHandling::RuntimeError>(std::vector<ErrorHandling::RuntimeError, std::allocator<ErrorHandling::RuntimeError> >&, boost::mpi::communicator, int)+0xa9)[0x7efc19d55089]
[fc8380a4505c:08126] [ 6] /home/espresso/espresso/build/src/core/EspressoCore.so(ErrorHandling::RuntimeErrorCollector::gatherSlave()+0x2e)[0x7efc19d50b1e]
[fc8380a4505c:08126] [ 7] /home/espresso/espresso/build/src/core/EspressoCore.so(mpi_loop()+0x110)[0x7efc19ca2220]
[fc8380a4505c:08126] [ 8] /home/espresso/espresso/build/src/python/espressomd/_init.so(+0xdcde)[0x7efc1a293cde]

The issue happens in boost::mpi::detail::packed_archive_send(). When I find the time next week, I'll recompile the boost::mpi library with debug info and attach GDB to both MPI ranks from within a docker container to look at the value of variables.

@jngrad jngrad added this to the Espresso 4.1.3 milestone Jun 5, 2020
@junghans
Copy link
Member Author

junghans commented Jun 8, 2020

Any news on this?

@fweik
Copy link
Contributor

fweik commented Jun 9, 2020

Isn't this again a instance of the well known problem where it tries to get a pointer by taking the address of the first element? (which is UB when die vector is empty). I think the eventual cause
is here: https://github.com/boostorg/mpi/blob/1c09f39948218d094a4fde9c09309580f33c5db5/include/boost/mpi/detail/packed_oprimitive.hpp#L41, not much we can do about this tho...

@fweik
Copy link
Contributor

fweik commented Jun 9, 2020

Seems like boost.mpi is still not using UB sanitizer in the CI...

@jngrad
Copy link
Member

jngrad commented Jun 9, 2020

Reproducible in the dev branch.

During initialization in espressomd/_init.pyx L37, mpi_loop() calls mpiCallbacks().loop(). This method iterates over a list of functors (MpiCallbacks.hpp:L595) and calls them with the current boost mpi communicator and a packed archive, in order to initialize mpi callbacks on slaves. This fails on a static callback, ErrorHandling::mpi_gather_runtime_errors_slave(). That's because we reduce a vector of runtime errors that is empty at initialization time. Since a std::vector of size 0 returns a null pointer from std::vector::data(), we end up providing a null pointer as the source destination of boost::mpi::communicator::irecv(). Perhaps such null pointers were ignored in the past. They are now checked and will trigger an assertion in /usr/include/c++/10/bits/stl_vector.h:1123. Not sure if this is an intended change from the boost::mpi library or a regression (see boostorg/mpi#81 for a related issue). This can be fixed from our side with a proper guard in src/utils/include/utils/mpi/gather_buffer.hpp, but this only reduces the number of failing tests to 16.

The same error is triggered in src/utils/include/utils/mpi/gatherv.hpp, e.g. when collecting particles by id: if one MPI rank has no particles matching the ids, it will communicate an empty std::vector. This is more difficult to diagnose, as it depends on the number of MPI ranks and the geometry of domain decomposition. Fixing the remaining 16 failing tests for 3 and 4 MPI ranks will not guarantee that we properly guarded all of our MPI infrastructure against empty vectors.

@fweik
Copy link
Contributor

fweik commented Jun 9, 2020

As you might guess I'm also not a huge fan of adding awkward stuff in our code to work around broken libraries...

@junghans
Copy link
Member Author

junghans commented Jun 9, 2020

Can you make a PR for boostorg/mpi? Then we can patch it in Fedora

@jngrad
Copy link
Member

jngrad commented Jun 9, 2020

I'm not a huge fan of the espresso MPI infrastructure :) It took me 5 hours to get this far and I still don't know if it's a regression in boost::mpi or if it's a side effect from gcc 10. I get the same failure with MPICH instead of OpenMPI.

@fweik
Copy link
Contributor

fweik commented Jun 9, 2020

Don't you agree that the line I pointed out is the root cause?

@fweik
Copy link
Contributor

fweik commented Jun 9, 2020

boostorg/mpi#81fixes a different instance of the same error.

@junghans
Copy link
Member Author

junghans commented Jun 9, 2020

@jwakely is there anything we can do to test boost-mpi on Fedora better?

@mkuron
Copy link
Member

mkuron commented Jun 9, 2020

grepping for [0] and front in Boost.MPI reveals several dozen lines potentially affected by the same bugs. Some places explicitly check for emptyness, some ensure non-emptyness by surrounding code, but for many the vector comes straight from a function argument and can thus be empty. All these places should probably be converted to calling c_data so that they benefit from the fix in boostorg/mpi#81. Here's a quick patch, entirely untested, that could eventually be submitted upstream and included in Fedora.

@jngrad
Copy link
Member

jngrad commented Jun 9, 2020

Following the GDB backtrace in the boost::mpi frames, we have a call to communicator::send(), which calls communicator::array_send_impl(), which calls detail::packed_archive_send(). In that function there's a call to packed_oarchive::address(), but taking the address of a dereferenced null pointer returns a null pointer (thanks to compiler optimizations). I still don't know how we get to std::vector::front(), though.

Fedora 33 currently has OpenMPI 4.0.4rc1 and boost::mpi 1.73.0. I've recompiled OpenMPI 4.0.4rc1, rc2, and rc3 from sources without UCX and boost::mpi 1.73.0 from sources and in all 3 cases couldn't reproduce the error.

@fweik
Copy link
Contributor

fweik commented Jun 9, 2020

@jwakely
Copy link

jwakely commented Jun 9, 2020

@jwakely is there anything we can do to boost-mpi on Fedora better?

@junghans is there a word missing from that question? "test" maybe?

If somebody has a fix for Boost.MPI we can certainly apply that to the Fedora boost packages.

@junghans
Copy link
Member Author

junghans commented Jun 9, 2020

@jwakely is there anything we can do to boost-mpi on Fedora better?

@junghans is there a word missing from that question? "test" maybe?

Yeah.

If somebody has a fix for Boost.MPI we can certainly apply that to the Fedora boost packages.

Well that is something, so we can at least unbreak Rawhide.

@mkuron
Copy link
Member

mkuron commented Jun 10, 2020

@jngrad, since you have already compiled the libraries from source, could you please test (and probably fix) https://github.com/espressomd/espresso/files/4754022/cdata.txt? That would get rid of the problem once and for all.

@jwakely
Copy link

jwakely commented Jun 10, 2020

@mkuron that very last change looks wrong. You've changed the function name as well as the body.

@fweik
Copy link
Contributor

fweik commented Jun 10, 2020

FYI, boostorg/mpi@28a73ea seems to pass CI. I'll open PR upstream... I made a PR (boostorg/mpi#119) upstream

@jngrad
Copy link
Member

jngrad commented Jun 10, 2020

Compiling boost with debug symbols

curl -sL https://dl.bintray.com/boostorg/release/1.73.0/source/boost_1_73_0.tar.bz2 | tar xj
cd boost_1_73_0
echo 'using mpi ;' > tools/build/src/user-config.jam
 ./bootstrap.sh --with-libraries=filesystem,mpi,serialization,test,system
./b2 -j $(nproc) install --prefix=/opt/boost variant=debug debug-symbols=on

won't generate ./stage/lib/cmake/boost_mpi-1.73.0/libboost_mpi-variant-shared.cmake (other submodules like test and serialization have one), which is looked for by CMake:

set(Boost_DEBUG ON)
find_package(Boost 1.65 REQUIRED mpi;serialization;filesystem;system;unit_test_framework)
-- Found Boost 1.73.0 at /opt/boost/lib/cmake/Boost-1.73.0
--   Requested configuration: QUIET REQUIRED COMPONENTS mpi;serialization;filesystem;system;unit_test_framework
-- BoostConfig: find_package(boost_headers 1.73.0 EXACT CONFIG REQUIRED QUIET HINTS /opt/boost/lib/cmake)
-- Found boost_headers 1.73.0 at /opt/boost/lib/cmake/boost_headers-1.73.0
-- BoostConfig: find_package(boost_mpi 1.73.0 EXACT CONFIG REQUIRED QUIET HINTS /opt/boost/lib/cmake)
-- Found boost_mpi 1.73.0 at /opt/boost/lib/cmake/boost_mpi-1.73.0
-- Boost toolset is gcc10 (GNU 10.1.1)
-- Scanning /opt/boost/lib/cmake/boost_mpi-1.73.0/libboost_mpi-variant*.cmake
--   Library has no variants and is considered not found
CMake Error at /opt/boost/lib/cmake/Boost-1.73.0/BoostConfig.cmake:141 (find_package):
  Found package configuration file:

    /opt/boost/lib/cmake/boost_mpi-1.73.0/boost_mpi-config.cmake

  but it set boost_mpi_FOUND to FALSE so package "boost_mpi" is considered to
  be NOT FOUND.  Reason given by package:

  No suitable build variant has been found.

Call Stack (most recent call first):
  /opt/boost/lib/cmake/Boost-1.73.0/BoostConfig.cmake:258 (boost_find_component)
  /usr/share/cmake/Modules/FindBoost.cmake:444 (find_package)
  CMakeLists.txt:265 (find_package)

-- Configuring incomplete, errors occurred!

@KaiSzuttor
Copy link
Member

the same build protocol works if not building DEBUG?

@jngrad
Copy link
Member

jngrad commented Jun 10, 2020

Without DEBUG I can compile and use boost, but there are no assertions, so I can't test if @fweik's PR actually solves the bug.

@jngrad
Copy link
Member

jngrad commented Jun 10, 2020

Took me two hours, but I finally got boost to compile in debug mode. The trick was to compile OpenMPI from sources first (instead of fetching openmpi-devel from the Fedora repo), then the boost build system generates a libboost_mpi-variant-shared.cmake. But I still can't reproduce the bug with that debug build of boost 1.73.0. I've tried again with

./b2 -j $(nproc) install --prefix=/opt/boost \
  variant=release threading=multi debug-symbols=on pch=off

to get closer to the build environment in the boost-openmpi section of the boost specfile, but it's still not reproducible.

@fweik
Copy link
Contributor

fweik commented Jun 10, 2020

I'd expect you need a debug version of the standard library? This is where the assert is triggered...

@jwakely
Copy link

jwakely commented Jun 10, 2020

No, there is no debug version. Just define _GLIBCXX_ASSERTIONS when compiling the code that does the out of bounds accesses. The assert is in templates in headers, not compiled into the standard library.

The entire fedora distro is built with that macro defined.

@fweik
Copy link
Contributor

fweik commented Jun 10, 2020

Ah you are right, I got it confused with _GLIBCXX_DEBUG which changes ABI...

@jngrad
Copy link
Member

jngrad commented Jun 10, 2020

Just define _GLIBCXX_ASSERTIONS when compiling the code that does the out of bounds accesses.

Thanks! I'm now able to reproduce the bug with boost compiled from sources. The b2 command is:

./b2 -j $(nproc) install --prefix=/opt/boost variant=release threading=multi \
  debug-symbols=on pch=off define=_GLIBCXX_ASSERTIONS

Applying the patch in boostorg/mpi#119 fixes the bug reported by @junghans for both openmpi-devel (OpenMPI 4.0.4rc1) and mpich-devel (MPICH 3.3.2).

@fweik
Copy link
Contributor

fweik commented Jun 10, 2020

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

@junghans
Copy link
Member Author

@jwakely can you patch boostorg/mpi#119 into rawhide?

@jwakely
Copy link

jwakely commented Jun 10, 2020

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

@fweik it can't hurt, and could be useful.

@junghans will do -- nice work, everybody.

@jngrad
Copy link
Member

jngrad commented Jun 10, 2020

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

Depends on the extra runtime. If it adds more than 5 min to all builds, we could be more selective, e.g. enabling it only on the fedora and centos images, as well as maxset (which uses the latest Ubuntu packages and isn't a dependency of other jobs). I guess it shouldn't increase runtime by much, because our codebase uses mostly Vector instead of std::vector, and we have bounds checking in Vector.

@jngrad
Copy link
Member

jngrad commented Jun 10, 2020

Running the python tests on maxset in a Fedora 33 image with _GLIBCXX_ASSERTIONS (de1d978) only increases the runtime by 2 seconds, so we could add this define in our CI once boost 1.73.1 becomes available in Ubuntu. The bug is reproducible in our Ubuntu 20.04 image with boost 1.71.0 and gcc 8.4 (Travis-CI logfile) but not on ICP machines with Ubuntu 18.04 and boost 1.65.1 and gcc 7.5. I'm surprised @junghans didn't run into this bug before, _GLIBCXX_ASSERTIONS seems to have been introduced in Fedora 28, and is in the STL since June 2016 (gcc-mirror/gcc@bd2ee798d).

@junghans
Copy link
Member Author

@junghans will do -- nice work, everybody.

@jwakely, let me know when and I will attempt an espresso rebuild after that.

@mkuron
Copy link
Member

mkuron commented Jun 11, 2020

Do you think we should turn on _GLIBCXX_ASSERTIONS in the CI on linux builds?

If we do that, we can‘t use the distribution-packaged Boost anymore, but need to build our own.

Don‘t we have Fedora in CI already? Why didn‘t it catch this problem?

@jwakely
Copy link

jwakely commented Jun 11, 2020

If we do that, we can‘t use the distribution-packaged Boost anymore, but need to build our own.

No, that's not true.

Defining _GLIBCXX_ASSERTIONS doesn't change ABI and can be turned on or off per-translation unit. You won't check the entire program if some objects are built without it, but it's still more checking than not turning it on at all.

@mkuron
Copy link
Member

mkuron commented Jun 11, 2020

Did that change recently? I seem to recall that it used to break ABI compatibility. Might have been before they separated _GLIBCXX_DEBUG and _GLIBCXX_ASSERTIONS.

@jwakely
Copy link

jwakely commented Jun 11, 2020

"They" is me. I added _GLIBCXX_ASSERTIONS, and the whole point is that it doesn't alter the ABI the way that _GLIBCXX_DEBUG does.

@jwakely
Copy link

jwakely commented Jun 11, 2020

Anyway, the change in boostorg/mpi#119 seems to have a problem, please see my comment.

@jwakely
Copy link

jwakely commented Jun 11, 2020

@junghans there's a new Boost in rawhide now.

@junghans
Copy link
Member Author

@junghans
Copy link
Member Author

On s390x:

      Start 66: all_gatherv_test
66/68 Test #66: all_gatherv_test .................***Exception: SegFault 36.10 sec
Running 2 test cases...
�[1;32;49m*** No errors detected
�[0;39;49m

@junghans
Copy link
Member Author

All other archs pass, so I will exclude s390x for now and have opened an issue for the s390x issue (#3753)

@jngrad jngrad closed this as completed Jun 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants