Variant_test fails on Fedora Rawhide #3749
Comments
Same on x86_64: build_x86_64.log.txt |
Already fixed by #3725 😉 |
Ah and you even told me about it, let me test it quickly. |
Hmm, now I am getting:
for nearly all tests. |
Reason I'm doing a rebuild: https://bugzilla.redhat.com/show_bug.cgi?id=1843105 |
I've just created a Fedora 33 Docker image and can reproduce the new error with OpenMPI. It only happens with 2 or more MPI ranks. Here's my trace:
The issue happens in |
Any news on this? |
Isn't this again an instance of the well-known problem where the code tries to get a pointer by taking the address of the first element (which is UB when the vector is empty)? I think the eventual cause |
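For readers unfamiliar with the pattern described above, here is a minimal sketch of it. The helper names `buffer_address` and `first_byte` are hypothetical and are not taken from boost::mpi or espresso; they only illustrate why taking the address of the first element of an empty vector is a problem, and one possible way to avoid it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of boost::mpi): returns a pointer that can be
// handed to a C API together with buf.size().
// Writing &buf.front() or &buf[0] here would be undefined behaviour for an
// empty vector; buf.data() is always valid to call, and we map the empty case
// to nullptr to make the intent explicit.
inline const char *buffer_address(const std::vector<char> &buf) {
  return buf.empty() ? nullptr : buf.data();
}

// Hypothetical helper: element access that documents its precondition.
// front() on an empty vector is undefined behaviour, so we assert first.
inline char first_byte(const std::vector<char> &buf) {
  assert(!buf.empty() && "front() requires a non-empty vector");
  return buf.front();
}
```

With a non-empty vector both helpers behave like the unchecked code; with an empty vector `buffer_address` returns a null pointer instead of invoking undefined behaviour.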
Seems like boost.mpi is still not using UB sanitizer in the CI... |
Reproducible in the dev branch. During initialization in The same error is triggered in |
As you might guess I'm also not a huge fan of adding awkward stuff in our code to work around broken libraries... |
Can you make a PR for boostorg/mpi? Then we can patch it in Fedora |
I'm not a huge fan of the espresso MPI infrastructure :) It took me 5 hours to get this far and I still don't know if it's a regression in boost::mpi or if it's a side effect from gcc 10. I get the same failure with MPICH instead of OpenMPI. |
Don't you agree that the line I pointed out is the root cause? |
boostorg/mpi#81 fixes a different instance of the same error. |
@jwakely is there anything we can do to test boost-mpi on Fedora better? |
grepping for |
Following the GDB backtrace in the Fedora 33 currently has OpenMPI 4.0.4rc1 and boost::mpi 1.73.0. I've recompiled OpenMPI 4.0.4rc1, rc2, and rc3 from sources without UCX and boost::mpi 1.73.0 from sources and in all 3 cases couldn't reproduce the error. |
The front comes from the binary version of oprimitive, same story: https://github.com/boostorg/mpi/blob/1c09f39948218d094a4fde9c09309580f33c5db5/include/boost/mpi/detail/binary_buffer_oprimitive.hpp#L43 |
@jngrad, since you have already compiled the libraries from source, could you please test (and probably fix) https://github.com/espressomd/espresso/files/4754022/cdata.txt? That would get rid of the problem once and for all. |
@mkuron that very last change looks wrong. You've changed the function name as well as the body. |
FYI, boostorg/mpi@28a73ea seems to pass CI. |
Compiling boost with debug symbols
curl -sL https://dl.bintray.com/boostorg/release/1.73.0/source/boost_1_73_0.tar.bz2 | tar xj
cd boost_1_73_0
echo 'using mpi ;' > tools/build/src/user-config.jam
./bootstrap.sh --with-libraries=filesystem,mpi,serialization,test,system
./b2 -j $(nproc) install --prefix=/opt/boost variant=debug debug-symbols=on
won't generate
set(Boost_DEBUG ON)
find_package(Boost 1.65 REQUIRED mpi;serialization;filesystem;system;unit_test_framework)
|
the same build protocol works if not building |
Without DEBUG I can compile and use boost, but there are no assertions, so I can't test if @fweik's PR actually solves the bug. |
Took me two hours, but I finally got boost to compile in debug mode. The trick was to compile OpenMPI from sources first (instead of fetching
./b2 -j $(nproc) install --prefix=/opt/boost \
    variant=release threading=multi debug-symbols=on pch=off
to get closer to the build environment in the boost-openmpi section of the boost specfile, but it's still not reproducible. |
I'd expect you need a debug version of the standard library? This is where the assert is triggered... |
No, there is no debug version. Just define The entire fedora distro is built with that macro defined. |
Ah you are right, I got it confused with |
Thanks! I'm now able to reproduce the bug with boost compiled from sources. The b2 command is:
./b2 -j $(nproc) install --prefix=/opt/boost variant=release threading=multi \
    debug-symbols=on pch=off define=_GLIBCXX_ASSERTIONS
Applying the patch in boostorg/mpi#119 fixes the bug reported by @junghans for both |
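As a rough illustration of what `define=_GLIBCXX_ASSERTIONS` changes: with libstdc++, the macro turns on precondition checks in the standard containers, so a call such as `front()` on an empty vector aborts instead of silently continuing. The sketch below only detects the macro and shows a guarded access; `glibcxx_assertions_enabled` and `front_or` are hypothetical names introduced for this example, not part of any library.

```cpp
#include <vector>

// Reports whether this translation unit was compiled with
// -D_GLIBCXX_ASSERTIONS (only meaningful when building against libstdc++).
constexpr bool glibcxx_assertions_enabled() {
#ifdef _GLIBCXX_ASSERTIONS
  return true;
#else
  return false;
#endif
}

// Hypothetical helper: reads the first element without ever calling front()
// on an empty vector, so it behaves identically with and without the macro.
inline int front_or(const std::vector<int> &v, int fallback) {
  return v.empty() ? fallback : v.front();
}
```

Code written in the style of `front_or` never trips the assertion, which is why enabling the macro is a cheap way to find the call sites that are not guarded.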
Do you think we should turn on |
@jwakely can you patch boostorg/mpi#119 into rawhide? |
Depends on the extra runtime. If it adds more than 5 min to all builds, we could be more selective, e.g. enabling it only on the fedora and centos images, as well as maxset (which uses the latest Ubuntu packages and isn't a dependency of other jobs). I guess it shouldn't increase runtime by much, because our codebase uses mostly |
Running the python tests on maxset in a Fedora 33 image with |
If we do that, we can't use the distribution-packaged Boost anymore, but need to build our own. Don't we have Fedora in CI already? Why didn't it catch this problem? |
No, that's not true. Defining |
Did that change recently? I seem to recall that it used to break ABI compatibility. Might have been before they separated _GLIBCXX_DEBUG and _GLIBCXX_ASSERTIONS. |
"They" is me. I added |
Anyway, the change in boostorg/mpi#119 seems to have a problem, please see my comment. |
@junghans there's a new Boost in rawhide now. |
On s390x:
|
All other archs pass, so I will exclude s390x for now and have opened a separate issue for the s390x failure (#3753)
Build log here