Segfault in gbl::GblTrajectory::prepare() #44188
Comments
A new Issue was created by @TomasKello. @rappoccio, @sextonkennedy, @makortel, @smuzaffar, @antoniovilela, @Dr15Jones can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
this is very reminiscent of #43801. |
It also fails in the latest IB from this morning. |
assign alca |
New categories assigned: alca @saumyaphor4252,@perrotta,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Could you run the job in (I'd also suggest using the "code block" formatting for the output, i.e. start and end the block with three backquotes) |
Here is report when running with
I have also put the relevant log files in the following area, |
Thanks, so ASAN crashes within ASAN code itself. That points towards a pretty bad memory corruption. I ran the job (thanks for the easy and quick reproducer!) in UBSAN, but it crashed as in the issue description without adding any information. (ok, it did report
but I think that is also reported in UBSAN IBs, so I didn't worry about it; even if undefined behavior is by definition undefined...) I'm checking with valgrind now. |
My valgrind is still running, but it is already showing things like
and (I'm guessing at this stage the memory has corrupted enough for anything to happen)
|
The |
|
With debug build of
|
Running the
Maybe the Eigen memory allocation strategy is still incorrect in some way? |
assign core |
New categories assigned: core @Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
I thought we had changed the build params for eigen so it wouldn't use the |
Could it be that gbl be compiled using different flags associated with Eigen than the rest of CMSSW? |
|
The call to
EIGEN_DEVICE_FUNC inline void aligned_free(void *ptr)
{
#if (EIGEN_DEFAULT_ALIGN_BYTES==0) || EIGEN_MALLOC_ALREADY_ALIGNED
EIGEN_USING_STD(free)
free(ptr);
#else
handmade_aligned_free(ptr);
#endif
} |
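For readers following the `aligned_free` snippet above, here is a minimal sketch of what a "handmade" aligned allocation scheme typically does, assuming the usual over-allocate-and-stash-the-original-pointer approach. All `sketch_*` names are hypothetical and this is not Eigen's actual code; it only illustrates why the two branches of the `#if` must not be mixed for the same pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Illustration only (hypothetical names, not Eigen's implementation).
// `alignment` is assumed to be a power of two and at least sizeof(void*).
void* sketch_handmade_aligned_malloc(std::size_t size, std::size_t alignment) {
  // Over-allocate: room for the alignment shift plus one saved pointer.
  void* original = std::malloc(size + alignment + sizeof(void*));
  if (original == nullptr) return nullptr;
  // Leave space for the saved pointer, then round up to the alignment.
  std::uintptr_t start = reinterpret_cast<std::uintptr_t>(original) + sizeof(void*);
  std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
  void* aligned = reinterpret_cast<void*>((start + mask) & ~mask);
  // Stash the pointer that malloc returned just in front of the aligned block.
  *(reinterpret_cast<void**>(aligned) - 1) = original;
  return aligned;
}

// Recover the pointer that malloc originally returned.
void* sketch_original_pointer(void* aligned) {
  return *(reinterpret_cast<void**>(aligned) - 1);
}

// The matching release: free the stashed pointer, not the aligned one.
void sketch_handmade_aligned_free(void* aligned) {
  if (aligned != nullptr) std::free(sketch_original_pointer(aligned));
}

bool sketch_is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

In this scheme the pointer handed to the caller is never the one `malloc` returned. If one translation unit allocated this way while another, compiled so that the first branch of the `#if` is taken, released the same pointer with plain `free`, the result would be undefined behavior.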
#ifndef EIGEN_MALLOC_ALREADY_ALIGNED
// ...
#if defined(__GLIBC__) && ((__GLIBC__>=2 && __GLIBC_MINOR__ >= 8) || __GLIBC__>2) \
&& defined(__LP64__) && ! defined( __SANITIZE_ADDRESS__ ) && (EIGEN_DEFAULT_ALIGN_BYTES == 16)
#define EIGEN_GLIBC_MALLOC_ALREADY_ALIGNED 1
#else
#define EIGEN_GLIBC_MALLOC_ALREADY_ALIGNED 0
#endif
// ...
#if (EIGEN_OS_MAC && (EIGEN_DEFAULT_ALIGN_BYTES == 16)) \
|| (EIGEN_OS_WIN64 && (EIGEN_DEFAULT_ALIGN_BYTES == 16)) \
|| EIGEN_GLIBC_MALLOC_ALREADY_ALIGNED \
|| EIGEN_FREEBSD_MALLOC_ALREADY_ALIGNED
#define EIGEN_MALLOC_ALREADY_ALIGNED 1
#else
#define EIGEN_MALLOC_ALREADY_ALIGNED 0
#endif
https://github.com/cms-externals/eigen-git-mirror/blob/cms/master/3bb6a48d8c171cf20b5f8e48bfb4e424fbd4f79e/Eigen/src/Core/util/Memory.h#L34-L39 |
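To make the macro logic quoted above easier to reason about, here is a rough sketch that restates it as `constexpr` functions. The function names and parameters are hypothetical, and the macOS, Win64 and FreeBSD branches are deliberately left out, so this covers only the glibc case discussed in this thread.

```cpp
#include <cassert>

// Mirrors EIGEN_GLIBC_MALLOC_ALREADY_ALIGNED from the quoted snippet
// (hypothetical helper; the inputs stand in for the preprocessor conditions).
constexpr bool sketch_glibc_malloc_already_aligned(int default_align_bytes,
                                                   bool glibc_at_least_2_8,
                                                   bool lp64,
                                                   bool address_sanitizer) {
  return glibc_at_least_2_8 && lp64 && !address_sanitizer && default_align_bytes == 16;
}

// Mirrors the condition in aligned_free(): true means plain free() is used,
// false means handmade_aligned_free() is used.
constexpr bool sketch_uses_plain_free(int default_align_bytes,
                                      bool glibc_at_least_2_8,
                                      bool lp64,
                                      bool address_sanitizer) {
  return default_align_bytes == 0 ||
         sketch_glibc_malloc_already_aligned(default_align_bytes, glibc_at_least_2_8,
                                             lp64, address_sanitizer);
}
```

Under these assumptions, a 64-bit glibc build with a default alignment of 16 bytes takes the plain `free` branch, while the same code with a default alignment of 32 bytes, or with `__SANITIZE_ADDRESS__` defined, takes the handmade branch.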
Playing with the definitions of the various Eigen macros the
The
#if EIGEN_IDEAL_MAX_ALIGN_BYTES > EIGEN_MAX_ALIGN_BYTES
#define EIGEN_DEFAULT_ALIGN_BYTES EIGEN_IDEAL_MAX_ALIGN_BYTES
#else
#define EIGEN_DEFAULT_ALIGN_BYTES EIGEN_MAX_ALIGN_BYTES
#endif
and
#if defined(EIGEN_DONT_VECTORIZE)
#if defined(EIGEN_GPUCC)
// GPU code is always vectorized and requires memory alignment for
// statically allocated buffers.
#define EIGEN_IDEAL_MAX_ALIGN_BYTES 16
#else
#define EIGEN_IDEAL_MAX_ALIGN_BYTES 0
#endif
#elif defined(__AVX512F__)
// 64 bytes static alignment is preferred only if really required
#define EIGEN_IDEAL_MAX_ALIGN_BYTES 64
#elif defined(__AVX__)
// 32 bytes static alignment is preferred only if really required
#define EIGEN_IDEAL_MAX_ALIGN_BYTES 32
#else
#define EIGEN_IDEAL_MAX_ALIGN_BYTES 16
#endif
Compiling (well, this just repeated #43801 (comment) in different words) |
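As a rough illustration of the two macro snippets above, the sketch below restates them as `constexpr` functions so the effect of the compiler flags can be checked by hand. The names are hypothetical, and `max_align_bytes` is simply taken as an input here rather than derived the way Eigen derives `EIGEN_MAX_ALIGN_BYTES`.

```cpp
#include <cassert>

// Mirrors the EIGEN_IDEAL_MAX_ALIGN_BYTES selection from the quoted snippet.
// Each bool stands in for "the corresponding macro is defined".
constexpr int sketch_ideal_max_align_bytes(bool dont_vectorize,
                                           bool gpu_compiler,
                                           bool avx512f,
                                           bool avx) {
  return dont_vectorize ? (gpu_compiler ? 16 : 0)
         : avx512f      ? 64
         : avx          ? 32
                        : 16;
}

// Mirrors EIGEN_DEFAULT_ALIGN_BYTES: the larger of the two values.
constexpr int sketch_default_align_bytes(int ideal_max_align_bytes, int max_align_bytes) {
  return ideal_max_align_bytes > max_align_bytes ? ideal_max_align_bytes : max_align_bytes;
}
```

With these assumptions, a translation unit built with AVX enabled ends up with a 32-byte default alignment, while one built without AVX (or AVX-512) ends up with 16 bytes. Two libraries built with different flags could therefore disagree on which allocation and deallocation branch Eigen's header-only code selects.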
Should we explicitly set |
From https://eigen.tuxfamily.org/dox/TopicPreprocessorDirectives.html
|
Looking at the build log of |
Dear core (@Dr15Jones, @makortel, @smuzaffar), is our understanding correct that the segfault can be fixed if the compilation instructions used in CMSSW are adjusted accordingly? (Thanks, @mmusich.) Could the priority of this be increased? At the moment this failure is holding back the TrkAl ReReco conditions and, therefore, the derivation of all the subsequent conditions that depend on them. Thanks, AlCaDB team (@perrotta, @saumyaphor4252, @consuegs) |
Currently it seems the issue is in how Eigen-using externals are being built (so not in CMSSW itself). It is also possible that this is not the only problem the job has. Above (#44188 (comment)) I noticed another (although lesser in practice) problem in
Is the problem or fix suggestion something AlCaDB and/or tracker alignment team could communicate to
This issue was already pretty high on the priority list, but it is good to know it really is a high-priority problem. The exact conditions of the problem are quite tricky and will therefore, unfortunately, take time to fully resolve. |
@makortel Thanks for the extensive tests and help in debugging. We iterated with the |
cms-sw/cmsdist#9053 contains the mentioned update (not sure it will compile fine, though; an earlier update was failing checks) |
@makortel reporting some tests that I did, I built the relevant external packages with settings as in,
|
Thanks @sroychow. I tested with cms-sw/cmsdist#9043 (comment) and the (I'm still running valgrind to see if there would be anything else hiding) |
Thanks @makortel so what we will need are the following,
|
cms-sw/cmsdist#9043 needs to be reworked (that @smuzaffar promised to do on Monday). Merging schedule of cms-sw/cmsdist#9053 + #44340 should be discussed further in cms-sw/cmsdist#9053 and is up to @cms-sw/externals-l2 and @cms-sw/orp-l2. As far as I'm concerned, if the latest round of tests pass, they would be good to go. |
In the meantime my valgrind job finished, and did not reveal anything new. |
Great, it looks like we are converging on a solution. Thanks for all your work. We should then carry on with the discussion in cms-sw/cmsdist#9053 |
For 14.1.X:
Can anyone please try to re-run the test in the latest 14.1.X IB?
For 14.0.X:
|
@smuzaffar I ran the test now with |
+core |
+alca |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
[Find reproducible example below]
Stack trace from CMSSW_14_0_0 caused by gbl::GblTrajectory::prepare(), executed on EL8 (el8_amd64_gcc12):
######### TO REPRODUCE #########
cd /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/MP/MPproduction/CMSSW_14_0_0
cmsenv
cd /afs/cern.ch/work/s/sroychow/public/TkDPG/segfaultmille/scripts
cmsRun mille_failing.py
FYI: @henriettepetersen @sroychow