Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KNOWN TRAP]: segfault with stl.h wrapper for std::map (and no optimization enabled) #4395

Closed
wants to merge 3 commits into from

Conversation

rhaschke
Copy link
Contributor

@rhaschke rhaschke commented Dec 11, 2022

Description

When using both, a Debug build and the stl.h wrapper for std::map, I ran into a surprising segfault.
I spent a whole day tracking that error down to pybind11 being compiled with no optimization options enabled.
I tried to come up with a minimal example reproducing exactly my issue - but I failed.

However, I was able to reproduce the issue in a different context. For now, this PR provides this minimal example.
What is really weird is that adding a new wrapper in test_stl.cpp triggers a bug in test_stl_binders.

Note, that I had to add additional gcc and clang build configs to CI (with optimizations disabled) to trigger the bug in Linux. On Windows, it shows up with the existing jobs already.

requires:
- a build with all optimizations disabled
- the definition of a function accepting std::map<std::string, double>
@rhaschke rhaschke requested a review from henryiii as a code owner December 11, 2022 04:41
... which fails essentially in my use case, but still succeeds here.
@rhaschke rhaschke marked this pull request as draft December 11, 2022 05:11
@rhaschke
Copy link
Contributor Author

The backtrace of the segfault reveals that the std::map passed to C++ is corrupted for some reason. As the behavior is very different across different compilers, I am suspecting a compiler bug. I'm not sure though. I wasn't able to come up with a minimal example just moving std::map around.

In the build with debug output (1f0677a), it is interesting to compare both clang C++14 builds:

@rhaschke
Copy link
Contributor Author

I will work around this issue for now, by always ensuring some optimization options.
However, I just noticed that all(?) gcc builds and most Windows builds fail on that issue too when the map is actually used.

@rwgk
Copy link
Collaborator

rwgk commented Dec 11, 2022

Hi Robert, you seem to have an ODR violation:

@rwgk
Copy link
Collaborator

rwgk commented Dec 11, 2022

I didn't realize this before: I'm afraid PR #4353 made it a lot more likely that ODR violations sneak in.
EDIT: See next comment.

@rwgk
Copy link
Collaborator

rwgk commented Dec 11, 2022

Actually, no, I take that back, after taking a fresh look at stl_bind.h ... and the error message in the #4396 PR description (showing the source locations).

@rwgk
Copy link
Collaborator

rwgk commented Dec 11, 2022

The CI for #4396 is nearly complete. I'll close the PR soon, but the CI results will expire only later:

@rhaschke
Copy link
Contributor Author

Thanks, Ralf for your extremely quick reply!

You seem to have an ODR violation.

Where does that come from? Isn't including pybind11/stl.h and defining a wrapper with a std::map argument a valid approach? I just checked other builds:

I also checked my original build, which builds on top of old pybind11 2.6.2: this doesn't show ODR warnings either - but fails as described. Maybe, that's still another issue though.

@rwgk
Copy link
Collaborator

rwgk commented Dec 11, 2022

Where does that come from? Isn't including pybind11/stl.h and defining a wrapper with a std::map argument a valid approach?

It depends on how you compile & link.

Generally, there are three lines of defense against ODR violations:

  • pybind11 namespace hidden
  • -fvisibility=hidden
  • RTLD_LOCAL (default under Linux, a given under Windows (no other option), I'm not sure about macOS)

But

if you link multiple Translation Units (critical keyword when it comes to ODR) into one extension, all those defenses do not exist, on all platforms. — Note that most pybind11 unit tests are linked together as pybind11_tests.so. (You can see that more clearly here (compared to the cmake files).)

Specific to type_caster ODR violations, the critical code to consider is:

template <typename type>
using make_caster = type_caster<intrinsic_t<type>>;

It needs to resolve to the exact same template instantiations in all TUs linked together. (There is zero wiggle room for this, as we learned the hard way.)

With this PR, the ODR arises from

  • test_stl.cpp — SourceLocation1="/usr/local/google/home/rwgk/forked/pybind11/include/pybind11/stl.h:161"
  • test_stl_binders.cpp
    SourceLocation2="/usr/local/google/home/rwgk/forked/pybind11/include/pybind11/detail/type_caster_base.h:913

I.e. the type_caster specializations work out differently in the two TUs, and when those are linked together you get ODR violations.

I think of that has the "worst kind of trap set by pybind11". It really is not obvious.

One way to not have to worry about it so much: Set up your system with "one TU per extension". But that's not always feasible, then you have to think twice, and if things get really big (e.g. Google linking most things statically), test with the type_caster_odr_guard enabled (smart_holder branch only, but you can swap it in for testing without changing anything else).

@rwgk rwgk changed the title [BUG]: segfault with stl.h wrapper for std::map (and no optimization enabled) [KNOWN TRAP]: segfault with stl.h wrapper for std::map (and no optimization enabled) Dec 11, 2022
@rhaschke
Copy link
Contributor Author

I'm fine to close this issue report. However, you might want to check the ODR warnings popping up in the current master branch: https://github.com/pybind/pybind11/actions/runs/3659994679/jobs/6186623695

@rwgk
Copy link
Collaborator

rwgk commented Dec 12, 2022

I'm fine to close this issue report. However, you might want to check the ODR warnings popping up in the current master branch: https://github.com/pybind/pybind11/actions/runs/3659994679/jobs/6186623695

Wow, thanks for pointing those out. Amazingly, I never noticed those before, even though I look at CI logs pretty regularly.

Continuing under PR #4398. I think it's a very different problem.

@rhaschke rhaschke deleted the debug-mode-segfault branch December 12, 2022 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants