Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[smart_holder] type_caster ODR guard #4022

Merged
merged 85 commits into from
Jul 21, 2022
Merged

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Jun 21, 2022

Description

In large-scale environments (e.g. Google) type_caster ODR violations are probably the biggest pybind11 pitfall. Without this PR, it is very difficult to ensure that a large application consistently uses the same type_caster<T> definition for a given T in all translation units linked together with mutual linker-level visibility. Accidents are to be expected and are very difficult to prevent & debug at scale, for example a mix of translation units that use PYBIND11_MAKE_OPAQUE + pybind11/stl_bind.h, and others that use pybind11/stl.h. Another type of accident is that custom type_caster includes are simply forgotten in some translation units, but by chance the linker picks all the right type_caster fragments (from other translation units), until one day it suddenly does not, because of some unrelated change in the environment (e.g. new compiler version, a system library update), resulting in seemingly inexplicable failures. Specific to the smart_holder branch, there is also the problem that the PYBIND11_SMART_HOLDER_TYPE_CASTERS macros may accidentally be missing in some translation units. In many cases runtime errors may give useful clues, but in general, the behavior is undefined, which is not acceptable in large-scale production environments.

Runtime detection of ODR violations is very elusive, especially when involving templates. Unfortunately, currently the ASAN detect_odr_violation option is known to not catch type_caster ODR violations. The type_caster ODR guard added with this PR is currently the only tool available, although it employs ODR itself to shield type_caster::name.sloc (source location) information from the linker mechanisms that discard all but one of multiple type_caster definitions. See the WARNINGs in pybind11/detail/descr.h and pybind11/detail/type_caster_odr_guard.h for details. The current implementation is known to work on almost all platforms that meet the compiler requirements (C++17, C++20), except one (gcc 9.4.0 Debug build). See comments below for details.

Note that the new test_type_caster_odr_guard_1, test_type_caster_odr_guard_2 pair deliberately violates the ODR (see C++ source code comments), but in a way that is evidently compatible with reliable unit testing, very similar to how the ODR in the guard itself is evidently of a benign kind.

In the pybind11 unit tests, the ODR guard is tested in as many environments as possible, but because of the ODR in the guard itself, it is NOT recommended to turn on the ODR guard in regular builds, production, or debug. The guard is meant to be used similar to a sanitizer, to check for type_caster ODR violations in binaries that are otherwise already fully tested and assumed to be healthy. To activate the ODR guard (outside pybind11/tests), define PYBIND11_ENABLE_TYPE_CASTER_ODR_GUARD_IF_AVAILABLE or PYBIND11_ENABLE_TYPE_CASTER_ODR_GUARD. See the long comment & code near the top of pybind11/detail/descr.h for details.

Suggested changelog entry:

@rwgk
Copy link
Collaborator Author

rwgk commented Jun 26, 2022

This PR was Google-globally tested in a couple configurations: HEAD, HEAD + change that motivated this PR. I found 6 missing includes for HEAD, and no additional issues for HEAD + change.

I did not find stl.h / stl_bind.h mixes in the wild, therefore I added a test here to be sure that that situation is actually covered for by the ODR guard (b98c9c1).

Attempting to explain the observations:

  • "Type A" ODR violations: conflicting type_caster specializations for a given UserType (e.g. the stl.h / stl_bind.h mix). — Apparently these shake out quickly because the casters behave differently, which is easily discovered via unit tests. To inexperienced pybind11 users this may come as a surprise, and it may take a significant amount of time to debug & discover that a certain include needs to be added, and possibly PYBIND11_MAKE_OPAQUE(), but apparently people are able to figure it out.

  • "Type B" ODR violations: users simply forget or do not know that they need to include a certain header for a custom type_caster, but another TU (translation unit) has the include. The linker will see the conflicting definitions: type_caster_base for the TU with the missing include, the custom type_caster for the other TUs. — Apparently sometimes the custom type_caster has a wining streak when the linker decides which definitions (!) to pick. In the general case this is more dicy than it may seem at first, because the linker may pick some pieces from one type_caster and some pieces from the other, it is undefined. Note that this can actually be observed by inspecting the GitHub Actions logs for this PR: look for the UNEXPECTED: messages in test_type_caster_odr_guard_1.py & test_type_caster_odr_guard_2.py (see below). What makes this even more problematic is that unrelated perturbations in the environment (e.g. small changes in the compiler, linker, system libraries) can change the behavior, and tests that used to work start to break, apparently out of the blue. Or worse, production systems suddenly misbehave, with no useful clue as to why.

UNEXPECTED: messages for CI #4984 testing b98c9c1:

$ grep UNEXPECTED: *.txt | cut -d']' -f2- | cut -dt -f2- | cut -d/ -f2- | cut -d_ -f2- | sort | uniq -c
      2 type_caster_odr_guard_1.py:46: UNEXPECTED: type_caster_odr_violation_detected_count() == 0 (9.4.0 C++17)
     17 type_caster_odr_guard_2.py:10: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).
      4 type_caster_odr_guard_2.py:11: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).
     12 type_caster_odr_guard_2.py:19: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
     11 type_caster_odr_guard_2.py:20: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
      1 type_caster_odr_guard_2.py:21: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
     24 type_caster_odr_guard_2.py:9: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).

Ralf W. Grosse-Kunstleve added 28 commits June 28, 2022 00:20
…TER_SOURCE_FILE_LINE, baked into PYBIND11_TYPE_CASTER macro.
…esolves "unused" warning when compiling test_custom_type_casters.cpp
…ails to link. Trying GitHub Actions anyway to see if there are any platforms that support https://en.cppreference.com/w/cpp/language/tu_local before C++20. Note that Debian clang 13 C++17 works locally.
@rwgk
Copy link
Collaborator Author

rwgk commented Jul 20, 2022

After commit 0c27340 we have almost all platforms back, except these two:

1______3.6_____MSVC_2019_____x86.txt:2022-07-20T18:00:19.7655237Z   C++ Info: MSVC 192930146 C++14 __pybind11_internals_v4_msvc_debug_sh_def__
2______3.7_____MSVC_2019_____x86.txt:2022-07-20T17:58:35.9336625Z   C++ Info: MSVC 192930146 C++14 __pybind11_internals_v4_msvc_debug_sh_def__

I think I'll leave the PYBIND11_ENABLE_TYPE_CASTER_ODR_GUARD determination as it is now though, because it is a little simpler than before, and the MSVC 2019 C++14 mode is not very interesting.

Second-level check:

$ grep UNEXPECTED: *.txt | cut -d']' -f2- | cut -dt -f2- | cut -d/ -f2- | cut -d_ -f2- | sort | uniq -c
      3 type_caster_odr_guard_1.py:42: UNEXPECTED: type_caster_odr_violation_detected_count() == 0 (9.4.0 C++17)
     17 type_caster_odr_guard_2.py:10: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).
      4 type_caster_odr_guard_2.py:11: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).
     14 type_caster_odr_guard_2.py:19: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
     11 type_caster_odr_guard_2.py:20: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
      1 type_caster_odr_guard_2.py:21: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
     28 type_caster_odr_guard_2.py:9: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).

@rwgk
Copy link
Collaborator Author

rwgk commented Jul 20, 2022

Fresh quality check:

$ grep 'ODR VIOLATION DETECTED' *.txt | cut -d: -f1 | uniq | sort -n
1______3.6_____ubuntu-latest_____x64_-DPYBIND11_FINDPYTHON=ON_-DCM.txt
1______3.8_____MSVC_2019__Debug______x86_-DCMAKE_CXX_STANDARD=17.txt
1______3.9_____MSVC_2022_C++20_____x64.txt
1______3_____windows-latest_____mingw64.txt
2______3.9_____MSVC_2019__Debug______x86_-DCMAKE_CXX_STANDARD=20.txt
2______3.9_____ubuntu-latest_____x64.txt
2______3_____windows-latest_____mingw32.txt
3______3.10_____ubuntu-latest_____x64.txt
3______3.8_____MSVC_2019_____x86_-DCMAKE_CXX_STANDARD=17.txt
3______3_____GCC_10_____C++20____x64.txt
4______3.11-dev_____ubuntu-latest_____x64.txt
4______3.9_____MSVC_2019_____x86_-DCMAKE_CXX_STANDARD=20.txt
5______pypy-3.7_____ubuntu-latest_____x64.txt
6______pypy-3.8_____ubuntu-latest_____x64_-DPYBIND11_FINDPYTHON=ON.txt
7______pypy-3.9_____ubuntu-latest_____x64.txt
8______3.6_____windows-2022_____x64.txt
8______3_____Clang_10_____C++20_____x64.txt
9______3.9_____windows-2022_____x64.txt
9______3_____Clang_10_____C++17_____x64.txt
10______3.10_____windows-2022_____x64.txt
11______3.11-dev_____windows-2022_____x64.txt
12______pypy-3.7_____windows-2022_____x64.txt
13______pypy-3.8_____windows-2022_____x64.txt
14______pypy-3.9_____windows-2022_____x64.txt
15______3.6_____macos-latest_____x64.txt
16______3.9_____macos-latest_____x64.txt
17______3.10_____macos-latest_____x64.txt
18______3.11-dev_____macos-latest_____x64.txt
19______pypy-3.7_____macos-latest_____x64.txt
20______pypy-3.8_____macos-latest_____x64.txt
21______pypy-3.9_____macos-latest_____x64.txt
22______3.6_____windows-2019_____x64_-DPYBIND11_FINDPYTHON=ON.txt
23______3.9_____windows-2019_____x64.txt
$ grep UNEXPECTED: *.txt | cut -d']' -f2- | cut -dt -f2- | cut -d/ -f2- | cut -d_ -f2- | sort | uniq -c
     17 type_caster_odr_guard_2.py:10: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).
      4 type_caster_odr_guard_2.py:11: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).
     14 type_caster_odr_guard_2.py:19: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
     11 type_caster_odr_guard_2.py:20: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
      1 type_caster_odr_guard_2.py:21: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (from_python).
     28 type_caster_odr_guard_2.py:9: UNEXPECTED: test_type_caster_odr_guard_2.cpp prevailed (to_python).
$ grep NO_INLINE *.txt | cut -d']' -f2- | cut -dt -f2- | cut -d/ -f2- | cut -d_ -f2- | sort | uniq -c
      3 type_caster_odr_guard_1.py:40: type_caster_odr_violation_detected_count() == 0: 9.4.0, C++17, __NO_INLINE__

@laramiel
Copy link
Contributor

Looks good.

Copy link
Contributor

@laramiel laramiel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. I like that the core of this is relatively compact.

@rwgk
Copy link
Collaborator Author

rwgk commented Jul 21, 2022

Good. I like that the core of this is relatively compact.

Yes, without the change in descr.h and the new unit tests this would be a pretty small PR.

In case someone looks here in the future, regarding the change in descr.h, I had a choice:

  • Leave descr.h as-is and introduce a macro to be added to all type_casters, inside pybind11, and all user code. This was actually the first version of this PR (see commit history: before and including 38c2565), even globally tested Google-internally.

  • Change descr.h so that the ODR guard works for all type_casters out of the box. That's a bigger change to pybind11, but obviously much easier to deploy.

Speculating:

Hopefully one day the ASAN detect_odr_violation option will detect the ODR here, ideally

  • in the unit tests (test_type_caster_odr_guard_?.cpp),
  • and the ODR guard itself.

When that happens, maybe we can delete everything that was added with this PR, except src_loc, as an easy target for ASAN to alert on (mismatch in constexpr when comparing multiple definitions for a given type_caster<T>).

Crossing into the realm of wishful thinking, even better of course:

  • Change the pybind11 type_caster system to be less of a setup for accidentally creating ODR violations.
  • Revert this PR entirely.

It is very uncertain though how much work that will be, and how disruptive it will be to the API.

@rwgk rwgk marked this pull request as ready for review July 21, 2022 13:38
@rwgk rwgk requested a review from henryiii as a code owner July 21, 2022 13:38
@rwgk rwgk merged commit c557f9a into pybind:smart_holder Jul 21, 2022
@rwgk rwgk deleted the odr_guard_sh branch July 21, 2022 13:38
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Jul 21, 2022
@rwgk rwgk removed the needs changelog Possibly needs a changelog entry label Jul 21, 2022
rwgk pushed a commit to rwgk/pybind11 that referenced this pull request Mar 24, 2023
rwgk pushed a commit that referenced this pull request Mar 24, 2023
* use C++17 syntax to get rid of recursive template instantiations for concatenating type signatures (#4587)

* Apply descr.h `src_loc` change (smart_holder PR #4022) to code added with master PR #4587

* Add test_class_sh_property_non_owning to CMakeLists.txt (fixes oversight in PR #4586)

* Resolve clang-tidy errors.

* clang-tidy auto fix

---------

Co-authored-by: Konstantin Bespalov <[email protected]>
rwgk pushed a commit to rwgk/stuff that referenced this pull request Apr 29, 2023
…_guard() constructor accessing translation_unit_local.value
cielavenir pushed a commit to cielavenir/pybind11 that referenced this pull request Jun 12, 2023
… namespace detail). (pybind#4049)

Very minor refactoring to ease development and debugging.

Having to declare a local `std::string` has bugged me many times. Nice to get this little nuisance out of the way.

Extracted from PR pybind#4022, where it is used like this:

```
    std::fprintf(stdout,
                 "\nTYPE_CASTER_ODR_GUARD_IMPL %s %s\n",
                 clean_type_id(intrinsic_type_info.name()).c_str(),
                 source_file_line_from_sloc.c_str());
```
rwgk pushed a commit to rwgk/pybind11 that referenced this pull request Jul 20, 2024
…ive in the pybind11k repo.)

This rolls back pybind#4022 (including follow-on tweaks in other PRs).
rwgk pushed a commit that referenced this pull request Jul 20, 2024
…ive in the pybind11k repo.) (#5255)

This rolls back #4022 (including follow-on tweaks in other PRs).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants