Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate duplicate TLS keys for loader_life_support stack #3275

Merged
merged 3 commits into from
Sep 20, 2021

Conversation

jbms
Copy link
Contributor

@jbms jbms commented Sep 16, 2021

This revises the existing fix for
#2765 in
#3237 to reduce the amount of
TLS storage used.

The shared TLS key is stored in two different ways, depending on
PYBIND11_INTERNALS_VERSION. If PYBIND11_INTERNALS_VERSION == 4 (as is currently set), the TLS key is stored in the
internal::shared_data map to avoid breaking ABI compatibility. If
PYBIND11_INTERNALS_VERSION > 4, the TLS key is stored directly in
the internals struct.

Description

Suggested changelog entry:

I think no changelog entry is needed since this just revises the other change and there haven't been any releases since then. But if one is desired:

Reduced thread-local storage required for keeping alive temporary data for type conversion to one key per ABI version, rather than one key per extension module.  This makes the total thread-local storage required by pybind11 2 keys per ABI version.

@jbms jbms force-pushed the loader-life-support-reduce-tls-usage branch from 9daaa95 to c19b27b Compare September 16, 2021 17:30
@jbms
Copy link
Contributor Author

jbms commented Sep 16, 2021

I still don't know why it is failing on PyPy3 though.

@rwgk
Copy link
Collaborator

rwgk commented Sep 16, 2021 via email

@rwgk
Copy link
Collaborator

rwgk commented Sep 16, 2021

@charlesbeattie for awareness.

@jbms
Copy link
Contributor Author

jbms commented Sep 16, 2021

Yes, I fixed the MSVC warning.

I normally force push to update pull requests as otherwise you end up with a bunch of junk commits when merged. However I see that github does not seem to keep track of a revision history for pull requests, which is a bit unfortunate.

@rwgk
Copy link
Collaborator

rwgk commented Sep 16, 2021

I normally force push to update pull requests as otherwise you end up with a bunch of junk commits when merged.

We always squash-merge. It's better to not force push because then the evolution of the PR is obvious to reviewers.
@henryiii is there a good place to recommend not force pushing? (If you also think that's better?)

@henryiii
Copy link
Collaborator

I like force pushing and do it a lot. But if there's a history, then yes, you can avoid force pushing and we 95% of the time squash and merge. If we don't, it's because we need multiple commits for some reason, such as for a moved-and-changed file.

We could add some recommendations to the CONTRIBUTING file and/or the pull request description, not sure if people are reading those, though. Lots of PRs delete the changelog section, for example, which you are told not to do.

@henryiii
Copy link
Collaborator

PS: pull request reviews and such stick around. I'd generally structure my commit history to tell a story, rather than just try to always have one commit (it will be squashed usually). That story does not have to be linear with time, though, and usually is not - which I why I do a lot of force pushing.

@henryiii
Copy link
Collaborator

You can also do --fixup <TARGET>, which keeps history, but provides a clear path to combining things later.

/// advantage of any functionality/efficiency improvements that depend on the
/// newer ABI.
#ifndef PYBIND11_INTERNALS_VERSION
#define PYBIND11_INTERNALS_VERSION 4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is manually increasing this "supported"? The text above seems to imply so, but there are no tests with this increased. I think we clearly need to indicate that this is a "use at your own risk" situation (since we might add more to the new ABI before officially increasing it), or add tests, or both. I like both. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a warning. As far as tests, I did test locally with both ABI versions. Adding the tests to CI would potentially double the number of test configurations, though, which may be undesirable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, which is one reason I'm a little hesitant about duplicating things here. I'd be fine with just 1-2 tests for this, though not ideal, still better than having a bunch of completely untested code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@henryiii asked: Is manually increasing this "supported"?

@jbms Is the plan to patch the internals version when we import into the Google codebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I assume for internal use at Google we would increase the version number. That could be done via a define in the BUILD file without having to patch the source code.

@henryiii
Copy link
Collaborator

This complicates the code quite a bit to pre-support this, and the next ABI breakage might be a full redesign of the internals structure. This needs to be done and this might be a good way to start adding these changes, but I'm worried about the extra complication of the code. Maybe it's manageable, but should be considered. There are several other planned ABI breakages as well, and I'd want/hope this would generalize to implementing those "early" too.

We need to think in terms of versions. We could do one release with the old ABI (2.8), then try a "small" breakage for 2.9, then look at an internals redesign for 3.0 (Jan 1+, also dropping Python 2). When will (optional) smart holders be ready, @rwgk? Also, when you turn on smart_holders, is it still ABI compatible?

@henryiii
Copy link
Collaborator

PyPy on macOS died with:

terminating with uncaught exception of type std::runtime_error: loader_life_support: internal error

@henryiii
Copy link
Collaborator

As I've said before, I'm not against ABI breakages if we need them. We could do a 2.8 very soon then break ABI.

@jbms jbms force-pushed the loader-life-support-reduce-tls-usage branch from c19b27b to b475928 Compare September 16, 2021 19:12
@jbms
Copy link
Contributor Author

jbms commented Sep 16, 2021

I added the "pre-implement" mechanism because

  • it provides a convenient way to keep track of future ABI-breaking changes that we would like to make. Otherwise, they are easy to forget about. Also, while implementing the original workaround, the relevant details are fresh in your mind --- if you have to go back to it later to implement the ABI-breaking version, you may forget an important detail.
  • for some uses maintaining ABI compatibility is not necessary, e.g. internally at Google we build everything from source and ensure only a single version of the library is used globally, therefore ABI compatibility is not an issue.

@jbms
Copy link
Contributor Author

jbms commented Sep 16, 2021

Regarding PyPy, yes i saw that error, which indicates that the TLS value does not match what was set in the constructor. That suggests some sort of TLS corruption, but I don't know why. I will have to look into it.

@rwgk
Copy link
Collaborator

rwgk commented Sep 16, 2021

Regarding PyPy, yes i saw that error, which indicates that the TLS value does not match what was set in the constructor. That suggests some sort of TLS corruption, but I don't know why. I will have to look into it.

We have many existing xfail for PyPy. I look at it as "generally not very stable". But of course this particular case could in theory unmask a real problem. Difficult to know.

@jbms
Copy link
Contributor Author

jbms commented Sep 16, 2021

ABI breakages do cause a fair bit of extra work for a package A that depends on py::cast<X> working for some type X defined by a dependency package B --- that means A is locked into using the same pybind11 ABI version as B (assuming both packages are obtained from PyPI). Furthermore, if A depends on an additional package C that also uses pybind11, now A, B, and C must all use the same pybind11 ABI version. As far as I know, PyPI does not provide a convenient way to release binary wheels with multiple pybind11 ABI versions for this sort of situation --- you would have to create an entirely separate package name to publish them under.

So I think ABI breakages should definitely not be done lightly, and they should be done as infrequently as possible.

@charlesbeattie
Copy link

Looks good to me - I am very happy with the technical aspects of this solution. Small nit: I'm not a fan of pre-implementing future changes behind a #define. I read the discussion above and though I agree with @jbms, an alternative would be to have the change sitting in a abi-breaking branch that can have it's own testing etc. I am happy for this change to go in as is and discuss an alternative ABI changing strategy at another time.

@jbms
Copy link
Contributor Author

jbms commented Sep 16, 2021

I'm not reproducing the PyPy3 failure locally with PyPy 7.3.5 (Python 3.7.10) on Debian. I see that CI is using PyPyv3.6 though.

There is different TLS logic for <3.7 and >=3.7.

I'll have to try building PyPy3.6 from source.

@henryiii
Copy link
Collaborator

If you can't get it working with PyPy3.6, I think we have a "latest version supported only" policy for PyPy, and PyPy dropped PyPy3.6 around 7.3.4 or so. I'd rather know why it's breaking and not break it if possible, but it doesn't have to be a blocker.

@jbms I agree with both points, which is why I was not extremely unhappy with the change, but we have limited resources, and this adds a burden to maintainers until the ABI breakage is applied to work with two branches of code, one of which is untested or less tested. We need at least 1-2 test runs with this define enabled (in our repo, not just Google integration tests, which do also help, though), otherwise I can promise this will break before we activate the next ABI version.

In conda-forge, there's an ABI management system, though many packages do not use the external pybind11 package there, so it doesn't help all packages. In PyPI, users will have to pin, etc. This is one reason I'd really like to see if we can sit down and design a portable (not compiler dependent) ABI that ideally is versioned (so new additions do not need to break compatibility usually). That's a potential target for pybind11 3 early next year. As it is, a lot is added to the ABI name mangling, so this will simply not allow the cast from a mismatched extension, rather than doing something bad.

But, yes, we've held off on bumping the ABI since 2.4 IIRC because it will be messy.

@henryiii
Copy link
Collaborator

@charlesbeattie Branches are likely to become stale very quickly; two planned changes will be moving code all over the place, the test restructuring and the optional pre-compile. Normally, yes, that would probably be the correct path to go.

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, based on @charlesbeattie's review.

@henryiii wrote:

We need at least 1-2 test runs with this define enabled (in our repo, not just Google integration tests, which do also help, though), otherwise I can promise this will break before we activate the next ABI version.

I agree that would be ideal, but I'd also be OK without testing the define via GHA, because Google-internally it will not just be integration testing, the #ifdefed code will actually be the production code. We'll keep it healthy for sure. I'd be extra diligent quickly integrating all changes from master (I usually do that very quickly anyway).

Copy link
Collaborator

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PyPy failure needs to be understood, and if PyPy3.6 can't be supported, we need to clearly note that, etc. Also, we need at least one test running in our CI with this enabled. We can't rely on Google Testing to catch things a day later and then revert PRs. I can add that job in CI if it helps. I'm hard-blocking until then.

@rwgk
Copy link
Collaborator

rwgk commented Sep 16, 2021 via email

@henryiii
Copy link
Collaborator

(I wanted to log my approval before heading out for a trip.)

No problem, just wanted to make sure this doesn't get merged before it's done, once it is, your review will help it go in. :)

@jbms jbms force-pushed the loader-life-support-reduce-tls-usage branch from b475928 to 978eab1 Compare September 17, 2021 01:16
@jbms
Copy link
Contributor Author

jbms commented Sep 17, 2021

I fixed the pypy failure --- turns out the pre 3.7 TLS API in pypy follows the CPython 2 semantics, not the CPython 3 semantics. This may have also caused problems with certain uses of gil_scoped_acquire / gil_scoped_release.

I also modified ci.yml to run test_call_policies, test_gil_scoped, and test_thread with the unstable ABI.

@jbms jbms force-pushed the loader-life-support-reduce-tls-usage branch from 0f6318a to d99fb6d Compare September 17, 2021 01:19
@jbms jbms requested a review from henryiii September 17, 2021 01:20
@jbms
Copy link
Contributor Author

jbms commented Sep 17, 2021

By the way, the TLS implementation in PyPy <3.7 is absolutely terrible (I think it was copied from python 2): there is a single global linked list of (thread_id, key, value) entries protected by a mutex.

@jbms jbms force-pushed the loader-life-support-reduce-tls-usage branch 3 times, most recently from 4af61af to fdeb4cb Compare September 17, 2021 06:14
This revises the existing fix for
pybind#2765 in
pybind#3237 to reduce the amount of
TLS storage used.

The shared TLS key is stored in two different ways, depending on
`PYBIND11_INTERNALS_VERSION`.  If `PYBIND11_INTERNALS_VERSION ==
4` (as is currently set), the TLS key is stored in the
`internal::shared_data` map to avoid breaking ABI compatibility.  If
`PYBIND11_INTERNALS_VERSION > 4`, the TLS key is stored directly in
the `internals` struct.
@jbms jbms force-pushed the loader-life-support-reduce-tls-usage branch from fdeb4cb to ed50e87 Compare September 17, 2021 14:57
@jbms
Copy link
Contributor Author

jbms commented Sep 17, 2021

All tests are passing now, except for a spurious failure downloading dependencies.

@rwgk
Copy link
Collaborator

rwgk commented Sep 19, 2021

FYI: I created scratch PR #3282 for porting this PR to the smart_holder branch and then used that to initiate Google-internal global testing. Results expected in ~8 hours from now.

I ran into a merge conflict in internals.h. I resolved it under PR #3282, but it needs to be broken out as a separate small smart_holder PR (later).

@rwgk
Copy link
Collaborator

rwgk commented Sep 20, 2021

Google-internal global testing was successful. I will merge this now, based on Charlie's and Henry's reviews, then do the required work to merge this into the smart_holder branch.

@rwgk rwgk merged commit 14976c8 into pybind:master Sep 20, 2021
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Sep 20, 2021
@rwgk
Copy link
Collaborator

rwgk commented Sep 20, 2021

@jbms: Could you please provide a suggested changelog entry (PR description)?

@rwgk
Copy link
Collaborator

rwgk commented Sep 20, 2021

For completeness: Before initiating the global testing, I manually ran asan, msan, tsan. Nothing changed: asan and msan are clean for all pybind11/tests (caveat: the embedding tests are not covered Google-internally). tsan still has an error for test_gil_scoped, as reported a while ago under #2754.

rwgk pushed a commit to rwgk/pybind11 that referenced this pull request Sep 20, 2021
rwgk pushed a commit that referenced this pull request Sep 20, 2021
@jbms
Copy link
Contributor Author

jbms commented Sep 20, 2021

@rwgk I updated the PR description to add a changelog entry.

@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Sep 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants