Ensure 'pip wheel' can create .so artifacts deterministically #6505
Comments
This string is embedded in the debug information:
It looks like this is a common issue:
It may also be related to pypa/wheel#248. Using a deterministic build directory name would leave us open to denial-of-service and/or concurrency-related problems. Using e.g. the GCC flag
Thanks for your input Christopher.
The latter makes sense to me, but how exactly is DOS at play if you use a consistent build dir?
For my use case, the input of the Bazel /
An unprivileged user on the same host can create the build directory and set its permissions to 700, which prevents you from building.
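To make the denial-of-service concern concrete, here is a minimal sketch, assuming a shared temporary directory. The function names and the `pip-wheel-build` directory name are hypothetical and are not part of pip; the point is only that a fixed name can be claimed by someone else first, while `tempfile.mkdtemp` picks a name that cannot be guessed in advance.

```python
import os
import tempfile


def claim_fixed_build_dir(path):
    """Try to create a build directory at a fixed, predictable path.

    Returns False if the path already exists, for example because another
    user on the host created it first.
    """
    try:
        os.mkdir(path, mode=0o700)
        return True
    except FileExistsError:
        return False


def claim_random_build_dir(parent):
    """Create a private directory with an unpredictable name under parent."""
    return tempfile.mkdtemp(prefix="pip-wheel-", dir=parent)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_tmp:
        fixed = os.path.join(shared_tmp, "pip-wheel-build")  # hypothetical name
        os.mkdir(fixed, mode=0o700)  # stands in for the other user's directory
        print("fixed name available:", claim_fixed_build_dir(fixed))
        print("random dir created:", os.path.isdir(claim_random_build_dir(shared_tmp)))
```

In a container or sandbox with a private temporary directory, nobody else can claim the fixed name, so this concern largely goes away.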
In a simple test, I was able to get consistent builds by exporting
Also, make more reproducible builds. By default, pip injects a symbol with its build directory into compiled files. See pypa/pip#6505. This can be avoided by preventing debug symbols by adding `CFLAGS=-g0`. Additionally, the wheels contain a few files with the current time stamp rather than the time given by `SOURCE_DATE_EPOCH`, so the perl tool `strip-nodeterminism` is used to redate the files within the wheels to the `SOURCE_DATE_EPOCH`.
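The two steps in that commit message can be sketched in Python. This is an illustration under assumptions, not the referenced project's actual tooling: `build_env` and `redate_wheel` are hypothetical helper names, and `redate_wheel` only rewrites zip entry timestamps with the standard `zipfile` module, which is a small subset of what `strip-nondeterminism` does.

```python
import time
import zipfile

# 1980-01-01T00:00:00Z: the zip format cannot store earlier timestamps.
ZIP_MIN_EPOCH = 315532800


def build_env(base_env, source_date_epoch):
    """Return a copy of base_env with debug info disabled and the epoch set."""
    env = dict(base_env)
    env["CFLAGS"] = (env.get("CFLAGS", "") + " -g0").strip()
    env["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
    return env


def redate_wheel(src, dst, source_date_epoch):
    """Copy the wheel at src to dst, stamping every entry with the epoch (UTC)."""
    stamp = time.gmtime(max(source_date_epoch, ZIP_MIN_EPOCH))[:6]
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():  # keep the original entry order
            fixed = zipfile.ZipInfo(info.filename, date_time=stamp)
            fixed.compress_type = info.compress_type
            fixed.external_attr = info.external_attr  # preserve file modes
            zout.writestr(fixed, zin.read(info.filename))

# Possible usage (not executed here):
#   subprocess.run([sys.executable, "-m", "pip", "wheel", "--wheel-dir", "dist", "."],
#                  env=build_env(os.environ, 1577836800), check=True)
#   redate_wheel("dist/in.whl", "dist/out.whl", 1577836800)
```

Only timestamps change, so the content hashes listed in the wheel's `RECORD` file remain valid.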
This is an issue not just with CC: @bdrewery
Interesting... this is a common pattern that can't be overridden:
From
It's unfortunate, but it looks like the
Sidenote: hmmm... the notion of tempfiles in cpython as of 3.9 seems insecure; it bypasses better security practices by reimplementing mkstemp(3), etc, with a deterministic algorithm/template consisting of only 8 "random" characters seeded by the PID :(.
I've fallen back to invoking It's really unfortunate that there isn't a better story around this with
Is this another argument in favor of in-tree builds (#7555)?
It is slightly different, since building in the source tree does not necessarily mean the built artifacts are in the source tree. It is only by tradition that the most popular back-end (setuptools) does this. Having in-tree builds would happen to solve the immediate problem, but IMO the ultimate solution to this problem would be to introduce a flag to PEP 517 that can tell the back-end where it must generate the artifact, and create a flag in pip to let the user provide that information.
Currently, pip randomly assigns directory names when it builds Python sdists into bdists. This can result in randomized file paths being embedded into the build output (usually in debug symbols, but potentially in other places).

The ideal solution would be to trim the front (random part) of the file path off, leaving the remaining (deterministic) part to embed in the binary. Doing so would require reaching deep into the configuration of whatever compiler/linker pip happens to be using (e.g. gcc, clang, rustc, etc.). This option, on the other hand, doesn't require modifying the internals of Python packages.

In this patch we make it so that pip's randomly assigned directory paths are instead generated from a deterministic counter. Doing so requires exclusive access to TMPDIR, because otherwise other programs (likely other executions of `pip`) will attempt to create directories of the same name. For that reason, the feature only activates when SOURCE_DATE_EPOCH is set.

For more discussion (and prior art) in this area, see:
* https://github.com/NixOS/nixpkgs/pull/102222/files
* pypa#6505
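The counter idea described in that patch can be sketched as follows. This is not the patch's code: `BuildDirFactory`, its `make` method, and the zero-padded naming scheme are hypothetical, and the environment is passed in explicitly so the behaviour is easy to exercise.

```python
import itertools
import os
import tempfile


class BuildDirFactory:
    """Create build directories under root.

    With SOURCE_DATE_EPOCH present in environ, names come from a counter and
    are identical from run to run. Otherwise tempfile.mkdtemp picks a random
    name, which is safe when root is shared with other processes.
    """

    def __init__(self, root, environ):
        self.root = root
        self.deterministic = "SOURCE_DATE_EPOCH" in environ
        self._counter = itertools.count()

    def make(self, kind):
        if self.deterministic:
            name = "pip-{}-{:08d}".format(kind, next(self._counter))
            path = os.path.join(self.root, name)
            # Raises FileExistsError if something else already used this
            # name, which is why exclusive access to root is required.
            os.mkdir(path, mode=0o700)
            return path
        return tempfile.mkdtemp(prefix="pip-{}-".format(kind), dir=self.root)
```

Two runs with the same `root` layout and `SOURCE_DATE_EPOCH` set would see the same sequence of directory names, so any path embedded in the build output would match as well.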
Another piece of software that breaks because of non-deterministic paths is sccache, which requires paths to match in order to get a cache hit. Also, the DOS thing probably isn't important when builds are containerized.
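A toy model shows why a path-sensitive cache misses here. This is an illustration only and makes no claim about how sccache actually computes its keys; `cache_key` and the example paths are hypothetical.

```python
import hashlib


def cache_key(compiler_args, source_bytes):
    """Hash the full command line together with the source contents."""
    digest = hashlib.sha256()
    for arg in compiler_args:
        digest.update(arg.encode("utf-8"))
        digest.update(b"\0")  # separator so argument boundaries matter
    digest.update(source_bytes)
    return digest.hexdigest()


if __name__ == "__main__":
    source = b"int main(void) { return 0; }\n"
    run1 = ["cc", "-c", "/tmp/pip-wheel-aaaaaaaa/pkg/ext.c"]
    run2 = ["cc", "-c", "/tmp/pip-wheel-bbbbbbbb/pkg/ext.c"]
    # Same source, different random directory: the keys differ, so no hit.
    print(cache_key(run1, source) == cache_key(run2, source))
```

Any scheme that folds absolute paths into the key behaves this way, regardless of the exact hashing details.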
So... pip no longer performs local builds in a non-deterministic location, so if users are seeing non-deterministic build outputs, it's likely not pip but instead the build-backend that the package being built is using.
Hmm, but in practical terms that means the issue still occurs when e.g. installing packages / building wheels from git URLs. EDIT: see below
pip needs to download source distributions (from VCS or archives) to temporary directories because it may not even know their name in advance, and when the name is known, it may need to download and prepare metadata for different versions of the same project during the resolution process. One thing we could imagine is moving/renaming the temporary unpack directory to a predictable location (say

@vlad-ivanov-name I don't think the code you highlight above is relevant because it is merely the target directory where the built wheel must be stored, which should not be relevant to the content of the wheel.
That makes sense, thank you. For the purpose of CI caching, where paths being deterministic 90% of the time is good enough already, I'm considering monkey-patching
I think predictable and deterministic are a bit different; again, for caching, having deterministic paths is enough even if the way of deriving those paths is convoluted.
Is this not a problem for the build backend? I'm struggling to see why the problem here isn't that the backend embeds the full pathname into the output, rather than just a relative name.
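One way a backend, or a wrapper that already knows the source directory, could avoid embedding the absolute path is the `-ffile-prefix-map=OLD=NEW` option that GCC and Clang provide. The sketch below only builds the environment; `with_prefix_map` is a hypothetical helper, and whether a given backend passes `CFLAGS` through to the compiler is an assumption to verify.

```python
def with_prefix_map(env, source_dir, replacement="."):
    """Return a copy of env whose CFLAGS remap source_dir to replacement.

    -ffile-prefix-map rewrites the given path prefix wherever the compiler
    would otherwise record it, including debug information.
    """
    flag = "-ffile-prefix-map={}={}".format(source_dir, replacement)
    new_env = dict(env)
    new_env["CFLAGS"] = (new_env.get("CFLAGS", "") + " " + flag).strip()
    return new_env


if __name__ == "__main__":
    env = with_prefix_map({"CFLAGS": "-O2"}, "/tmp/pip-wheel-_bd8v3f2/pyyaml")
    print(env["CFLAGS"])
```

The caller has to know the directory before compiling, which is exactly the piece of information that is hard to obtain from outside pip when the directory name is random.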
Yes, this should first be addressed in build backends and now that
Well... At the end of the day, while building things in a reproducible manner is a valuable activity, pip is not a tool that can enforce that right now. Note that the wheels are built by a "build backend" such as https://github.com/pypa/setuptools/ or https://github.com/pypa/flit/ or https://github.com/pypa/hatch. All that pip is doing is calling them and copy-pasting their artifacts over. Basically, I don't think the guarantee that's being requested here is something that pip itself can provide, on its own anyway. There's tooling available to build wheels in a reproducible manner, like https://github.com/kushaldas/asaman -- which uses pip under the hood and sets up everything for the relevant build-backends to build things in a reproducible manner (assuming they follow the https://reproducible-builds.org/ model).
That is fair; certainly, pip alone won't be able to manage all possible caveats (at the end of the day one could always put However, I certainly could see pip assisting with it to some degree, for example, by providing a way to change the behaviour of the I do understand that to some degree, it is the problem of build systems, cache wrappers etc; but in some cases, those have no choice but to depend on absolute paths as the mechanism of detecting whether the path would affect the output would be too complicated and unreliable to implement (example: mozilla/sccache#35).
vagrants-iMac:electrum vagrant$ ./contrib/osx/compare_dmg dist/electrum-4.3.0-ghost43.dmg /Users/vagrant/Desktop/electrum-4.3.0-thomas1.dmg
[...]
Extracting signatures from release app...
Created mac_extracted_sigs.tar.gz
Applying extracted signatures to unsigned app...
Done. .app with sigs applied is at: /tmp/electrum_compare_dmg/signed_app
++ diff -qr /tmp/electrum_compare_dmg/signed_app /tmp/electrum_compare_dmg/dmg2
+ diff='Files /tmp/electrum_compare_dmg/signed_app/Electrum.app/Contents/MacOS/cbor/_cbor.cpython-39-darwin.so and /tmp/electrum_compare_dmg/dmg2/Electrum.app/Contents/MacOS/cbor/_cbor.cpython-39-darwin.so differ'
+ diff='diff errored'
+ set +x
diff errored
DMGs do *not* match.
failure

user@user-VirtualBox:~/wspace/tmp$ vbindiff comp/signed_app/_cbor.cpython-39-darwin.so comp/dmg2/_cbor.cpython-39-darwin.so

comp/signed_app/_cbor.cpython-39-darwin.so
0000 6AC0: 00 5F 50 79 49 6E 69 74 5F 5F 63 62 6F 72 2E 6D ._PyInit __cbor.m
0000 6AD0: 6F 64 65 66 00 5F 43 62 6F 72 4D 65 74 68 6F 64 odef._Cb orMethod
0000 6AE0: 73 00 2F 70 72 69 76 61 74 65 2F 76 61 72 2F 66 s./priva te/var/f
0000 6AF0: 6F 6C 64 65 72 73 2F 35 36 2F 64 38 36 70 35 39 olders/5 6/d86p59
0000 6B00: 37 31 31 67 7A 63 62 38 73 31 71 37 31 36 78 31 711gzcb8 s1q716x1
0000 6B10: 6C 63 30 30 30 30 67 6E 2F 54 2F 70 69 70 2D 69 lc0000gn /T/pip-i
0000 6B20: 6E 73 74 61 6C 6C 2D 36 6D 69 36 68 6C 75 65 2F nstall-6 mi6hlue/

comp/dmg2/_cbor.cpython-39-darwin.so
0000 6AC0: 00 5F 50 79 49 6E 69 74 5F 5F 63 62 6F 72 2E 6D ._PyInit __cbor.m
0000 6AD0: 6F 64 65 66 00 5F 43 62 6F 72 4D 65 74 68 6F 64 odef._Cb orMethod
0000 6AE0: 73 00 2F 70 72 69 76 61 74 65 2F 76 61 72 2F 66 s./priva te/var/f
0000 6AF0: 6F 6C 64 65 72 73 2F 37 68 2F 70 33 30 7A 5F 74 olders/7 h/p30z_t
0000 6B00: 79 31 35 30 31 32 70 66 5F 33 64 79 78 62 73 39 y15012pf _3dyxbs9
0000 6B10: 33 34 30 30 30 30 67 6E 2F 54 2F 70 69 70 2D 69 340000gn /T/pip-i
0000 6B20: 6E 73 74 61 6C 6C 2D 30 68 64 39 63 35 6D 65 2F nstall-0 hd9c5me/
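The same comparison can be done without a hex viewer. The following sketch is an assumption-laden stand-in for the `vbindiff` step: `first_difference` and `printable_context` are hypothetical helpers that locate the first differing byte and show the readable text around it.

```python
def first_difference(a, b):
    """Return the offset of the first differing byte, or -1 if a == b."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One input may be a strict prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def printable_context(data, offset, before=32, after=32):
    """Render the bytes around offset, replacing non-printables with dots."""
    chunk = data[max(0, offset - before):offset + after]
    return "".join(chr(c) if 32 <= c < 127 else "." for c in chunk)


if __name__ == "__main__":
    left = b"s\x00/private/var/folders/56/d86p59/T/pip-install-6mi6hlue/"
    right = b"s\x00/private/var/folders/7h/p30z_t/T/pip-install-0hd9c5me/"
    where = first_difference(left, right)
    print(where, printable_context(left, where))
    print(where, printable_context(right, where))
```

For real files, read both with `open(path, "rb").read()` and pass the resulting bytes in.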
related: pypa/pip#6505
We compile from tar.gz, instead of using pre-built binary wheels from PyPI. (or if the dep is pure-python, use tar.gz instead of "source-only" wheel)

-----

Some unorganised things below for future reference.

```
$ dsymutil -dump-debug-map dist1/hid.cpython-39-darwin.so
warning: (x86_64) /private/var/folders/1n/zc14m3td0rg4nt0ftklmm7z00000gn/T/pip-install-bm88zvc1/hidapi_cd307bc31ab34252b77d11d6d7212fc5/build/temp.macosx-10.9-x86_64-3.9/hid.o unable to open object file: No such file or directory
warning: (x86_64) /private/var/folders/1n/zc14m3td0rg4nt0ftklmm7z00000gn/T/pip-install-bm88zvc1/hidapi_cd307bc31ab34252b77d11d6d7212fc5/build/temp.macosx-10.9-x86_64-3.9/hidapi/mac/hid.o unable to open object file: No such file or directory
---
triple: 'x86_64-apple-darwin'
binary-path: 'dist1/hid.cpython-39-darwin.so'
...
```

```
$ nm -pa dist1/hid.cpython-39-darwin.so
```

- https://stackoverflow.com/questions/10044697/where-how-does-apples-gcc-store-dwarf-inside-an-executable
- pypa/pip#6505
- pypa/pip#7808 (comment)
- NixOS/nixpkgs#91272
- cython/cython#1576
- https://github.com/cython/cython/blob/9d2ba1611b28999663ab71657f4938b0ba92fe07/Cython/Compiler/ModuleNode.py#L913
This worked for me on Linux for all packages encountered, but only for some on macOS. I am atm only interested in the E.g.
Given the state of the ecosystem and the devolution of build behaviours, I'm gonna close this out and say that you should help improve asaman if you want this.
Linking this here for people looking for possible solutions: https://discuss.python.org/t/introducing-asaman-a-tool-to-bulid-reproducible-wheels/10932
What's the problem this feature will solve?

The Bazel build system has the major selling point of supporting both local and remote-caching.
In order for that caching to work though, Bazel targets must be built deterministically so that the same target always has the same content-addressable hash.

Currently `pip wheel` is non-deterministic, so our Python Bazel targets will cache miss if they depend on something built with `pip wheel`.

Describe the solution you'd like

The following is a subset of the build outputs of the `PyYAML` package. Of the build outputs, it is the `RECORD` files and the `_yaml.cpython-36m-x86_64-linux-gnu.so` shared object file that have non-deterministic hashes build to build. I have inspected the `RECORD` file and found that it contains the hash of the `.so` file, so it is non-deterministic because of the `.so` file, and I think only because of that.

So the problem is the `.so` file.

I ran the `strings` program on the `.so` file and found this printable string: `/tmp/pip-wheel-_bd8v3f2/pyyaml`. That is coming from here:

pip/src/pip/_internal/wheel.py
Line 649 in 6af9de9

So while I found other differences between different `_yaml.cpython-36m-x86_64-linux-gnu.so`, this tmp directory usage leaking in itself is sufficient to break determinism.

Additional context

`rules_python` issue discussing this problem: bazelbuild/rules_python#154
`rules_python` repo: https://github.com/bazelbuild/rules_python
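The `strings` check described above can be approximated in Python for use in a CI step. This is a rough sketch, not an equivalent of the `strings` tool: `leaked_pip_paths` is a hypothetical helper, and the `pip-wheel-` and `pip-install-` markers are taken from the directory names quoted in this thread.

```python
import re

# Runs of at least four printable ASCII bytes, similar in spirit to strings(1).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# Temporary directory prefixes that appear in the outputs quoted in this thread.
PIP_TMP_MARKER = re.compile(rb"pip-(?:wheel|install)-")


def leaked_pip_paths(data):
    """Return the printable strings in data that mention a pip temp directory."""
    return [
        run.decode("ascii")
        for run in PRINTABLE_RUN.findall(data)
        if PIP_TMP_MARKER.search(run)
    ]


if __name__ == "__main__":
    sample = b"\x00\x01_yaml\x00/tmp/pip-wheel-_bd8v3f2/pyyaml\x00\x7fELF"
    print(leaked_pip_paths(sample))
```

Running it over the bytes of a built `.so` and failing the build on a non-empty result would flag this particular leak early, although it says nothing about other sources of non-determinism.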