Increasing pip's & PyPI's metadata strictness #264
Comments
I linked to this on Discourse to give people a heads-up. Some questions:
Who needs to do this? This will affect Twine & Warehouse in particular, right?
Would this be in |
We could do it in I think one problem with the "PyPI raises a warning" approach is that IIUC twine has no mechanism to display any such warning. @di and @dstufft would know better than me whether adding such a capability is desirable for this or other reasons. |
I have a branch started on this and intend it to be a part of packaging so it can be used from multiple sources. I think starting by using it in warehouse would make sense. |
The upload API does not support a warning mechanism, only error. Arguably it's better to send a warning via email though? Or at least one could make the argument that it is. Lots of people publish in an automated fashion and won't ever see a command line warning anyways. |
Warnings from PyPI via email make a lot of sense to me. It's definitely a better option than adding a warnings mechanism that twine then exposes to the user. We should also use something (like packaging) which can also be used across projects to do the metadata validation. |
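For concreteness, here is a minimal sketch of the kind of shared, reusable metadata validator being discussed, assuming it uses the `packaging` library; the function name and the particular set of checks are illustrative, not the actual validator under development.

```python
# Sketch of a reusable metadata validator of the kind discussed above.
# Assumes the `packaging` library is available; the checks are illustrative.
from packaging.requirements import InvalidRequirement, Requirement
from packaging.specifiers import InvalidSpecifier, SpecifierSet
from packaging.version import InvalidVersion, Version


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems found in *meta*."""
    problems = []

    try:
        Version(meta.get("version", ""))
    except InvalidVersion:
        problems.append(f"version {meta.get('version')!r} is not PEP 440 compliant")

    requires_python = meta.get("requires_python")
    if not requires_python:
        problems.append("Requires-Python is missing")
    else:
        try:
            SpecifierSet(requires_python)
        except InvalidSpecifier:
            problems.append(f"Requires-Python {requires_python!r} is not a valid specifier")

    for req in meta.get("requires_dist", []):
        try:
            Requirement(req)
        except InvalidRequirement:
            problems.append(f"dependency {req!r} is not a valid requirement")

    return problems


# A caller (twine, setuptools, or Warehouse) could turn these into warnings
# or hard failures, or feed them into the email-notification idea above.
if __name__ == "__main__":
    for issue in validate_metadata(
        {"version": "not-a-version", "requires_dist": ["requests >=2.0"]}
    ):
        print("warning:", issue)
```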
Not really related to packaging itself, but I wonder if PyPI could start archiving obsolete/unmaintained packages from the main index. PyPI's UX could be improved if we had a curated index with currently maintained packages. At this moment, even if you search for a "foo" package, you will get a very poor search result, as the default listing (relevance) does not display the last release date. Maybe we could use metadata compliance as a way to filter the package indexes, motivating people to migrate? Regarding twine upload warnings, I am even more drastic: default to error and allow a temporary bypass option. Sadly, 99% of people don't even read warnings, so make them errors. The only trick is to include a link to a ticket where people can see how to fix it and also comment. |
I've released a couple of wheels with a build number. Every layer removed from the source tends to want its own version number; RPM has epochs. Alternatively you could make 7 corrections by uploading a py30-none-any wheel and then incrementing the tag all the way to py38-none-any. |
The biggest problem with any sort of system that tries to determine if something is obsolete/unmaintained or not... is how do you actually determine if something is obsolete and/or unmaintained and ensure that you don't get false positives for software that is just "done" and just doesn't need any further updates? |
@ssbarnea I'd like to keep this issue focused on the metadata strictness issue. For more on archiving unmaintained projects or excluding them from search or otherwise making frequently-maintained packages easier to find, you might want to follow up in pypi/warehouse#4004 , pypi/warehouse#1388 , pypi/warehouse#4319 , pypi/warehouse#4021 , or pypi/warehouse#1971 . Thanks for your ideas! |
@dstufft is "Disallow runs of special characters in project names" pypi/warehouse#469 or "Clean database of UNKNOWN and validates against it" pypi/warehouse#69 part of what's necessary, or part of what we want to do, in this increasing-metadata-strictness work? |
@crwilcox perhaps you'd like to take a look at pypi/warehouse#194 where people discuss what automated checks, including on metadata, they'd like performed on uploads to PyPI. |
Coming from @di's comment pypa/twine#430 (comment), by "like packaging" here, I wasn't suggesting we should add this to packaging itself. Rather, I meant that we should have a well-scoped library that does just this one thing -- package validation.
|
I was proposing adding this to `packaging`. |
One of the TODOs here is to finish the pip solver. The Python Software Foundation's Packaging Working Group has secured funding to help finish the new dependency resolver, and is seeking two contract developers to aid the existing maintainers for several months. Please take a look at the request for proposals and, if you're interested, apply by 22 November 2019. And please spread the word to freelance developers and consulting firms. |
We're progressing in the pip resolver work; in a recent meeting, @uranusjr @pfmoore and @pradyunsg started talking about how better Warehouse metadata would help pip with its new resolver, which we'd like to roll out around May. So pypi/warehouse#726, pypa/packaging#147, pypa/twine#430, and pypa/setuptools#1562 would really help; would anyone like to step up and help get those moving? |
How is pypi/warehouse#726 related to metadata strictness here? |
One aspect I'm aware of is that standardised two-phase upload makes it easier to test installation metadata accuracy prior to release, since it also allows the testing process to be standardised. |
+1 to @ncoghlan. Two-phase upload/package preview gives us
|
I think it's probably worth making a distinction here: two-phase upload could help prevent metadata that is "not the author's intention, but technically compliant" (such as typos or an incorrect `Requires-Python`). I think a number of things listed here as "metadata inaccuracy" (like incorrect dependencies or invalid manylinux wheels) might be true, but would require some form of code execution / introspection / auditing (e.g. `auditwheel`) to detect. For things like "warn on missing `Requires-Python`": which of these is causing the biggest issue for the resolver work? |
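As a rough illustration of the "code execution / introspection / auditing" category, here is a sketch of shelling out to `auditwheel` for manylinux uploads, along the lines of pypi/warehouse#5420. It assumes `auditwheel` is installed and that a nonzero exit status means the wheel does not actually satisfy the manylinux tag it claims; it is not Warehouse's actual upload code.

```python
# Rough sketch of server-side introspection along the lines of
# pypi/warehouse#5420: shell out to auditwheel for manylinux uploads.
# Assumes auditwheel is installed and that a nonzero exit status means
# the wheel fails the audit for the manylinux tag it claims.
import subprocess
import sys


def audit_manylinux_wheel(path: str) -> bool:
    """Return True if auditwheel accepts *path* as a valid platform wheel."""
    if "manylinux" not in path:
        return True  # not a manylinux wheel; nothing to audit here
    result = subprocess.run(
        ["auditwheel", "show", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"rejecting {path}: {result.stderr.strip()}", file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    for wheel in sys.argv[1:]:
        audit_manylinux_wheel(wheel)
```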
This GitHub issue bundles together the entire bunch of metadata concerns as they were voiced during the (only? main?) metadata-related topic at the Packaging Mini-Summit 2019 -- most of which ended up being metadata-validation related -- however, as noted below, the originally proposed topic was oriented around the resolver's operation/UX. Additionally, two-phase uploads also came up in both of the resolver-related planning calls: in the discussion of how rolling out the changes to get better metadata on PyPI could be done (in the early Jan call, as a long-term strategy for improving the state of metadata on PyPI) and of how it could catch release issues (like pip 20.0, in the more recent call). IIUC, a comment by me in a resolver-planning meeting recently prompted the follow-up here:
The main issue for the resolver is the metadata inaccuracies, which we really have no option other than to deal with directly in the resolver somehow. A new resolver has to work with the already-published-with-inaccurate-metadata packages -- so even if we somehow end up with a build farm to build packages on PyPI in the future and can make sure every new release actually presents correct metadata -- that's not gonna change that the resolver still needs to deal with potentially-incorrect metadata from past releases (unless we're planning to tackle the super intractable problem of back-filling this metadata). Metadata validation isn't super relevant to the resolution process (directly anyway). Looking back, the original topic suggestion was more oriented around exactly this -- dealing with inaccuracies in the PyPI metadata in the resolution process. The actual discussion at the event transitioned to be more around other parts of the workflow, which generate/publish metadata (vs. use it, like in the resolver), because of the audience-driven nature of the discussion. The PyCon 2019 notes don't seem to mention this, so I'm going off my memory now: the reason the resolver is mentioned in the summary notes above is that during the discussion at the summit of "should pip stop when it is installing a package that may break my environment", we discussed that, theoretically, a pip-with-better-resolver would come up with a solution that the current pip would not be able to find in those situations. (I remember this, because we'd joked about the resolver's complexity when we added it to the action items: https://twitter.com/EWDurbin/status/1125447285272395776/photo/3) :) (and, that makes it 4am IST) |
It wouldn't be that intractable to backfill metadata for wheels -- but for sdists there's not much we're going to be able to do. Would the resolver be smart enough to know, in a hypothetical scenario, that it can prefetch dependency information for a wheel, but not for an sdist? Would that help? |
Yep yep -- we can do that on the "pip side" of the abstractions. |
pypi/warehouse#3889 is the issue requesting that Warehouse reject package uploads that lack a Requires-Python. |
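A minimal sketch of the kind of check pypi/warehouse#3889 asks for, assuming the uploaded file is a wheel: read the `METADATA` file and reject the upload when `Requires-Python` is absent. The function names and the rejection behaviour are illustrative, not Warehouse's actual upload handler.

```python
# Minimal sketch of rejecting a wheel upload that lacks Requires-Python
# (pypi/warehouse#3889). Illustrative only; not Warehouse's upload code.
import zipfile
from email.parser import Parser
from typing import Optional


def requires_python_of(wheel_path: str) -> Optional[str]:
    """Return the Requires-Python value from a wheel's METADATA, or None."""
    with zipfile.ZipFile(wheel_path) as zf:
        metadata_name = next(
            name for name in zf.namelist() if name.endswith(".dist-info/METADATA")
        )
        metadata = Parser().parsestr(zf.read(metadata_name).decode("utf-8"))
    return metadata.get("Requires-Python")


def check_upload(wheel_path: str) -> None:
    # Illustrative policy: hard-fail instead of warning.
    if not requires_python_of(wheel_path):
        raise ValueError(f"{wheel_path}: rejected, Requires-Python metadata is missing")
```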
A followup after some discussion in IRC a few days ago. On some open issues and TODOs:
- Still working on this.
- We'd love help with this feature.
- Simple API for yanking: underway; some to-do items remain.
- Relevant work is in progress -- see pypa/packaging#147 (comment).
- This is now done.
- I don't know whether anyone has made progress on this. TODO #
- In progress.
- We need to further discuss lack of `python_requires`.
- I think we still need to discuss this in pypi/warehouse#194.
- Per the IRC discussion, it sounds like this may or may not be necessary, depending on whether we enforce more specific rules about metadata that must be included in packages, minimum metadata versions, etc.
- Again, we would love help implementing package preview/staged releases. |
@alanbato is working on package preview/staged releases (now "draft releases"), and the yanking feature is now implemented pypi/warehouse#5837. We still need further discussion and help with
|
I'm using wheel build numbers for an experimental re-compressed wheels repository and they are working correctly. pip knows the re-compressed wheels found on that |
What @dholth said made me wonder if in the future we may be able to repackage wheels, maybe even adding extra constraints that prevent incompatibilities with dependencies that were released after the initial wheel was published. Maybe that would be too much, but the idea of being able to have an increasing packaging number (aka release number) is great. |
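For reference, a small illustration of how build numbers are encoded in wheel filenames and how an installer can prefer the higher build tag, using `packaging.utils.parse_wheel_filename`; the filenames below are made up.

```python
# Small illustration of wheel build tags: two files with the same version but
# different build numbers, where an installer prefers the higher build tag.
# The filenames are hypothetical.
from packaging.utils import parse_wheel_filename

original = "demo-1.0-py3-none-any.whl"
recompressed = "demo-1.0-1-py3-none-any.whl"  # build tag "1" after the version

for filename in (original, recompressed):
    name, version, build, tags = parse_wheel_filename(filename)
    print(filename, "->", name, version, "build tag:", build)

# An installer choosing between otherwise-identical wheels can sort on the
# build tag; the empty tuple () sorts before (1, ""), so the rebuilt wheel wins.
best = max((original, recompressed), key=lambda f: parse_wheel_filename(f)[2])
print("preferred:", best)
```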
@ssbarnea We'd have to be extremely careful here from a security point of view. It would be bad, for example, if an untrusted party were able to repackage the wheel for numpy, claiming it's just recompressed to save bandwidth, but in fact they also introduce a new dependency on `maliciouspackage==1.0`.
Obviously any new repository supplying repackaged wheels would be opt-in, so the exposure is limited to people who do opt in, but as a term, "repackaging" implies no changes to what gets installed on the user's machine, and we don't (yet) have mechanisms to ensure that. |
You'd better trust the mirror; however, it would be possible to check all the contained file hashes against the original, or otherwise go wild adding security features.
|
Recompressing wheels is inherently incompatible with a threat model where an attacker has the ability to respond to client requests. This is a threat model we're protecting PyPI from with future security enhancements: https://www.python.org/dev/peps/pep-0458/#threat-model. Note that this PEP is accepted and, as far as I know, there is funding for the implementation of this functionality as well. If the user has to download multiple wheels (like the original wheel, to check the contained file hashes against), I'm pretty sure we've thrown away any bandwidth gains we'd have made. :) |
This bug is really about whether the build tags work. They do.
If you were willing to sign the RECORD (or any document containing hashes of all the individual files within the package), and if you were willing to let the recompressed wheel be unpacked before it was totally verified, then your extra bandwidth would be the list of filenames and hashes of the original, not the entire original wheel.
If you're more worried about making a silly mistake than about withstanding a sophisticated attack, then you're in a much better position. You can use an HTTP range request to download the original wheel's RECORD from PyPI, and compare it with what you got from the recompressor.
You could have endless fun going down the rabbit hole of having the mirror say "I zipped this correctly" with one signature in which case you would feel confident enough to run unzip on the wheel, and then make sure the individual files were still intact.
No matter the security model, it's really common to have your own source of packages that you want overlaid on top of the public one.
|
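A simplified sketch of the verification described above: confirm that a recompressed wheel contains byte-for-byte the same files as the original. Instead of parsing `RECORD` it recomputes per-file hashes from both archives; the scheme discussed in the thread would instead compare against the original's signed `RECORD` (fetched, say, via an HTTP range request) to avoid downloading the whole original wheel.

```python
# Simplified check that a recompressed wheel has exactly the same contents
# as the original. Recomputes per-file hashes from both archives rather than
# parsing RECORD; a real scheme would verify against the original's RECORD.
import hashlib
import zipfile


def file_digests(wheel_path: str) -> dict[str, str]:
    """Map each archive member name to the sha256 of its contents."""
    digests = {}
    with zipfile.ZipFile(wheel_path) as zf:
        for name in zf.namelist():
            digests[name] = hashlib.sha256(zf.read(name)).hexdigest()
    return digests


def same_contents(original: str, recompressed: str) -> bool:
    """True if both wheels contain exactly the same files with the same bytes."""
    return file_digests(original) == file_digests(recompressed)
```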
As this ticket is blocked by the development of the dependency resolver (#988), I thought I would mention here that the team is looking for help from the community to move forward on that subject. We need to better understand the circumstances under which the new resolver fails, so are asking for pip users with complex dependencies to:
You can find more information and more detailed instructions here |
Now that pypa/pip#988 is resolved, folks have started taking a fresh look at related issues starting at pypa/pip#9187 (comment) . |
I guess a terribly dumb question, but can anyone point me to a place where I can control the metadata? For some reason the published package (uploaded as a tgz) doesn't have the same metadata as I put in setup.py. |
You can inspect the metadata locally by building a distribution ( |
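A minimal sketch of one way to inspect the metadata an sdist actually ships with: build it (for example with `python -m build --sdist`) and read the `PKG-INFO` file inside. The path below is hypothetical.

```python
# Read the PKG-INFO metadata out of a locally built sdist. The sdist path is
# hypothetical; produce one first, e.g. with `python -m build --sdist`.
import tarfile

sdist_path = "dist/mypackage-1.0.tar.gz"

with tarfile.open(sdist_path) as tf:
    pkg_info = next(m for m in tf.getmembers() if m.name.endswith("/PKG-INFO"))
    print(tf.extractfile(pkg_info).read().decode("utf-8"))
```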
(Followup to discussion during packaging minisummit at PyCon North America in May 2019.)
Conda has a metadata solver. Pip and PyPI already know, or will soon know, of cases where package metadata is incorrect; how firmly/strictly should metadata correctness be enforced?
There's general agreement that strictness should be increased; the question is: how quickly and to what extent?
Issues
a. Agreement among PyPI working group to better enforce `manylinux` (Donald Stufft, Dustin Ingram, EWDIII) -- see "Run auditwheel on new manylinux uploads, reject if it fails" pypi/warehouse#5420
b. EWDIII - There is no technical barrier for PyPI updating its metadata. There is a non-starter on updating the package/changing the metadata.
c. Also possibility for staged releases. This would allow composition and checks before release. (Nick + Ernest)
d. Can metadata be corrected by producing a new wheel or post-release? Likely not by uploading a wheel.
e. Ability to yank packages (will not install for interval-based version specifications, but would still be available to install under exact/equality pinning) per PEP 592. New simple API change to support yanking (a la Ruby gems)
f. Metadata should not be able to differentiate between artefact and PyPI

`manylinux-*`
a. will/may require the solver
Action Items:
a. Chris Wilcox said "I am going to start on a validator in pypa/validator that can be leveraged at twine/setuptools/warehouse" -- has started it -- also see "`twine check` should guard against things not accepted by PyPI like version format" pypa/twine#430
b. Hard fail on invalid markup with explicit description type -- see "Warehouse to start hard failing package uploads on invalid markup with explicit description type" pypi/warehouse#3285
`python_requires`
a. Or could the spec/`setuptools` be updated to fail on this
b. Also, can we fail when there's missing `author`, `author_email`, `URL`. Currently warnings on setup. (chris wilcox)
c. For packages where no restrictions on Python version are desired, a `python_requires==*` would be satisfactory
d. also see "WIP: Add metadata validation" pypa/setuptools#1562
Build numbers (do they work?)
a. Yes; support needs to be added; an issue needs to be created
This is meant as a tracking issue covering the various TODOs necessary to plumb this through the parts of the toolchain.