PEP-458 Implementation (Secure downloads with signed metadata) #9041
Conversation
This prevents the Updater from making distribution file requests to the index file mirror (pypi.org). It's possible to do the same for index file requests to the distribution file mirror, but this is less critical as modern Python dictionaries are ordered: the index file mirror is always tested first.
get_tuf_updaters() returns a dictionary of index_url => Updater configured with appropriate cache and metadata directories
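As a rough sketch of that shape (`UpdaterConfig` is a stand-in for the real tuf `Updater`, and the per-repository directory-naming scheme is an assumption):

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical stand-in for the configured tuf Updater; the real object
# would also carry the mirror configuration.
@dataclass
class UpdaterConfig:
    index_url: str
    metadata_dir: Path
    cache_dir: Path

def get_tuf_updaters(index_urls, data_dir: Path, cache_dir: Path):
    """Return a dict of index_url => updater, each configured with its
    own per-repository metadata and cache directories."""
    updaters = {}
    for url in index_urls:
        # Derive a filesystem-safe per-repository directory name
        # (illustrative; the real scheme may differ).
        repo_dir = url.replace("://", "_").replace("/", "_")
        updaters[url] = UpdaterConfig(
            index_url=url,
            metadata_dir=data_dir / repo_dir,
            cache_dir=cache_dir / repo_dir,
        )
    return updaters
```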
This is still nowhere near correct, but it's simpler: everything gets downloaded into the cache dir, and the path gets returned. Let's make it work in a simple way first...
When we have local TUF metadata for the repository, use TUF to download the distribution files. This has several design issues and some documented bugs but works when:
* cache is used
* local TUF metadata for the repo is present in ~/.local/share/pip/tuf/
* Implemented in LinkCollector
* Expects the index files to exist at path {INDEX_URL}/{PROJECT}/{HASH}.index.html
* This now leads to confined_target_dirs not being usable, meaning that the index file server will get requests for data files: this should not happen, we should do some kind of mirror switcheroo instead

The code now expects LinkCollector to only be used for project urls: this may or may not be reasonable... This commit probably breaks most commands other than 'install': the LinkCollector API has changed and those call sites have not been updated.
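Forming the `{HASH}.index.html` target name could look like the following sketch. Using the sha256 of the index file content mirrors TUF's consistent-snapshot naming, but the exact algorithm here is an assumption:

```python
import hashlib

def hashed_index_target(content: bytes) -> str:
    """Form the '{HASH}.index.html' name for an index file from its
    content hash (algorithm assumed to be sha256)."""
    return hashlib.sha256(content).hexdigest() + ".index.html"
```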
Current TUF Updater mirror config means queries to the wrong server: theupdateframework/python-tuf#1143 Workaround by storing two separate mirror configurations: one for index file downloads and one for distribution file downloads.
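The workaround might look roughly like this (the dict shape loosely follows tuf's mirror-configuration convention, but the keys and values here are illustrative):

```python
def make_mirror_configs(index_url: str, files_url: str):
    """Build two separate mirror configurations: one used only when
    downloading index files, one used only when downloading
    distribution files. This keeps each query on the right server."""
    index_mirrors = {
        "index": {
            "url_prefix": index_url.rstrip("/"),
            "targets_path": "",
        }
    }
    distribution_mirrors = {
        "files": {
            "url_prefix": files_url.rstrip("/"),
            "targets_path": "",
        }
    }
    return index_mirrors, distribution_mirrors
```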
This still requires knowledge of the url structure, but should now work on servers that don't have the '/packages/' entrypoint.
When --no-cache is given, cache_dir is None. Use a temporary directory in this case.
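A minimal sketch of that fallback (the function name is hypothetical):

```python
import tempfile

def resolve_cache_dir(cache_dir):
    """With --no-cache pip passes cache_dir=None; fall back to a fresh
    temporary directory so downloads still have somewhere to land."""
    if cache_dir is None:
        return tempfile.mkdtemp(prefix="pip-secure-update-")
    return cache_dir
```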
We need to ship initial metadata for supported repositories (read pypi.org) with pip. This is used to populate the actual metadata directory if it does not exist. This commit includes a placeholder metadata for repository "https://localhost:8000/simple/": it should be replaced with pypi.org metadata once it is available.
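The first-run population could be sketched like this (assuming a copy-if-missing policy; the names are illustrative):

```python
import shutil
from pathlib import Path

def ensure_bootstrap_metadata(bundled_dir: Path, runtime_dir: Path) -> bool:
    """Copy the metadata shipped with pip into the runtime metadata
    directory, but only when it does not exist yet: runtime metadata
    already on disk may be newer than the bundled copy and must not be
    overwritten. Returns True if the copy happened."""
    if runtime_dir.exists():
        return False
    shutil.copytree(bundled_dir, runtime_dir)
    return True
```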
This is what seems to be expected when distribution download fails
This happens e.g. when the user tries to install a non-existent project.
Also remove some comments about cache: it does work now
Add a list of index urls that really should work securely (in the future this should include pypi.org). Raise an exception if we do not find a secure downloader for an index url in the list.
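In sketch form (the exception name and the placeholder list contents are assumptions):

```python
# Index urls that really should work securely; in the future this
# should include pypi.org (the placeholder entry below is illustrative).
KNOWN_SECURE_INDEXES = ["https://localhost:8000/simple/"]

class NoSecureDownloaderError(Exception):
    """Raised when a required secure downloader is missing."""

def check_required_downloaders(downloaders: dict) -> None:
    for index_url in KNOWN_SECURE_INDEXES:
        if index_url not in downloaders:
            raise NoSecureDownloaderError(
                f"No secure downloader for {index_url}"
            )
```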
Also move the cache/data dir selection inside SecureUpdateSession.
Note that list is the only place that does multithreaded repository access: This will fail with TUF so the parallelism is disabled if TUF is being used.
SessionCommandMixin now handles both PipSession and SecureUpdateSession.
We parse the index url from a project url in two places (before looking up a SecureDownloader): use similar methods and comments.
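Assuming the '{index_url}/{project}/' layout of PEP 503 simple indexes, the shared helper could be as small as:

```python
def index_url_from_project_url(project_url: str) -> str:
    """Derive the index url from a project url by dropping the last
    path component: '{index_url}/{project}/' -> '{index_url}/'."""
    return project_url.rstrip("/").rsplit("/", 1)[0] + "/"
```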
This is test metadata only
This allows us to distinguish between missing local metadata (which typically happens for every repository we do not ship metadata for) and actual errors while reading local metadata. Requires tuf 0.15
New TUF allows us to:
* not specify targets_path or metadata_path if the mirror does not serve any targets or metadata, respectively
* not specify confined_target_dirs (the default is no confinement)

Requires tuf 0.15
This does not include the vendored sources themselves; run: vendoring sync -v. The patches are a temporary hack: the vendoring tool cannot cope with the import style used, so this is a stop-gap solution.
The api package was added in tuf 0.15: it is not needed by the client. LICENSE files were added in 0.15.
I think @dstufft @xavfernandez @pfmoore @cjerdonek and @chrahunt might be particularly interested in taking a look at this? |
Sorry, I'm not familiar with TUF and I'm not a security specialist, so I'd prefer to leave it to others. |
Note that the discussion items in this PR are not related to TUF or security. They are questions for pip maintainers about whether some changes to the way pip works (progress indication, downloading distributions, and use of multithreading) would be acceptable.
OK. I'm still not going to review a 65-file PR, but I can give you my high-level view.
¹ The GSOC work this year looked at this, but I believe the results were inconclusive, so I don't think the door is completely closed on the possibility. |
Can confirm, I'm the GSoC student. |
Thanks Paul, very useful. I'll give others time to comment as well ... just one comment to not scare others off:
That number includes the vendoring: vendored files are included in the PR to show that the tests pass but admittedly they also make high-level review tricky. master..tuf-mvp shows the changes in pip: actual code changes are ~450 lines, the major ones in four files. If this feels like a difficult PR to comment on, I'm willing to do more work to enable review -- I'm just not sure what that could be? E.g. a zoom-call or a chat is certainly possible. |
An update on the discussion items:
If there are more comments, I'd love to hear them! If you're still considering reviewing: I suggest jku/pip@master...jku:tuf-mvp (branch without vendored dependencies). |
I've finally been able to take out a big enough chunk of time to review this. (update from a future me: this post is going to be in bullet points and is generally very "rough", because this took way longer than I expected -- I'm overtime and have to make dinner! Apologies!)
I usually nudge folks to either (a) break up PRs into smaller chunks, which makes it much easier to review them or (b) create a good "story" in the commits of a big PR, such that it can be reviewed in a commit-by-commit manner across multiple short bursts. Given the generally asynchronous nature + limited availability of pip's maintainers, either of those might be a worthwhile investment. :) |
Thanks for your review, this is what I was looking for. All your comments seem reasonable: if I don't respond to a specific item, assume I agree 100%.
I'm going to work on the TUF changes in the next weeks, then do another pass at this. |
Thanks! I do appreciate the work you're doing to push this forward. :) |
Hello! I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the |
@jku What's the state of affairs for this PR? I've been closing most of the outdated PRs that the bot flagged last month. For this one tho, I'm a bit wary of closing it as "closed due to lack of activity" due to the long feedback loops we've had here. Would you be able to update this PR, or better yet, break this up into smaller chunks? :) |
Thanks for the poke. I think I'll close this: I am working on this again starting next week (since all dependency changes are now done) but it might still take some time and, based on the feedback, look quite a bit different when ready: so I'll just open new pr(s) and link to this earlier discussion at that point. |
Purpose
This PR adds support for the upcoming Warehouse feature where hashes of all index files and all distribution files are signed (initially by Warehouse at distribution upload time, maybe later by developers) as documented in PEP-458. With this PR pip can verify these signatures/hashes to attest that
This is not a final version (it cannot be as the Warehouse API is not final) and I'm not trying to get this version merged: I'm hoping for high-level review that would point out possible design mistakes and barriers to eventual merging, and some discussion on the issues listed in the last section.
The related issue is #8585, I keep my branches and also track issues in https://github.com/jku/pip. To look at just the actual changes (without the vendored sources) see jku/pip@master...jku:tuf-mvp
Current state
This is Work-In-Progress but a functional one: It works against a mock Warehouse but has been tested in a limited capacity only as the real Warehouse feature is not quite ready yet.
A real Warehouse test instance should be available Real Soon Now: I suggest testing this branch only at that point (I will notify here when I've done the small changes that the test server might require). It will also be much easier to understand the metadata at that point when the files can just be downloaded for review from the Warehouse server.
Why the dependencies?
A practical download verification system cannot just verify signatures with set-in-stone keys: it must also handle things like revoking and replacing keys, and ideally should give assurances about the delivery process (e.g. prevent mix-and-match or indefinite-freeze attacks). This is non-trivial and a good reason to use external components. The required new dependencies are:
Both projects are vested in the Warehouse/pip integration: changes that are required to make pip work smoothly are possible (and many have already been done).
Currently these dependencies mean ~17k lines of vendored sources (these are comment heavy projects: sloccount counts 4630 lines of code total). If you'd rather look at a branch without the vendored sources, just skip the last commit or look at master...jku:tuf-mvp (this branch works with out-of-tree tuf and securesystemslib)
User visible changes
The goal is to be invisible to the user. This is currently not quite the case as progress indication does not happen when the TUF code paths are used (more on this later).
There is no user visible configuration or options. There should probably be a disabling option (like '--trusted-host') for TUF support but this does not exist yet.
I have some estimates of data transfer overhead with current TUF version (but Warehouse test server is needed to get proper numbers):
With small packages the overhead could be user-noticeable. In exceptional circumstances (like rotating high level keys) the overhead can be significantly larger, but these should be very rare events. The very first pip execution might have higher overhead as well, depending on the amount of metadata we decide to ship with pip.
Design
This PR aims to minimize changes to existing pip code for simplicity: I am ready to refactor the code more if I get pointed to correct directions.
There are three main points of integration into current pip code:

* `SessionCommandMixin` now initializes a `SecureUpdateSession` (this initialization consists of reading local TUF metadata files, and on first run writing the initial runtime metadata file) that will be used throughout this pip invocation
* `index.collector.LinkCollector.fetch_page()` now has two code paths: if `SecureUpdateSession` has a `SecureDownloader` for the repository in question, it is used to download the html page
* `operations.prepare.get_http_url()` now has two code paths: if `SecureUpdateSession` has a `SecureDownloader` for the repository in question, it is used to download the distribution file

`network.secure_update` contains the new components: `SecureUpdateSession` and `SecureDownloader`.

* `SecureUpdateSession` has two purposes: deciding whether a download uses a `SecureDownloader` or the old methods, and managing the runtime metadata and download cache in `USER_DATA_DIR` (~/.local/share/pip/ on linux). The runtime metadata and the cache are shared by all pip installs. A pip install now includes 'bootstrap metadata' (aka the source of trust for the default repository) which will be installed into `USER_DATA_DIR` if the runtime metadata does not exist yet.
* `SecureDownloader` takes care of downloading index files and distribution files from a specific repository. Both cases consist of the same steps. The differences between index and distribution file downloads are the mirror config (which server is being queried) and how the name of the TUF target (the identifier TUF uses for specific files) is formed.
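Put together, the two code paths in fetch_page() could look like the following sketch (all class and function names here are illustrative stand-ins, not pip's actual API):

```python
class SecureUpdateSession:
    """Illustrative stand-in: maps index urls to downloaders."""
    def __init__(self, downloaders):
        self._downloaders = downloaders

    def get_downloader(self, index_url):
        return self._downloaders.get(index_url)

def fetch_page(secure_session, http_get, project_url):
    """Two code paths: prefer the repository's SecureDownloader when
    one exists, otherwise fall back to the plain HTTP session."""
    # Derive the index url from the project url ('{index_url}/{project}/').
    index_url = project_url.rstrip("/").rsplit("/", 1)[0] + "/"
    downloader = secure_session.get_downloader(index_url)
    if downloader is not None:
        return downloader(project_url)   # TUF-verified download
    return http_get(project_url)         # legacy code path
```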
Remaining Work
* Make the changes to `LinkCollector.fetch_page()` and `operations.prepare.get_http_url()` as trivially understandable as possible

Discussion items
As I said, I'm not looking to merge this PR... but in addition to high-level review I'm hoping for a discussion on whether these issues are likely to be blockers for eventually merging a PR like this.