
PEP-458 Implementation (Secure downloads with signed metadata) #9041

Closed
wants to merge 48 commits

Conversation

@jku (Contributor) commented Oct 23, 2020

Purpose

This PR adds support for the upcoming Warehouse feature in which the hashes of all index files and all distribution files are signed (initially by Warehouse at distribution upload time, perhaps later by developers), as documented in PEP 458. With this PR, pip can verify these signatures and hashes to attest that

  • the files are the same files that were signed at distribution upload time
  • the delivery process has not been tampered with

This is not a final version (it cannot be, as the Warehouse API is not final), and I'm not trying to get this version merged. I'm hoping for a high-level review that points out possible design mistakes and barriers to eventual merging, plus some discussion of the issues listed in the last section.

The related issue is #8585. I keep my branches and also track issues in https://github.com/jku/pip. To look at just the actual changes (without the vendored sources) see jku/pip@master...jku:tuf-mvp

Current state

This is Work-In-Progress, but a functional one: it works against a mock Warehouse, but has been tested only in a limited capacity, as the real Warehouse feature is not quite ready yet.

A real Warehouse test instance should be available Real Soon Now: I suggest testing this branch only at that point (I will notify here when I've made the small changes that the test server might require). It will also be much easier to understand the metadata then, when the files can simply be downloaded for review from the Warehouse server.

Why the dependencies?

A practical download verification system cannot just verify signatures with set-in-stone keys: it must also handle things like revoking and replacing keys, and ideally should give assurances about the delivery process (e.g. prevent mix-and-match or indefinite-freeze attacks). This is non-trivial and a good reason to use external components. The required new dependencies are:

  • TUF handles the details of metadata management and also verifying the files and the delivery process
  • securesystemslib is used by TUF as the cryptography layer (in the pip case hashlib is used for hashes and an internal modified version of the reference implementation of Ed25519 is used for signature verification)

Both projects are invested in the Warehouse/pip integration: changes required to make pip work smoothly are possible (and many have already been made).

Currently these dependencies mean ~17k lines of vendored sources (these are comment heavy projects: sloccount counts 4630 lines of code total). If you'd rather look at a branch without the vendored sources, just skip the last commit or look at master...jku:tuf-mvp (this branch works with out-of-tree tuf and securesystemslib)

User visible changes

The goal is to be invisible to the user. This is currently not quite the case as progress indication does not happen when the TUF code paths are used (more on this later).

There are no user-visible configuration options. There should probably be an option to disable TUF support (analogous to '--trusted-host'), but this does not exist yet.

I have some estimates of data transfer overhead with current TUF version (but Warehouse test server is needed to get proper numbers):

  • single package install typically transfers 5% more data (assuming average 2.2MB package)
  • a ten-package install typically transfers 1.7% more data (assuming average 2.2MB packages)
  • all of the increase is metadata downloads from the repository (pypi.org)

With small packages the overhead could be noticeable to users. In exceptional circumstances (such as rotating high-level keys) the overhead can be significantly larger, but those should be very rare events. The very first pip execution might also have higher overhead, depending on the amount of metadata we decide to ship with pip.
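A rough model reproducing the figures above. The split into a fixed per-run metadata cost and a per-target metadata cost is my own assumption, with values chosen to match the quoted percentages, not measurements from Warehouse:

```python
# Back-of-envelope model for the overhead figures above. The per-run and
# per-target byte counts are assumptions chosen to reproduce the quoted
# 5% (one package) and 1.7% (ten packages) numbers.
PER_RUN_METADATA = 81_000      # bytes: top-level metadata, once per pip run
PER_TARGET_METADATA = 29_000   # bytes: target-specific metadata, per file
AVG_PACKAGE = 2_200_000        # bytes: the 2.2 MB average assumed above

def overhead_percent(num_packages: int) -> float:
    """Metadata transferred as a percentage of the package payload."""
    metadata = PER_RUN_METADATA + num_packages * PER_TARGET_METADATA
    return 100 * metadata / (num_packages * AVG_PACKAGE)

print(round(overhead_percent(1), 1))   # ~5.0
print(round(overhead_percent(10), 1))  # ~1.7
```

This also makes clear why the relative overhead shrinks as more packages are installed: the fixed per-run metadata cost is amortized across the payload.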

Design

This PR aims to minimize changes to existing pip code for simplicity: I am ready to refactor more extensively if pointed in the right direction.

There are three main points of integration into current pip code:

  • SessionCommandMixin now initializes a SecureUpdateSession (this initialization consists of reading local TUF metadata files, and on first run writing the initial runtime metadata file) that will be used throughout this pip invocation
  • index.collector.LinkCollector.fetch_page() now has two code paths:
    • if the SecureUpdateSession has a SecureDownloader for the repository in question, it is used to download the html page
    • otherwise the old method is used
  • operations.prepare.get_http_url() now has two code paths:
    • if the SecureUpdateSession has a SecureDownloader for the repository in question, it is used to download the distribution file
    • otherwise the old method is used

network.secure_update contains the new components: SecureUpdateSession and SecureDownloader.
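In both call sites the dispatch follows the same shape. A minimal sketch — only the class names come from the PR; the method names and the `legacy_fetch` stub are illustrative:

```python
# Hypothetical sketch of the two-code-path dispatch described above:
# use a SecureDownloader when one is registered for the index URL,
# otherwise fall back to the pre-existing download path.
class SecureDownloader:
    def __init__(self, index_url):
        self.index_url = index_url

    def download_index(self, project_url):
        # TUF download and verification elided in this sketch
        return f"TUF-verified contents of {project_url}"

class SecureUpdateSession:
    def __init__(self, downloaders):
        # lookup table: index URL -> SecureDownloader
        self._downloaders = dict(downloaders)

    def get_downloader(self, index_url):
        return self._downloaders.get(index_url)

def legacy_fetch(project_url):
    return f"plain HTTP contents of {project_url}"

def fetch_page(secure_session, index_url, project_url):
    downloader = secure_session.get_downloader(index_url)
    if downloader is not None:
        return downloader.download_index(project_url)  # new TUF path
    return legacy_fetch(project_url)                   # old path
```

The same pattern applies to `get_http_url()` for distribution files: the lookup decides the path, and repositories without local TUF metadata keep the old behaviour unchanged.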

SecureUpdateSession has two purposes:

  • create a lookup table of repositories (index urls) to SecureDownloaders: this will be later used to decide if an index file or distribution file should be downloaded using a SecureDownloader or using the old methods
  • initialize the cache and metadata directories. The latter is new for pip: the TUF metadata gets updated at runtime, but it is not a cache, since deleting it has negative consequences, so it is stored in USER_DATA_DIR (~/.local/share/pip/ on Linux). The runtime metadata and the cache are shared by all pip installs. A pip install now includes 'bootstrap metadata' (the source of trust for the default repository), which is installed into USER_DATA_DIR if the runtime metadata does not exist yet.
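The bootstrap step could look roughly like this; the function name and directory layout are illustrative, not the PR's exact code:

```python
# Sketch of the "bootstrap metadata" installation described above: on
# first run, the metadata shipped with pip is copied into the user data
# directory; subsequent runs leave the runtime metadata untouched.
import os
import shutil

def ensure_runtime_metadata(bundled_dir: str, user_data_dir: str) -> str:
    """Seed the runtime TUF metadata directory from the copy shipped
    with pip, but only if it does not exist yet (first run)."""
    metadata_dir = os.path.join(user_data_dir, "tuf")
    if not os.path.isdir(metadata_dir):
        # First run: install the bootstrap metadata (the source of
        # trust for the default repository) into USER_DATA_DIR.
        shutil.copytree(bundled_dir, metadata_dir)
    return metadata_dir
```

The "only if absent" guard matters: the runtime metadata evolves via TUF updates after installation, so overwriting it with the bundled copy on every run would roll trust back.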

SecureDownloader takes care of downloading index files and distribution files from a specific repository. Both cases consist of the same steps:

  • refresh TUF generic metadata (once per pip startup)
  • for every index or distribution file
    • download target specific metadata
    • check local cache
    • on a cache miss, download the target file

The differences between index and distribution file downloads are the mirror config (which server is being queried) and how the name of the TUF target (the identifier TUF uses for specific files) is formed.
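The steps above can be sketched against the shape of the legacy python-tuf Updater client API. Treat the Updater method names as approximations rather than the exact interface, and the class body as a simplification of the PR:

```python
# Sketch of the per-file flow listed above: refresh generic metadata once,
# then for each target fetch its metadata, check the local cache, and
# download only on a miss. The updater object stands in for a python-tuf
# Updater configured for one repository.
import os

class SecureDownloader:
    def __init__(self, updater, cache_dir):
        self._updater = updater      # TUF updater for this repository
        self._cache_dir = cache_dir
        self._refreshed = False

    def download(self, target_name):
        if not self._refreshed:
            self._updater.refresh()  # generic metadata, once per startup
            self._refreshed = True
        # target-specific metadata (hashes, length) for this file
        target = self._updater.get_one_valid_targetinfo(target_name)
        path = os.path.join(self._cache_dir, target_name)
        if self._updater.updated_targets([target], self._cache_dir):
            # cache miss or stale copy: fetch and verify the file
            self._updater.download_target(target, self._cache_dir)
        return path
```

Index and distribution downloads would then share this class, differing only in which mirror configuration the updater was built with and how `target_name` is formed.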

Remaining Work

Discussion items

As I said, I'm not looking to merge this PR... but in addition to high-level review I'm hoping for a discussion of whether these issues are likely to block eventually merging a PR like this:

  • progress indication: Currently TUF controls the file download completely, so progress handlers are not possible. Adding progress handlers to TUF is certainly a possibility (but see below).
  • control over download details: As noted above, TUF controls the download. This prevents the sort of micromanagement of HTTP download details that pip currently does. If this seems unacceptable, there are possible TUF design changes that could be discussed (so that pip would be in charge of actually fetching the files; this would of course solve the progress indication issue as well).
  • threads: TUF does not support use from multiple threads. This is currently not an issue in pip except in "pip list -o/-u"; that multithreading is disabled in the PR when TUF is used.

Jussi Kukkonen added 30 commits October 21, 2020 11:17
This prevents Updater from making distribution file requests to the
index file mirror (pypi.org). It's possible to do the same for index
file requests to distribution file mirror but this is less critical as
modern python dictionary is ordered: index file mirror is always tested
first.
get_tuf_updaters() returns a dictionary of
  index_url => Updater
configured with appropriate cache and metadata
directories
This is still not even nearly correct but it's easier:
Everything gets downloaded into cache dir, and the path gets
returned. Let's make it work in a simple way first...
When we have local TUF metadata for the repository, use TUF to download
the distribution files.

This has several design issues and some documented bugs but works when
 * cache is used
 * local TUF metadata for the repo is present in ~/.local/share/pip/tuf/
* Implemented in LinkCollector
* Expects the index files to exist at path
  {INDEX_URL}/{PROJECT}/{HASH}.index.html
* this now leads to confined_target_dirs not being usable, meaning that
  the index file server will get requests for data files: this should
  not happen, we should do some kind of mirror-switcheroo instead

The code now expects LinkCollector to only be used for project urls:
this may be reasonable or not...

This commit probably breaks most commands other than 'install': the
LinkCollector API has changed and those call sites have not been
updated
Current TUF Updater mirror config means queries to the wrong server:
theupdateframework/python-tuf#1143

Workaround by storing two separate mirror configurations: one for
index file downloads and one for distribution file downloads.
This still requires knowledge of the url structure, but should now work
on servers that don't have the entrypoint '/packages/'.
When --no-cache is given, cache_dir is None. Use a temporary directory
in this case.
We need to ship initial metadata for supported repositories (read
pypi.org) with pip. This is used to populate the actual metadata
directory if it does not exist.

This commit includes a placeholder metadata for repository
"https://localhost:8000/simple/": it should be replaced with pypi.org
metadata once it is available.
This is what seems to be expected when distribution download fails
This happens e.g. when user tries to install a non-existent project.
Also remove some comments about cache: it does work now
Add a list of index urls that really should work securely
(in the future this should include pypi.org). Raise an exception
if we do not find a secure downloader for an index url in the list.
Also move the cache/data dir selection inside SecureUpdateSession.
Note that list is the only place that does multithreaded repository
access: This will fail with TUF so the parallelism is disabled if
TUF is being used.
SessionCommandMixin now handles both PipSession and
SecureUpdateSession.
We parse the index url from a project url in two places
(before looking up a SecureDownloader): use similar
methods and comments.
This is test metadata only
Jussi Kukkonen added 10 commits October 22, 2020 15:15
This allows us to separate missing local metadata (which typically
happens for every repository that we do not ship metadata for), and
actual errors while reading local metadata.

Requires tuf 0.15
New TUF allows us to
* not specify targets_path or metadata_path if the mirror does not
  serve any targets or metadata respectively.
* not specify confined_target_dirs (default is no confinement)

Requires tuf 0.15
This does not include the vendored sources themselves, run:
  vendoring sync -v

The patches are a temporary hack: vendoring tool cannot cope with the
import style used so this is a stop gap solution
the api package was added in tuf 0.15: it is not needed by the client

LICENSE files were added in 0.15
@brainwane (Contributor)

I think @dstufft @xavfernandez @pfmoore @cjerdonek and @chrahunt might be particularly interested in taking a look at this?

@pfmoore (Member) commented Nov 3, 2020

Sorry, I'm not familiar with TUF and I'm not a security specialist, so I'd prefer to leave it to others.

@joshuagl commented Nov 3, 2020

Note that the discussion items in this PR are not related to TUF or security. They are questions for pip maintainers about whether some changes to the way pip works (progress indication, downloading distributions, and use of multithreading) would be acceptable.

@pfmoore (Member) commented Nov 3, 2020

OK. I'm still not going to review a 65-file PR, but I can give you my high-level view.

  1. I'm not keen on losing control of download progress reporting. I think pip probably does too much reporting, so I'm not actually advocating for pip doing any more than it does at the moment, but I think that being completely silent while downloading a big package like scipy or pytorch is probably not ideal.
  2. I don't personally need control over things like certificates, retry/timeout parameters, etc, and I set my proxy via environment variables, but many pip users do need these features. I'm not a fan of pip exposing (and hence having to manage) all of these network details, so I'd in theory be OK with passing that off to some external library. However:
    • TUF would need to provide the same functionality as pip, or demonstrate that it's not needed. We can't drop features users need without a reason.
    • It would be all or nothing. There's no way we could have pip settings being used for private indexes or direct https URLs, or things like pip list or pip search, but TUF settings for PyPI downloads. That would be a UI and maintenance mess. So the TUF libraries would need to handle all of pip's network needs, not just TUF-managed ones.
  3. Lack of multithreading is not a current issue. I'm somewhat concerned that we could be locking out potential future optimisations, but the reality is that no-one has yet demonstrated hard benefits from multithreading¹, so any objection I have is based on speculation. I am a little uncomfortable if we're locking ourselves out of any sort of parallelism in downloads forever, though - is the "no multithreading" limitation a fundamental feature of the TUF design, or just an implementation detail of how it currently works? (I'm not asking for change here, just trying to understand the possibilities).

¹ The GSOC work this year looked at this, but I believe the results were inconclusive, so I don't think the door is completely closed on the possibility.

@McSinyx (Contributor) commented Nov 4, 2020

The GSOC work this year looked at this, but I believe the results were inconclusive, so I don't think the door is completely closed on the possibility.

Can confirm, I'm the GSoC student.

@jku (Contributor, Author) commented Nov 4, 2020

Thanks Paul, very useful. I'll give others time to comment as well ... just one comment to not scare others off:

I'm still not going to review a 65-file PR

That number includes the vendoring: vendored files are included in the PR to show that the tests pass but admittedly they also make high-level review tricky. master..tuf-mvp shows the changes in pip: actual code changes are ~450 lines, the major ones in four files.

If this feels like a difficult PR to comment on, I'm willing to do more work to enable review -- I'm just not sure what that could be? E.g. a zoom-call or a chat is certainly possible.

@jku (Contributor, Author) commented Nov 13, 2020

An update on the discussion items:

  • On control over the download (e.g. progress indication and canceling) as well as control over the session configuration (proxies, timeouts, etc.): Paul makes good points and broadly speaking I agree. I've had a look at the TUF code and there may be a fairly painless change we could make there to let pip handle the downloads completely: "updater: abstract out the network IO" (theupdateframework/python-tuf#1213). It's not necessarily something I'll implement without knowing that it's needed, but it's certainly an option on the table.
  • On the topic of parallel downloads: I could see parallelizing actual target downloads (i.e. PyPI index files and distribution files) as a future feature, especially if pip handles the actual download as discussed above. However, the current API is blocking and single-threaded, and I don't see that changing in the near future: the sort of parallelization I could see happening without major rewrites is that download_target() could accept several target names, and the custom download implementation could then download those targets (distribution files) in parallel.
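The batched download_target() idea could take roughly this shape. This is a purely hypothetical interface, not anything that exists in python-tuf; it only illustrates handing a list of targets to a caller-supplied download callable:

```python
# Hypothetical batched download interface: the caller supplies the
# per-target fetch callable, which may run concurrently; TUF would still
# verify each downloaded file afterwards (verification elided here).
from concurrent.futures import ThreadPoolExecutor

def download_targets(target_names, fetch_one, max_workers=4):
    """Fetch several targets via the supplied callable, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_one, target_names)
    return dict(zip(target_names, results))
```

The point of this shape is that the parallelism lives entirely in the caller-supplied `fetch_one` plumbing, so the TUF metadata handling itself could stay single-threaded.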

If there are more comments, I'd love to hear them! If you're still considering reviewing: I suggest jku/pip@master...jku:tuf-mvp (branch without vendored dependencies).

@pradyunsg (Member) commented Dec 29, 2020

I've finally been able to carve out a big enough chunk of time to review this. (Update from future me: this post is going to be in bullet points and is generally very "rough", because it took way longer than I expected -- I'm over time and have to make dinner! Apologies!)

  • You'd likely want to rebase this PR off the current master branch, since there's significant CI and Python-support changes since this PR was filed. :)
  • Naming: SecureDownloadSession implies that it implements the requests.Session interface, like PipSession does. Similarly, we'd probably want to not call the class SecureDownloader since it does not implement the same interface as Downloader classes in the codebase (in pip._internal.network.download). Perhaps SecureDownloadSession -> SecureRepositoryManager and SecureDownloader -> SecureRepository would be better names for them?
  • I think the newly introduced SecureDownloadSession is being passed around way too much. This feels like a red-flag to me, in terms of design. I think it can become a helper class that's completely contained within PipSession and be managed entirely within that? If so, this will help simplify a whole bunch of "passing this new thing around", and make this change easier to understand as well.
  • Download progress indication: This is a really important part of the download process, and it would be a major regression to lose it. As someone currently using crappy internet, I'm sure that pip losing the capability to print progress bars would not be well received by our users. :)
  • Speculative/broad: Would it make sense to break up the tuf package into smaller tuf_client / tuf_server packages, since it's quite unlikely that the same downstream codebase needs both components?
  • Vendoring and patching: uhm... let's make the changes upstream, so that we don't have to keep that... scary, fragile patch?

If this feels like a difficult PR to comment on, I'm willing to do more work to enable review -- I'm just not sure what that could be? E.g. a zoom-call or a chat is certainly possible.

I usually nudge folks to either (a) break up PRs into smaller chunks, which makes it much easier to review them or (b) create a good "story" in the commits of a big PR, such that it can be reviewed in a commit-by-commit manner across multiple short bursts.

Given the generally asynchronous nature + limited availability of pip's maintainers, either of those might be a worthwhile investment. :)

@jku (Contributor, Author) commented Jan 11, 2021

Thanks for your review, this is what I was looking for. All your comments seem reasonable: if I don't respond to a specific item, assume I agree 100%.

  • Naming: you make good points, I'll work on both names
  • SecureDownloadSession being passed around too much: I agree it's ugly. I originally did exactly what you suggested, but it felt a bit off, as PipSession is really a customized requests.Session: adding a somewhat unrelated thing there seemed unclean. You may be right that it's still the better choice; it would definitely make the patch cleaner.
  • download progress & vendoring patches: Fully agreed, the required TUF changes are already being worked on
  • breaking up the tuf package: This is on the road map but might not happen right away

I'm going to work on the TUF changes in the next weeks, then do another pass at this.

@pradyunsg (Member)

Thanks! I do appreciate the work you're doing to push this forward. :)

@BrownTruck (Contributor)

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Feb 4, 2021
@pradyunsg (Member)

@jku What's the state of affairs for this PR? I've been closing most of the outdated PRs that the bot flagged last month.

For this one though, I'm a bit wary of closing it as "closed due to lack of activity", given the long feedback loops we've had here. Would you be able to update this PR, or better yet, break it up into smaller chunks? :)

@jku (Contributor, Author) commented Mar 5, 2021

Thanks for the poke. I think I'll close this: I'm starting work on this again next week (all the dependency changes are now done), but it might still take some time and, based on the feedback, look quite different when ready. So I'll open new PR(s) and link back to this earlier discussion at that point.

@jku jku closed this Mar 5, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 1, 2021