
PEP-458 Implementation (Secure downloads with signed metadata) #9041

Closed
wants to merge 48 commits

Conversation

@jku (Contributor) commented Oct 23, 2020

Purpose

This PR adds support for the upcoming Warehouse feature in which the hashes of all index files and all distribution files are signed (initially by Warehouse at distribution upload time, perhaps later by developers), as documented in PEP 458. With this PR, pip can verify these signatures and hashes to attest that

  • the files are the same files that were signed at distribution upload time
  • the delivery process has not been tampered with

This is not a final version (it cannot be, as the Warehouse API is not final), and I'm not trying to get this version merged. I'm hoping for a high-level review that points out possible design mistakes and barriers to eventual merging, plus some discussion of the issues listed in the last section.

The related issue is #8585. I keep my branches and also track issues in https://github.com/jku/pip. To look at just the actual changes (without the vendored sources) see jku/pip@master...jku:tuf-mvp

Current state

This is Work-In-Progress, but a functional one: it works against a mock Warehouse, but has been tested only in a limited capacity, as the real Warehouse feature is not quite ready yet.

A real Warehouse test instance should be available Real Soon Now: I suggest testing this branch only at that point (I will notify here when I've made the small changes that the test server might require). It will also be much easier to understand the metadata then, when the files can simply be downloaded for review from the Warehouse server.

Why the dependencies?

A practical download verification system cannot just verify signatures with set-in-stone keys: it must also handle things like revoking and replacing keys, and ideally should give assurances about the delivery process (e.g. prevent mix-and-match or indefinite-freeze attacks). This is non-trivial and a good reason to use external components. The required new dependencies are:

  • TUF handles the details of metadata management and also verifying the files and the delivery process
  • securesystemslib is used by TUF as the cryptography layer (in the pip case hashlib is used for hashes and an internal modified version of the reference implementation of Ed25519 is used for signature verification)

Both projects are invested in the Warehouse/pip integration: changes required to make pip work smoothly are possible (and many have already been made).

Currently these dependencies mean ~17k lines of vendored sources (these are comment heavy projects: sloccount counts 4630 lines of code total). If you'd rather look at a branch without the vendored sources, just skip the last commit or look at master...jku:tuf-mvp (this branch works with out-of-tree tuf and securesystemslib)

User visible changes

The goal is to be invisible to the user. This is currently not quite the case as progress indication does not happen when the TUF code paths are used (more on this later).

There are no user-visible configuration options. There should probably be an option to disable TUF support (analogous to '--trusted-host'), but this does not exist yet.

I have some estimates of data transfer overhead with current TUF version (but Warehouse test server is needed to get proper numbers):

  • single package install typically transfers 5% more data (assuming average 2.2MB package)
  • a ten-package install typically transfers 1.7% more data (assuming average 2.2MB packages)
  • all of the increase is metadata downloads from the repository (pypi.org)

With small packages the overhead could be noticeable to users. In exceptional circumstances (such as rotating high-level keys) the overhead can be significantly larger, but those should be very rare events. The very first pip execution might also have higher overhead, depending on the amount of metadata we decide to ship with pip.
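A rough model reproducing the figures above. The split into a fixed per-run metadata cost and a per-target metadata cost is my own assumption, with values chosen to match the quoted percentages, not measurements from Warehouse:

```python
# Back-of-envelope model for the overhead figures above. The per-run and
# per-target byte counts are assumptions chosen to reproduce the quoted
# 5% (one package) and 1.7% (ten packages) numbers.
PER_RUN_METADATA = 81_000      # bytes: top-level metadata, once per pip run
PER_TARGET_METADATA = 29_000   # bytes: target-specific metadata, per file
AVG_PACKAGE = 2_200_000        # bytes: the 2.2 MB average assumed above

def overhead_percent(num_packages: int) -> float:
    """Metadata transferred as a percentage of the package payload."""
    metadata = PER_RUN_METADATA + num_packages * PER_TARGET_METADATA
    return 100 * metadata / (num_packages * AVG_PACKAGE)

print(round(overhead_percent(1), 1))   # ~5.0
print(round(overhead_percent(10), 1))  # ~1.7
```

This also makes clear why the relative overhead shrinks as more packages are installed: the fixed per-run metadata cost is amortized across the payload.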

Design

This PR aims to minimize changes to existing pip code for simplicity: I am ready to refactor more extensively if pointed in the right direction.

There are three main points of integration into current pip code:

  • SessionCommandMixin now initializes a SecureUpdateSession (this initialization consists of reading local TUF metadata files, and on first run writing the initial runtime metadata file) that will be used throughout this pip invocation
  • index.collector.LinkCollector.fetch_page() now has two code paths:
    • if the SecureUpdateSession has a SecureDownloader for the repository in question, it is used to download the html page
    • otherwise the old method is used
  • operations.prepare.get_http_url() now has two code paths:
    • if the SecureUpdateSession has a SecureDownloader for the repository in question, it is used to download the distribution file
    • otherwise the old method is used

network.secure_update contains the new components: SecureUpdateSession and SecureDownloader.
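In both call sites the dispatch follows the same shape. A minimal sketch — only the class names come from the PR; the method names and the `legacy_fetch` stub are illustrative:

```python
# Hypothetical sketch of the two-code-path dispatch described above:
# use a SecureDownloader when one is registered for the index URL,
# otherwise fall back to the pre-existing download path.
class SecureDownloader:
    def __init__(self, index_url):
        self.index_url = index_url

    def download_index(self, project_url):
        # TUF download and verification elided in this sketch
        return f"TUF-verified contents of {project_url}"

class SecureUpdateSession:
    def __init__(self, downloaders):
        # lookup table: index URL -> SecureDownloader
        self._downloaders = dict(downloaders)

    def get_downloader(self, index_url):
        return self._downloaders.get(index_url)

def legacy_fetch(project_url):
    return f"plain HTTP contents of {project_url}"

def fetch_page(secure_session, index_url, project_url):
    downloader = secure_session.get_downloader(index_url)
    if downloader is not None:
        return downloader.download_index(project_url)  # new TUF path
    return legacy_fetch(project_url)                   # old path
```

The same pattern applies to `get_http_url()` for distribution files: the lookup decides the path, and repositories without local TUF metadata keep the old behaviour unchanged.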

SecureUpdateSession has two purposes:

  • create a lookup table of repositories (index urls) to SecureDownloaders: this will be later used to decide if an index file or distribution file should be downloaded using a SecureDownloader or using the old methods
  • initialize the cache and metadata directories. The latter is new for pip: the TUF metadata gets updated at runtime, but it is not a cache, since deleting it has negative consequences, so it is stored in USER_DATA_DIR (~/.local/share/pip/ on Linux). The runtime metadata and the cache are shared by all pip installs. A pip install now includes 'bootstrap metadata' (the source of trust for the default repository), which is installed into USER_DATA_DIR if the runtime metadata does not exist yet.
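The bootstrap step could look roughly like this; the function name and directory layout are illustrative, not the PR's exact code:

```python
# Sketch of the "bootstrap metadata" installation described above: on
# first run, the metadata shipped with pip is copied into the user data
# directory; subsequent runs leave the runtime metadata untouched.
import os
import shutil

def ensure_runtime_metadata(bundled_dir: str, user_data_dir: str) -> str:
    """Seed the runtime TUF metadata directory from the copy shipped
    with pip, but only if it does not exist yet (first run)."""
    metadata_dir = os.path.join(user_data_dir, "tuf")
    if not os.path.isdir(metadata_dir):
        # First run: install the bootstrap metadata (the source of
        # trust for the default repository) into USER_DATA_DIR.
        shutil.copytree(bundled_dir, metadata_dir)
    return metadata_dir
```

The "only if absent" guard matters: the runtime metadata evolves via TUF updates after installation, so overwriting it with the bundled copy on every run would roll trust back.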

SecureDownloader takes care of downloading index files and distribution files from a specific repository. Both cases consist of the same steps:

  • refresh TUF generic metadata (once per pip startup)
  • for every index or distribution file
    • download target specific metadata
    • check local cache
    • on a cache miss, download the target file

The differences between index and distribution file downloads are the mirror config (which server is being queried) and how the name of the TUF target (the identifier TUF uses for specific files) is formed.
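The steps above can be sketched against the shape of the legacy python-tuf Updater client API. Treat the Updater method names as approximations rather than the exact interface, and the class body as a simplification of the PR:

```python
# Sketch of the per-file flow listed above: refresh generic metadata once,
# then for each target fetch its metadata, check the local cache, and
# download only on a miss. The updater object stands in for a python-tuf
# Updater configured for one repository.
import os

class SecureDownloader:
    def __init__(self, updater, cache_dir):
        self._updater = updater      # TUF updater for this repository
        self._cache_dir = cache_dir
        self._refreshed = False

    def download(self, target_name):
        if not self._refreshed:
            self._updater.refresh()  # generic metadata, once per startup
            self._refreshed = True
        # target-specific metadata (hashes, length) for this file
        target = self._updater.get_one_valid_targetinfo(target_name)
        path = os.path.join(self._cache_dir, target_name)
        if self._updater.updated_targets([target], self._cache_dir):
            # cache miss or stale copy: fetch and verify the file
            self._updater.download_target(target, self._cache_dir)
        return path
```

Index and distribution downloads would then share this class, differing only in which mirror configuration the updater was built with and how `target_name` is formed.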

Remaining Work

Discussion items

As I said, I'm not looking to merge this PR... but in addition to high-level review I'm hoping for a discussion of whether these issues are likely to block eventually merging a PR like this:

  • progress indication: Currently TUF controls the file download completely, so progress handlers are not possible. Adding progress handlers to TUF is certainly a possibility (but see below).
  • control over download details: As noted above, TUF controls the download. This prevents the sort of micromanagement of HTTP download details that pip currently does. If this seems unacceptable, there are possible TUF design changes that could be discussed (so that pip would be in charge of actually fetching the files; this would of course solve the progress indication issue as well).
  • threads: TUF does not support use from multiple threads. This is currently not an issue in pip except in "pip list -o/-u"; that multithreading is disabled in the PR when TUF is used.

Jussi Kukkonen added 30 commits October 21, 2020 11:17
This prevents Updater from making distribution file requests to the
index file mirror (pypi.org). It's possible to do the same for index
file requests to distribution file mirror but this is less critical as
modern python dictionary is ordered: index file mirror is always tested
first.
get_tuf_updaters() returns a dictionary of
  index_url => Updater
configured with appropriate cache and metadata
directories
This is still not even nearly correct but it's easier:
Everything gets downloaded into cache dir, and the path gets
returned. Let's make it work in a simple way first...
When we have local TUF metadata for the repository, use TUF to download
the distribution files.

This has several design issues and some documented bugs but works when
 * cache is used
 * local TUF metadata for the repo is present in ~/.local/share/pip/tuf/
* Implemented in LinkCollector
* Expects the index files to exist at path
  {INDEX_URL}/{PROJECT}/{HASH}.index.html
* this now leads to confined_target_dirs not being usable, meaning that
  the index file server will get requests for data files: this should
  not happen, we should do some kind of mirror-switcheroo instead

The code now expects LinkCollector to only be used for project urls:
this may be reasonable or not...

This commit probably breaks most commands other than 'install': the
LinkCollector API has changed and those call sites have not been
updated
Current TUF Updater mirror config means queries to the wrong server:
theupdateframework/python-tuf#1143

Workaround by storing two separate mirror configurations: one for
index file downloads and one for distribution file downloads.
This still requires knowledge of the url structure, but should now work
on servers that don't have the entrypoint '/packages/'.
When --no-cache is given, cache_dir is None. Use a temporary directory
in this case.
We need to ship initial metadata for supported repositories (read
pypi.org) with pip. This is used to populate the actual metadata
directory if it does not exist.

This commit includes a placeholder metadata for repository
"https://localhost:8000/simple/": it should be replaced with pypi.org
metadata once it is available.
This is what seems to be expected when distribution download fails
This happens e.g. when user tries to install a non-existent project.
Also remove some comments about cache: it does work now
Add a list of index urls that really should work securely
(in the future this should include pypi.org). Raise an exception
if we do not find a secure downloader for an index url in the list.
Also move the cache/data dir selection inside SecureUpdateSession.
Note that list is the only place that does multithreaded repository
access: This will fail with TUF so the parallelism is disabled if
TUF is being used.
SessionCommandMixin now handles both PipSession and
SecureUpdateSession.
We parse the index url from a project url in two places
(before looking up a SecureDownloader): use similar
methods and comments.
This is test metadata only
Jussi Kukkonen added 10 commits October 22, 2020 15:15
This allows us to separate missing local metadata (which typically
happens for every repository that we do not ship metadata for), and
actual errors while reading local metadata.

Requires tuf 0.15
New TUF allows us to
* not specify targets_path or metadata_path if the mirror does not
  serve any targets or metadata respectively.
* not specify confined_target_dirs (default is no confinement)

Requires tuf 0.15
This does not include the vendored sources themselves, run:
  vendoring sync -v

The patches are a temporary hack: vendoring tool cannot cope with the
import style used so this is a stop gap solution
the api package was added in tuf 0.15: it is not needed by the client

LICENSE files were added in 0.15
@brainwane (Contributor)

I think @dstufft @xavfernandez @pfmoore @cjerdonek and @chrahunt might be particularly interested in taking a look at this?

@pfmoore (Member) commented Nov 3, 2020

Sorry, I'm not familiar with TUF and I'm not a security specialist, so I'd prefer to leave it to others.

@joshuagl commented Nov 3, 2020

Note that the discussion items in this PR are not related to TUF or security. They are questions for pip maintainers about whether some changes to the way pip works (progress indication, downloading distributions, and use of multithreading) would be acceptable.

@pfmoore (Member) commented Nov 3, 2020

OK. I'm still not going to review a 65-file PR, but I can give you my high-level view.

  1. I'm not keen on losing control of download progress reporting. I think pip probably does too much reporting, so I'm not actually advocating for pip doing any more than it does at the moment, but I think that being completely silent while downloading a big package like scipy or pytorch is probably not ideal.
  2. I don't personally need control over things like certificates, retry/timeout parameters, etc, and I set my proxy via environment variables, but many pip users do need these features. I'm not a fan of pip exposing (and hence having to manage) all of these network details, so I'd in theory be OK with passing that off to some external library. However:
    • TUF would need to provide the same functionality as pip, or demonstrate that it's not needed. We can't drop features users need without a reason.
    • It would be all or nothing. There's no way we could have pip settings being used for private indexes or direct https URLs, or things like pip list or pip search, but TUF settings for PyPI downloads. That would be a UI and maintenance mess. So the TUF libraries would need to handle all of pip's network needs, not just TUF-managed ones.
  3. Lack of multithreading is not a current issue. I'm somewhat concerned that we could be locking out potential future optimisations, but the reality is that no-one has yet demonstrated hard benefits from multithreading¹, so any objection I have is based on speculation. I am a little uncomfortable if we're locking ourselves out of any sort of parallelism in downloads forever, though - is the "no multithreading" limitation a fundamental feature of the TUF design, or just an implementation detail of how it currently works? (I'm not asking for change here, just trying to understand the possibilities).

¹ The GSOC work this year looked at this, but I believe the results were inconclusive, so I don't think the door is completely closed on the possibility.

@McSinyx (Contributor) commented Nov 4, 2020

The GSOC work this year looked at this, but I believe the results were inconclusive, so I don't think the door is completely closed on the possibility.

Can confirm, I'm the GSoC student.

@jku (Contributor, Author) commented Nov 4, 2020

Thanks Paul, very useful. I'll give others time to comment as well ... just one comment to not scare others off:

I'm still not going to review a 65-file PR

That number includes the vendoring: vendored files are included in the PR to show that the tests pass but admittedly they also make high-level review tricky. master..tuf-mvp shows the changes in pip: actual code changes are ~450 lines, the major ones in four files.

If this feels like a difficult PR to comment on, I'm willing to do more work to enable review -- I'm just not sure what that could be? E.g. a zoom-call or a chat is certainly possible.

@jku (Contributor, Author) commented Nov 13, 2020

An update on the discussion items:

  • On control over the download (e.g. progress indication and canceling) as well as control over the session configuration (proxies, timeouts, etc.): Paul makes good points and broadly speaking I agree. I've had a look at the TUF code and there may be a fairly painless change we could make there to let pip handle the downloads completely: "updater: abstract out the network IO" (theupdateframework/python-tuf#1213). It's not necessarily something I'll implement without knowing that it's needed, but it's certainly an option on the table.
  • On the topic of parallel downloads: I could see parallelizing actual target downloads (i.e. PyPI index files and distribution files) as a future feature, especially if pip handles the actual download as discussed above. However, the current API is blocking and single-threaded, and I don't see that changing in the near future: the sort of parallelization I could see happening without major rewrites is that download_target() could accept several target names, and the custom download implementation could then download those targets (distribution files) in parallel.
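The batched download_target() idea could take roughly this shape. This is a purely hypothetical interface, not anything that exists in python-tuf; it only illustrates handing a list of targets to a caller-supplied download callable:

```python
# Hypothetical batched download interface: the caller supplies the
# per-target fetch callable, which may run concurrently; TUF would still
# verify each downloaded file afterwards (verification elided here).
from concurrent.futures import ThreadPoolExecutor

def download_targets(target_names, fetch_one, max_workers=4):
    """Fetch several targets via the supplied callable, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_one, target_names)
    return dict(zip(target_names, results))
```

The point of this shape is that the parallelism lives entirely in the caller-supplied `fetch_one` plumbing, so the TUF metadata handling itself could stay single-threaded.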

If there are more comments, I'd love to hear them! If you're still considering reviewing: I suggest jku/pip@master...jku:tuf-mvp (branch without vendored dependencies).

@pradyunsg (Member) commented Dec 29, 2020

I've finally been able to carve out a big enough chunk of time to review this. (Update from future me: this post is going to be in bullet points and is generally very "rough", because it took way longer than I expected -- I'm over time and have to make dinner! Apologies!)

  • You'd likely want to rebase this PR off the current master branch, since there's significant CI and Python-support changes since this PR was filed. :)
  • Naming: SecureDownloadSession implies that it implements the requests.Session interface, like PipSession does. Similarly, we'd probably want to not call the class SecureDownloader since it does not implement the same interface as Downloader classes in the codebase (in pip._internal.network.download). Perhaps SecureDownloadSession -> SecureRepositoryManager and SecureDownloader -> SecureRepository would be better names for them?
  • I think the newly introduced SecureDownloadSession is being passed around way too much. This feels like a red-flag to me, in terms of design. I think it can become a helper class that's completely contained within PipSession and be managed entirely within that? If so, this will help simplify a whole bunch of "passing this new thing around", and make this change easier to understand as well.
  • Download progress indication: This is a really important part of the download process, and it would be a major regression to lose it. As someone currently using crappy internet, I'm sure that pip losing the capability to print progress bars would not be well received by our users. :)
  • Speculative/broad: Would it make sense to break up the tuf package into smaller tuf_client / tuf_server packages, since it's quite unlikely that the same downstream codebase needs both components?
  • Vendoring and patching: uhm... let's make the changes upstream, so that we don't have to keep that... scary, fragile patch?

If this feels like a difficult PR to comment on, I'm willing to do more work to enable review -- I'm just not sure what that could be? E.g. a zoom-call or a chat is certainly possible.

I usually nudge folks to either (a) break up PRs into smaller chunks, which makes it much easier to review them or (b) create a good "story" in the commits of a big PR, such that it can be reviewed in a commit-by-commit manner across multiple short bursts.

Given the generally asynchronous nature + limited availability of pip's maintainers, either of those might be a worthwhile investment. :)

@jku (Contributor, Author) commented Jan 11, 2021

Thanks for your review, this is what I was looking for. All your comments seem reasonable: if I don't respond to a specific item, assume I agree 100%.

  • Naming: you make good points, I'll work on both names
  • SecureDownloadSession being passed around too much: I agree it's ugly. I originally did exactly what you suggested, but it felt a bit off, as PipSession is really a customized requests.Session: adding a somewhat unrelated thing there seemed unclean. You may be right that it's still the better choice; it would definitely make the patch cleaner.
  • download progress & vendoring patches: Fully agreed, the required TUF changes are already being worked on
  • breaking up the tuf package: This is on the road map but might not happen right away

I'm going to work on the TUF changes in the next weeks, then do another pass at this.

@pradyunsg (Member)

Thanks! I do appreciate the work you're doing to push this forward. :)

@BrownTruck (Contributor)

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Feb 4, 2021
@pradyunsg (Member)

@jku What's the state of affairs for this PR? I've been closing most of the outdated PRs that the bot flagged last month.

For this one though, I'm a bit wary of closing it as "closed due to lack of activity", given the long feedback loops we've had here. Would you be able to update this PR, or better yet, break it up into smaller chunks? :)

@jku (Contributor, Author) commented Mar 5, 2021

Thanks for the poke. I think I'll close this: I'm starting work on this again next week (all the dependency changes are now done), but it might still take some time and, based on the feedback, look quite different when ready. So I'll open new PR(s) and link back to this earlier discussion at that point.

@jku jku closed this Mar 5, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 1, 2021