Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hash-checking support to install and sync #2945

Merged
merged 3 commits into from
Apr 10, 2024

Conversation

charliermarsh
Copy link
Member

@charliermarsh charliermarsh commented Apr 9, 2024

Summary

This PR adds support for hash-checking mode in pip install and pip sync. It's a large change, both in terms of the size of the diff and the modifications in behavior, but it's also one that's hard to merge in pieces (at least, with any test coverage) since it needs to work end-to-end to be useful and testable.

Here are some of the most important highlights:

  • We store hashes in the cache. Where we previously stored pointers to unzipped wheels in the archives directory, we now store pointers with a set of known hashes. So every pointer to an unzipped wheel also includes its known hashes.
  • By default, we don't compute any hashes. If the user runs with --require-hashes, and the cache doesn't contain those hashes, we invalidate the cache, redownload the wheel, and compute the hashes as we go. For users that don't run with --require-hashes, there will be no change in performance. For users that do, the only change will be if they don't run with --generate-hashes -- then they may see some repeated work between resolution and installation, if they use pip compile then pip sync.
  • Many of the distribution types now include a hashes field, like CachedDist and LocalWheel.
  • Our behavior is similar to pip, in that we enforce hashes when pulling any remote distributions, and when pulling from our own cache. Like pip, though, we don't enforce hashes if a distribution is already installed.
  • Hash validity is enforced in a few different places:
    1. During resolution, we enforce hash validity based on the hashes reported by the registry. If we need to access a source distribution, though, we then enforce hash validity at that point too, prior to running any untrusted code. (This is enforced in the distribution database.)
    2. In the install plan, we only add cached distributions that have matching hashes. If a cached distribution is missing any hashes, or the hashes don't match, we don't return them from the install plan.
    3. In the downloader, we only return distributions with matching hashes.
    4. The final combination of "things we install" are: (1) the wheels from the cache, and (2) the downloaded wheels. So this ensures that we never install any mismatching distributions.
  • Like pip, if --require-hashes is provided, we require that all distributions are pinned with either == or a direct URL. We also require that all distributions have hashes.

There are a few notable TODOs:

  • We don't support hash-checking mode for unnamed requirements. These should be somewhat rare, though? Since pip compile never outputs unnamed requirements. I can fix this, it's just some additional work.
  • We don't automatically enable --require-hashes with a hash exists in the requirements file. We require --require-hashes.

Closes #474.

Test Plan

I'd like to add some tests for registries that report incorrect hashes, but otherwise: cargo test

@charliermarsh charliermarsh added the enhancement New feature or improvement to existing functionality label Apr 9, 2024
@charliermarsh charliermarsh marked this pull request as ready for review April 9, 2024 20:15
@charliermarsh charliermarsh marked this pull request as draft April 9, 2024 20:16
@charliermarsh charliermarsh force-pushed the charlie/check-hashes-iii branch 9 times, most recently from 4fd10e2 to 1d8c657 Compare April 9, 2024 20:27
},
Self::NoBinary => false,
Self::MismatchedHash => false,
Self::MissingHash => false,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zanieb - I could use your help with this part. I'm not sure if I did the comparisons correctly here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is actually wrong, although I'm not sure if it matters. You're supposed to be enforcing an ordering but here you're saying that MissingHash and MismatchedHash are never more compatible than the other value. This means that if we see both MissingHash and MismatchedHash we would arbitrarily display the first one we saw instead of preferring to present one of them to the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to change up the strategy in #2949, such that we treat distributions without hashes as compatible (but lower-priority). I can merge that into this PR if you agree with the change.

other => other,
}
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BurntSushi - Do you mind reviewing this component? The basic idea is a wrapper around any reader that can compute a set of hashes as we go. It's then used to wrap the async streams as we unzip.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye. I think I have just one concern but otherwise this looks pretty reasonable to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tyvm!

@charliermarsh charliermarsh force-pushed the charlie/check-hashes-iii branch 6 times, most recently from 6ee405c to 4d5fbe0 Compare April 9, 2024 21:18
@charliermarsh charliermarsh marked this pull request as ready for review April 9, 2024 21:18
@charliermarsh charliermarsh force-pushed the charlie/check-hashes-iii branch 3 times, most recently from f3edd39 to 60019b0 Compare April 9, 2024 21:23
@charliermarsh
Copy link
Member Author

Open to advice on how best to test that if a registry tells us the wrong hashes, we still validate them at install time. I need a registry or --find-links entry that reports the wrong hashes for a given distribution.

@charliermarsh charliermarsh requested review from konstin and removed request for BurntSushi April 9, 2024 21:27
@charliermarsh charliermarsh force-pushed the charlie/check-hashes-iii branch 3 times, most recently from 7ba9862 to b1e9f80 Compare April 9, 2024 21:41
@charliermarsh
Copy link
Member Author

charliermarsh commented Apr 9, 2024

A few TODOs:

  • Add some tests for invalid indexes.
  • Run a basic benchmark (especially, to ensure that there's no regression for non---require-hashes).

I'd also like to refactor the dataflow from requirements file to RequireHashes, but this isn't required to merge.

@charliermarsh charliermarsh force-pushed the charlie/check-hashes-iii branch 3 times, most recently from 7f970bf to f66887c Compare April 9, 2024 22:02
);

// Third, request the correct hash, that the registry _thinks_ is correct, but without the
// cache. We _should_ accept it, but we currently don't.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a TODO. We reject distributions without hashes when --require-hashes is provided, whereas in reality we should compute them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I solved this in #2949.

@charliermarsh charliermarsh force-pushed the charlie/check-hashes-iii branch 3 times, most recently from 32a2088 to d8f5d37 Compare April 9, 2024 23:56
@konstin
Copy link
Member

konstin commented Apr 10, 2024

Not exactly this PR (more #2909), but do we have an explanation for .http/.rev files in the docstrings?

Copy link
Member

@konstin konstin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@@ -93,6 +93,7 @@ indoc = { version = "2.0.4" }
itertools = { version = "0.12.1" }
junction = { version = "1.0.0" }
mailparse = { version = "0.14.0" }
md-5 = { version = "0.10.6" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to support md5? It's cryptographically broken

crates/uv-client/src/cached_client.rs Outdated Show resolved Hide resolved
Comment on lines +9 to +12
/// The path to the archive entry in the wheel's archive bucket.
pub path: PathBuf,
/// The computed hashes of the archive.
pub hashes: Vec<HashDigest>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we have a constructor and getters

Suggested change
/// The path to the archive entry in the wheel's archive bucket.
pub path: PathBuf,
/// The computed hashes of the archive.
pub hashes: Vec<HashDigest>,
/// The path to the archive entry in the wheel's archive bucket.
path: PathBuf,
/// The computed hashes of the archive.
hashes: Vec<HashDigest>,

};
let mut hashers = algorithms.into_iter().map(Hasher::from).collect::<Vec<_>>();
let mut hasher = uv_extract::hash::HashReader::new(file, &mut hashers);
uv_extract::stream::unzip(&mut hasher, temp_dir.path()).await?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we spawn blocking here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's async whereas the other version is sync. Does that make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah

crates/uv-distribution/src/error.rs Outdated Show resolved Hide resolved
crates/uv-extract/src/hash.rs Show resolved Hide resolved

/// Exhaust the underlying reader.
pub async fn finish(&mut self) -> Result<(), std::io::Error> {
while self.read(&mut [0; 8192]).await? > 0 {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially putting 8KB on the stack. Maybe not an issue, but it's big enough to stand out to me.

I think I'd probably just use vec![0; 8192] here instead. You could put it in a thread_local! to amortize the alloc if you're concerned about perf.

other => other,
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye. I think I have just one concern but otherwise this looks pretty reasonable to me.

Base automatically changed from charlie/check-hashes to main April 10, 2024 18:07
charliermarsh added a commit that referenced this pull request Apr 10, 2024
## Summary

This lets us remove circular dependencies (in the future, e.g., #2945)
that arise from `FlatIndex` needing a bunch of resolver-specific
abstractions (like incompatibilities, required hashes, etc.) that aren't
necessary to _fetch_ the flat index entries.
@charliermarsh charliermarsh enabled auto-merge (squash) April 10, 2024 19:05
@charliermarsh charliermarsh merged commit 1f3b5bb into main Apr 10, 2024
37 checks passed
@charliermarsh charliermarsh deleted the charlie/check-hashes-iii branch April 10, 2024 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or improvement to existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants