Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tarball: Use cargo_toml to parse Cargo.toml file #6914

Merged
merged 1 commit into from
Aug 6, 2023

Conversation

Turbo87
Copy link
Member

@Turbo87 Turbo87 commented Aug 1, 2023

This PR switches the crates_io_tarball package over to using the https://crates.io/crates/cargo_toml for Cargo.toml file parsing.

This should significantly lower the chances of invalid Cargo.toml files ending up on crates.io, since a lot more fields are now properly validated.

Note that cargo_toml supports workspace inheritance by default, so we have to run an additional validation step to ensure that our server only accepts manifest files without any workspace inheritance.

Similarly, the crate accepts rust-version as any random String, so I had to move our validation check out of the serde deserialization code path.

@Turbo87 Turbo87 added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-backend ⚙️ labels Aug 1, 2023
@Turbo87 Turbo87 requested a review from a team August 1, 2023 11:55
@Turbo87
Copy link
Member Author

Turbo87 commented Aug 1, 2023

I just ran the check_all_crates example binary and we only have two more failures with this change:

 WARN Failed to process tarball pkg_name=geogrid-0.3.0 path=tmp/all-crates/ge/og/geogrid/geogrid-0.3.0.crate error=Cargo.toml manifest is invalid: TOML parse error at line 25, column 7
   |
25 | ocl = {version = "0.12", optional = "true"}
   |       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
data did not match any variant of untagged enum Dependency

and

 WARN Failed to process tarball pkg_name=cargo-registry-0.1.0 path=tmp/all-crates/ca/rg/cargo-registry/cargo-registry-0.1.0.crate error=Cargo.toml manifest is invalid: missing field `package`

(ironic, I know... 😂)

Err(de::Error::invalid_value(value, &expected))
}
}
use cargo_toml::{Dependency, DepsSet, Error, Inheritable, Manifest, Package};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the extra validation worth it?

I'm concerned about us (very easily) adding new things that will then fail on publish. The person publishing would then need to figure out it is a bug in crates.io, report it, someone respond, create a PR upstream with cargo_toml, wait for a fix and publish, then updates and deploy crates.io.

Copy link
Member Author

@Turbo87 Turbo87 Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our long term goal is to reduce the reliance on the metadata JSON blob that is being sent together with the tarball.

an attacker could potentially publish something where the metadata says foo and the tarball says bar, and we would prefer to treat the tarball as the primary source of truth.

this however requires us to parse the Cargo.toml file and if parsing fails we can't accept the upload. we could keep rolling our own structure definitions, but as we've already seen with the readme field, this is quite error-prone.

as the PR description says cargo_toml works for almost all existing crates, so there isn't really a reason to write everything ourselves.

the extra validation is technically not needed if we assume that only cargo is used to publish new crates, but only relying on client-side validation seems quite risky given scenarios like https://blog.vlt.sh/blog/the-massive-hole-in-the-npm-ecosystem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be reasonable to add logging / alerting such that we could detect an increase in failed toml parsing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, that seems like a reasonable idea :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about us (very easily) adding new things that will then fail on publish.

Wouldn't a breaking change like that cause bigger problems anyway, since crates.io is by no means the only downstream parser and consumer of cargo manifests?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even talking about breaking backwards compatibility but breaking forward compatibility.

Think of cases like:

  • Add new supported values
  • Add new supported types

Cargo needs to have a strict parser for these because an unknown value could have a major impact on behavior. If another tool is acting off of the field, they should hard error as well. However, we need to be able to add new values or else cargo is stuck as-is. This is possible to support in serde using a catch-all variant. cargo_toml instead errors on unknown values.

A trickier one is adding new types which we've done but cargo_metadata also has hard errors for that.

So assuming we only maintain backward compatibility. Most likely we'll be playing whack-a-mole with cargo_toml and cargo_metadata for them to properly handle new values being emitted.

We then have to deal with making sure they are updated when the new values are out (at minimum, a strong communication path).

Tying this back to this PR, making most of cargo_toml loosey-goosey for compatibility removes most of the validation that crates-io can do.

Independent of all of that., in my opinion, is that pushing responsibility for this stuff to third-parties has made it so we are unaware of the problems their users face and it is easier to make innocuous changes that break them

@Turbo87
Copy link
Member Author

Turbo87 commented Aug 6, 2023

we've discussed this PR in the crates.io team meeting last friday and decided to move ahead with this change. we might eventually migrate the code to use https://github.com/LukeMathWalker/cargo-manifest instead, but since it is currently still a bit rough around the edges this will be a follow-up PR.

@Turbo87 Turbo87 merged commit bc3dbf6 into rust-lang:main Aug 6, 2023
@Turbo87 Turbo87 deleted the cargo-toml branch August 6, 2023 21:42
@ehuss
Copy link
Contributor

ehuss commented Aug 15, 2023

@Turbo87 The cargo team had some concerns over the specifics of this approach and how it could cause problems for introducing changes in cargo. I was wondering if you would be willing to discuss possible changes to help avoid problems for us and our users?

For example, this would have prevented people from publishing if they started using the new debug profile settings that were recently introduced in cargo. We don't have any mechanism in place to identify when there will be problems. And users won't hit it until usually after stabilization, which could result in a frustrating experience for users when a new feature doesn't work (and doesn't work until the very last step of development with the publishing step). Using third-party crates out of our control also means there would be an indefinite timeline for fixing it, or require more work on your part to fork and patch things.

Are there options or ideas that we could discuss to try to avoid these problems?

I think a long-term goal is possibly for the cargo team to provide a more stable set of serde-based structures, and then we could have a process to immediately push those changes to crates.io and get it updated here. But I think that will likely be a long while for that to settle. Are there alternatives, like possibly using untyped parsers like toml::Value and only inspect the fields that you are interested in? Are there other options that we could take in the meantime?

@Turbo87
Copy link
Member Author

Turbo87 commented Aug 16, 2023

We don't have any mechanism in place to identify when there will be problems

you can always try to publish to the crates.io staging environment :)

And users won't hit it until usually after stabilization

our assumption is that the cargo team notifies us about planned changes to the manifest format, so that we can implement them on our side. I don't expect this to take more than a couple of days, so by the time a change gets stabilized our server should support it.

Using third-party crates out of our control also means there would be an indefinite timeline for fixing it, or require more work on your part to fork and patch things.

yes, we are aware of this tradeoff. the upside is that we don't have to maintain all of it ourselves if there already exist at least three such libraries on crates.io.

we also discussed migrating to cargo-manifest instead, since the cargo_toml maintainer does not appear particularly responsive to pull requests. I have a WIP branch for that, but haven't finished it yet due to vacation.

If we indeed see that the maintainers of these crates are indeed not responsive to planned changes we can always fork, but until then I'd rather not increase our maintenance burden unnecessarily.

I think a long-term goal is possibly for the cargo team to provide a more stable set of serde-based structures, and then we could have a process to immediately push those changes to crates.io and get it updated here.

👍

Are there alternatives, like possibly using untyped parsers like toml::Value and only inspect the fields that you are interested in? Are there other options that we could take in the meantime?

that again would mean maintaining yet another implementation, while these third party crates would have to eventually be updated anyway to keep working.

@epage
Copy link

epage commented Sep 16, 2024

As another failure mode of this situation, cargo-manifest advertises itself as being used by crates.io
https://docs.rs/cargo-manifest/latest/cargo_manifest/

which comes across as an endorsement from the Project
https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/Get.20manifest.20of.20crates.2Eio.20crate.20with.20local.20cargo.20cache

when the Cargo team has specifically not endorsed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-backend ⚙️ C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants