Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement marker trees using algebraic decision diagrams #5898

Merged
merged 13 commits into from
Aug 9, 2024

Conversation

ibraheemdev
Copy link
Member

@ibraheemdev ibraheemdev commented Aug 7, 2024

Summary

This PR rewrites the MarkerTree type to use algebraic decision diagrams (ADD). This has many benefits:

  • The diagram is canonical for a given marker function. It is impossible to create two functionally equivalent marker trees that don't refer to the same underlying ADD. This also means that any trivially true or unsatisfiable markers are represented by the same constants.
  • The diagram can handle complex operations (conjunction/disjunction) in polynomial time, as well as constant-time negation.
  • The diagram can be converted to a simplified DNF form for user-facing output.

The new representation gives us a lot more confidence in our marker operations and simplification, which is proving to be very important (see #5733 and #5163).

Unfortunately, it is not easy to split this PR into multiple commits because it is a large rewrite of the marker module. I'd suggest reading through the marker/algebra.rs, marker/simplify.rs, and marker/tree.rs files for the new implementation, as well as the updated snapshots to verify how the new simplification rules work in practice. However, a few other things were changed:

There is still some remaining work here that I decided to leave for now for the sake of unblocking some of the related work on the resolver.

  • We still use Option<MarkerTree> throughout uv, which is unnecessary now that MarkerTree::TRUE is canonical.
  • The MarkerTree type is now interned globally and can potentially implement Copy. However, it's unclear if we want to add more information to marker trees that would make it !Copy. For example, we may wish to attach extra and requires-python environment information to avoid simplifying after construction.
  • We don't currently combine python_full_version and python_version markers.
  • I also have not spent too much time investigating performance and there is probably some low-hanging fruit. Many of the test cases I did run actually saw large performance improvements due to the markers being simplified internally, reducing the stress on the old normalize routine, especially for the extremely large markers seen in transformers and other projects.

Resolves #5660, #5179.

@ibraheemdev ibraheemdev added the lock Related to universal resolution and locking label Aug 7, 2024
@ibraheemdev ibraheemdev force-pushed the ibraheem/canonical-markers branch 5 times, most recently from f86e792 to 2b5c772 Compare August 8, 2024 17:00
Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is next-level awesome. Solves a huge problem we've been having with marker expressions and solves it well.

What do you think about making MarkerTree::and and MarkerTree::or smart constructors instead of routines that mutate trees in place? Or perhaps that's for another PR, in order to keep the amount of code changed outside of pep508_rs smaller.

crates/pep508-rs/Cargo.toml Outdated Show resolved Hide resolved
//! (> '3.7') -> os_name:
//! (> 'Linux') -> FALSE
//! (== 'Linux') -> TRUE
//! (< 'Linux') -> FALSE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this isn't (!= 'Linux') -> FALSE and (== 'Linux') -> TRUE? Why the inequalities?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I had to guess, I'd say it's because the inequalities are more general and easier to merge with other expressions?

Copy link
Member Author

@ibraheemdev ibraheemdev Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. Even if we did merge the two FALSE children, we would have to represent the != 'Linux' as (< 'Linux') or (> 'Linux') with pubgrub ranges. We could merge them to reduce the size of the tree, but for now this is simpler.

crates/pep508-rs/src/marker/algebra.rs Outdated Show resolved Hide resolved
crates/pep508-rs/src/marker/algebra.rs Show resolved Hide resolved
crates/pep508-rs/src/marker/algebra.rs Outdated Show resolved Hide resolved
crates/pep508-rs/src/marker/algebra.rs Show resolved Hide resolved
crates/pep508-rs/src/marker/algebra.rs Outdated Show resolved Hide resolved
edges
}

/// Merge two [`Edges`], applying the given function to all disjoint, intersecting edges.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some trouble grok'ing this routine. The map2 name I think also confused me a bit. Perhaps expand the comment here some?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some more general documentation. Is it clearer now?

crates/pep508-rs/src/marker/simplify.rs Show resolved Hide resolved
/// expression.
///
/// If the marker is `true`, this method will return `None`.
/// If the marker is `false`, the marker is represented as the normalized expression, `python_version < '0'`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was a nice API decision. Prevents one from emitting an "invalid" marker, but still makes it easy to print a debug repr.

@ibraheemdev ibraheemdev force-pushed the ibraheem/canonical-markers branch 2 times, most recently from 5b18a9c to db25cf1 Compare August 8, 2024 19:39
@ibraheemdev
Copy link
Member Author

ibraheemdev commented Aug 8, 2024

What do you think about making MarkerTree::and and MarkerTree::or smart constructors instead of routines that mutate trees in place? Or perhaps that's for another PR, in order to keep the amount of code changed outside of pep508_rs smaller.

Yes, I also want to remove uses of Option<MarkerTree> now that MarkerTree::TRUE exists. Will do in a follow-up PR.

Copy link
Member

@konstin konstin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ADDs are much better than my naive DNF idea. All the maths looks well-implemented.

crates/pep508-rs/src/marker/algebra.rs Outdated Show resolved Hide resolved
crates/pep508-rs/src/marker/algebra.rs Outdated Show resolved Hide resolved
crates/pep508-rs/src/marker/algebra.rs Show resolved Hide resolved
/// For non-boolean variables, this is more complex. See `apply_ranges` for details.
///
/// Note that the LHS and RHS must be of the same [`Edges`] variant.
fn apply(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ever used with something other than AND?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but it's possible that we may want to reuse this for other operations in the future (such as XOR or IFF).

crates/pep508-rs/src/marker/algebra.rs Show resolved Hide resolved
crates/pep508-rs/src/marker/simplify.rs Show resolved Hide resolved
"python_full_version <= '3.10' and python_version < '3.10'",
"python_full_version <= '3.10' and python_version < '3.10' and python_version > '3.10'",
"python_full_version <= '3.10' and python_version > '3.10'",
"python_full_version > '3.10' and python_version > '3.10'",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@@ -6471,23 +6471,23 @@ fn no_strip_markers_multiple_markers() -> Result<()> {
----- stdout -----
# This file was autogenerated by uv via the following command:
# uv pip compile --cache-dir [CACHE_DIR] requirements.in --no-strip-markers --python-platform windows
attrs==23.2.0 ; python_version > '3.11' or sys_platform == 'win32'
attrs==23.2.0 ; sys_platform == 'win32' or python_version > '3.11'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the ordering defined now, is it the ordering of the marker variable flowing down? It looks like in conjunctions and disjunctions the ordering is now different. Not high priority as long as the ordering is deterministic and robust, but might be easy enough to justify changing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the ordering of the markers flowing down yeah. In this case the original expression would have been (python_version <= '3.11' and sys_platform == 'win32') or python_version > '3.11 before simplifying, which is why the order appears reversed despite python_version being the highest order variable.

I guess we could sort these the same way we used to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, doing a final sort on simplified markers is preferential for users not seeing a change in equivelent output if something about the resolution itself changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resolution is still deterministic, but sorting may result in a more syntactically consistent order.

Copy link
Contributor

@notatallshaw notatallshaw Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant in a scenario where the input changed, (e.g. the user changed the order, or added requirements that were previously transative dependencies, or new versions of packages changed the requirements in some similiar sense), but the output was equivelent (e.g. same packages, same versions, same markers), but perhaps the marker order could be affected.

Copy link
Member Author

@ibraheemdev ibraheemdev Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should never happen. The decision diagram is canonical for any functionally equivalent marker and the simplification routine is deterministic. All sorting could do is give us a different canonical output for a given marker, but it shouldn't affect whether or not it is canonical.


// Simplify all marker expressions based on the requires-python bound.
//
// This is necessary to ensure the a `Lock` deserialized from a lockfile compares
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between fresh and from lock? We always know the current requires-python in the resolver (we determine it on fork construction), so marker construction should already have this information.

Also note that requires-python is the bound for the entire lockfile, but individual forks can have stricter bounds, e.g. say overall is >=3.7, but there's one fork >=3.7,<3.8 and one fork >=3.8

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is emulating the call to simplify_python_versions when constructing the ResolutionGraph.

I agree that marker construction should always have this information, I would like to avoid simplifying requires-python and extras in a separate step.

@@ -106,7 +106,8 @@ impl<'a, Context: BuildContext> SourceTreeResolver<'a, Context> {
origin: Some(origin.clone()),
marker: requirement
.marker
.and_then(|marker| marker.simplify_extras(extras)),
.map(|marker| marker.simplify_extras(extras))
.filter(|marker| !marker.is_true()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is_false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think !marker.is_true() does not imply marker.is_false(). I do like the succinctness of the method names, but they do suggest a mutual exclusivity that isn't there. Perhaps is_always_true and is_always_false are better names? Not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, totally missed that

@konstin
Copy link
Member

konstin commented Aug 9, 2024

Could you also a top level doc comment sentence that we implement all or as and by demorgan and cheap negation?

@ibraheemdev ibraheemdev force-pushed the ibraheem/canonical-markers branch 2 times, most recently from 8ed4dab to 8cb5abe Compare August 9, 2024 15:11
@ibraheemdev ibraheemdev merged commit ffd18cc into main Aug 9, 2024
57 checks passed
@ibraheemdev ibraheemdev deleted the ibraheem/canonical-markers branch August 9, 2024 17:40
zanieb pushed a commit that referenced this pull request Aug 9, 2024
This PR rewrites the `MarkerTree` type to use algebraic decision
diagrams (ADD). This has many benefits:
- The diagram is canonical for a given marker function. It is impossible
to create two functionally equivalent marker trees that don't refer to
the same underlying ADD. This also means that any trivially true or
unsatisfiable markers are represented by the same constants.
- The diagram can handle complex operations (conjunction/disjunction) in
polynomial time, as well as constant-time negation.
- The diagram can be converted to a simplified DNF form for user-facing
output.

The new representation gives us a lot more confidence in our marker
operations and simplification, which is proving to be very important
(see #5733 and
#5163).

Unfortunately, it is not easy to split this PR into multiple commits
because it is a large rewrite of the `marker` module. I'd suggest
reading through the `marker/algebra.rs`, `marker/simplify.rs`, and
`marker/tree.rs` files for the new implementation, as well as the
updated snapshots to verify how the new simplification rules work in
practice. However, a few other things were changed:
- [We now use release-only comparisons for `python_full_version`, where
we previously only did for
`python_version`](https://github.com/astral-sh/uv/blob/ibraheem/canonical-markers/crates/pep508-rs/src/marker/algebra.rs#L522).
I'm unsure how marker operations should work in the presence of
pre-release versions if we decide that this is incorrect.
- [Meaningless marker expressions are now
ignored](https://github.com/astral-sh/uv/blob/ibraheem/canonical-markers/crates/pep508-rs/src/marker/parse.rs#L502).
This means that a marker such as `'x' == 'x'` will always evaluate to
`true` (as if the expression did not exist), whereas we previously
treated this as always `false`. It's negation however, remains `false`.
- [Unsatisfiable markers are written as `python_version <
'0'`](https://github.com/astral-sh/uv/blob/ibraheem/canonical-markers/crates/pep508-rs/src/marker/tree.rs#L1329).
- The `PubGrubSpecifier` type has been moved to the new `uv-pubgrub`
crate, shared by `pep508-rs` and `uv-resolver`. `pep508-rs` also depends
on the `pubgrub` crate for the `Range` type, we probably want to move
`pubgrub::Range` into a separate crate to break this, but I don't think
that should block this PR (cc @konstin).

There is still some remaining work here that I decided to leave for now
for the sake of unblocking some of the related work on the resolver.
- We still use `Option<MarkerTree>` throughout uv, which is unnecessary
now that `MarkerTree::TRUE` is canonical.
- The `MarkerTree` type is now interned globally and can potentially
implement `Copy`. However, it's unclear if we want to add more
information to marker trees that would make it `!Copy`. For example, we
may wish to attach extra and requires-python environment information to
avoid simplifying after construction.
- We don't currently combine `python_full_version` and `python_version`
markers.
- I also have not spent too much time investigating performance and
there is probably some low-hanging fruit. Many of the test cases I did
run actually saw large performance improvements due to the markers being
simplified internally, reducing the stress on the old `normalize`
routine, especially for the extremely large markers seen in
`transformers` and other projects.

Resolves #5660,
#5179.
@charliermarsh
Copy link
Member

Amazing work.

We don't currently combine python_full_version and python_version markers.

Does this mean we no longer compute unions, etc., to combine or markers of versions? (Or we don't combine python_version with python_full_version?)

@ibraheemdev
Copy link
Member Author

We compute unions of python_version and python_full_version independently, but we don't combine them.

@charliermarsh
Copy link
Member

Makes sense thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lock Related to universal resolution and locking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

uv pip compile --universal markers not sufficiently simplifying
6 participants