gps for Contributors
Hi! We're thrilled you're thinking about contributing to gps. Feedback on these docs, or the library itself, is always welcome! 🎉 🎉

These docs provide higher-level insight into how gps works, with the goal of making it easier to wrap your head around it all. CONTRIBUTING.md has more of the logistical info. Please also note that gps abides by a code of conduct, and is MIT-licensed.
WARNING: these docs still need a lot of love - we're focusing more on the documentation for implementors right now. Familiarity with those docs is mostly assumed here anyway, as it's hard to make sense of why gps's internals work as they do without understanding its intended inputs and outputs.
Constraint solving as an approach to dependency management is one of those rare things that's nice in theory, and can be even nicer in practice. Constraint solvers have a general form that suggests a natural architecture for the software: there's the solver itself, which is basically solving a math problem, versus the parts responsible for learning about the world and turning it into the math-y symbols the solver expects. This separation of responsibilities helps ensure the tail doesn't end up wagging the dog, even as we add progressively more complex checks to the solver. And indeed, this is gps's basic architecture:
The solver explores the possible solutions (i.e., which dependencies to use at which versions), ensuring at each step that all known constraints are met. As it explores, it asks a `SourceManager` for the information about the world (e.g., what versions of a dep exist, what deps and constraints that dep has, etc.) that it needs to continue exploration. The solver itself never touches disk or network; it works entirely on abstractions - names, versions, package names and import paths as strings, and so on.
As a high-level representation of gps, this picture is mostly complete. It's missing one thing, though - gps also uses an adapter between the solver and the `SourceManager`, called the `bridge`. In fact, the solver never actually talks to the `SourceManager` directly; it only does so through the `bridge`.
For the most part, the `bridge` just forwards calls from the solver to the `SourceManager`. The difference is that the `bridge` is focused solely on this particular solver run, with its current inputs: the active project, at some path on disk, with whatever happens to be in the `vendor/` directory, whether or not the working copy has changes in it. The `SourceManager`, on the other hand, deals strictly with unmodified, upstream code, originally sourced from some remote repository.
The `bridge` transparently fields calls made from the solver that deal with the current/root project, while passing along other requests to the `SourceManager`. This division not only helps keep the `SourceManager`'s implementation cleaner, but it also paves the way for two bigger-picture goals:

- The expensive metadata computed by the `SourceManager` can be aggressively cached, reused across runs, and potentially even across different tool implementations.
- Because the `SourceManager` deals only with upstream information, we could eventually swap in an implementation that uses a central server.
Together, the solver, the `bridge`, and the `SourceManager` are gps's three main components; everything that the engine does can be understood in relation to them.
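To make that routing concrete, here is a minimal sketch of the idea, using hypothetical, heavily simplified types rather than gps's real API: the `bridge` answers questions about the root project from local state, and forwards everything else to the `SourceManager`.

```go
// Illustrative only: hypothetical, heavily simplified types, not gps's real API.
type (
	ProjectName string
	Version     string
	// PackageTree stands in for the result of statically analyzing a project's tree.
	PackageTree struct{ ImportRoot string }
)

// sourceManager stands in for the real SourceManager: it deals only with
// unmodified upstream code.
type sourceManager interface {
	ListPackages(ProjectName, Version) (PackageTree, error)
}

// bridge scopes a single solve run: it knows which project is the root, and
// holds the analysis of that project's working copy as it sits on disk.
type bridge struct {
	root     ProjectName
	rootTree PackageTree
	sm       sourceManager
}

// ListPackages answers questions about the root project from local state, and
// forwards everything else, untouched, to the SourceManager.
func (b *bridge) ListPackages(name ProjectName, v Version) (PackageTree, error) {
	if name == b.root {
		// The root project: whatever is on disk right now, modified or not.
		return b.rootTree, nil
	}
	// Any other project: pristine upstream information.
	return b.sm.ListPackages(name, v)
}
```

The real `bridge` applies the same routing across the rest of the calls the solver makes; the principle doesn't change.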
The solver is an incremental algorithm. The formal definition doesn't really matter; just keep two things in mind:

- the solver focuses completely on one node (project or package set) at a time
- its current state always holds a viable solution
Probably the easiest way to illustrate the solving algorithm's behavior is by working backwards from the constraints it's trying to satisfy. For that, we need a simple depgraph:

There's our `Root` project's two dependencies, `A` and `B`, and a dependency `C` that's shared between `A` and `B`. (We'll return later to what exactly this information is, and where it comes from.) The solver's goal is to pick a version for each of the dependencies that's agreeable to every constraint producer in the depgraph.
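The example is small enough to write out as plain data. The representation below is a throwaway one invented for illustration (gps models all of this with richer types); the versions and constraints are the ones used in the walkthrough further down:

```go
// Available versions for each project, listed in the order the walkthrough
// attempts them (newest first).
var versions = map[string][]string{
	"A": {"1.1.1", "1.1.0"},
	"B": {"1.0.0"},
	"C": {"2.0.1", "2.0.0"},
}

// What each atom (project@version) requires of other projects. The root's own
// constraints on A and B aren't spelled out in the example, so they're treated
// as "anything" here.
var constraints = map[string]map[string]string{
	"Root":    {"A": "*", "B": "*"},
	"A@1.1.1": {"C": "=2.0.1"},
	"A@1.1.0": {"C": "=2.0.0"},
	"B@1.0.0": {"C": "=2.0.0"},
}
```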
Now, while the graph above suggests we "know" that `A`, `B`, and `C` are the overall set of requirements, the solver doesn't actually know that when it starts. It only knows about the direct dependencies, `A` and `B`; `C` is discovered through exploration.
And exploration, in fact, is the solving process. Beginning from only the root project, the solve loop goes roughly like this (a toy code rendition follows the list):
- Pick a dependency that is required, but for which a version has not yet been selected
- Pick a version of that dependency (we call ident + version an *atom*) from a queue of available versions
- Validate that that atom meets all current constraints
  - This certainly means passing constraints set by other selected atoms, but can also include things like ensuring no import cycles would be created
- If validation succeeds, select that atom, and proceed to the next required, but not-yet-selected dep
  - Selecting the atom also means adding any new dependencies it introduces to the unselected queue
- If it does not, try validation with the next version in the version queue
  - If we're out of versions to try, backtrack to a previous failure and try new versions of that, in hopes it might open new possibilities
  - If backtracking fails, then solving failed
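Here is that loop rendered as toy code. It's a sketch under simplifying assumptions, not gps's implementation: the real solver is iterative, keeping an explicit stack of selections and per-project version queues, and its constraints are far richer than the exact-version strings used here - but the pick/validate/select/backtrack shape is the same.

```go
type atom struct{ name, version string }

// graph is a hypothetical, pre-baked view of the dependency world; the real
// solver discovers this information lazily, via the SourceManager, as it goes.
type graph struct {
	versions map[string][]string        // project -> available versions, in preference order
	deps     map[atom]map[string]string // atom -> exact versions it requires of other projects
}

// valid checks a candidate atom against the current selection: every
// already-selected atom's constraint on name must allow v, and the candidate's
// own requirements must not contradict versions that are already selected.
func (g graph) valid(name, v string, selected map[string]string) bool {
	for p, pv := range selected {
		if want, ok := g.deps[atom{p, pv}][name]; ok && want != v {
			return false
		}
	}
	for d, want := range g.deps[atom{name, v}] {
		if got, ok := selected[d]; ok && got != want {
			return false
		}
	}
	return true
}

// solve works through the unselected queue, trying each project's versions in
// order and backtracking (by returning false) when a branch can't be completed.
func (g graph) solve(queue []string, selected map[string]string) bool {
	for len(queue) > 0 {
		if _, done := selected[queue[0]]; !done {
			break
		}
		queue = queue[1:] // already selected; a project can be queued more than once
	}
	if len(queue) == 0 {
		return true // nothing left unselected: selected is a complete solution
	}
	name, rest := queue[0], queue[1:]
	for _, v := range g.versions[name] {
		if !g.valid(name, v, selected) {
			continue // validation failed; try the next version in the queue
		}
		selected[name] = v
		// Selecting the atom adds the dependencies it introduces to the queue.
		next := append([]string(nil), rest...)
		for d := range g.deps[atom{name, v}] {
			next = append(next, d)
		}
		if g.solve(next, selected) {
			return true
		}
		delete(selected, name) // backtrack: unselect, then try the next version
	}
	return false // out of versions here; the caller will backtrack further
}
```

With the example graph's versions and constraints plugged into a `graph`, `g.solve([]string{"A", "B"}, map[string]string{})` finishes with `A 1.1.0`, `B 1.0.0`, and `C 2.0.0` selected - the same solution the walkthrough below arrives at, though the toy won't necessarily attempt atoms in exactly the same order.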
That's not the easiest thing to follow in the abstract, so here's an example. (At some point, we hope to replace this text with a fancy iterative visualization!)
- Start with dependencies on `A` and `B` in the unselected queue
- Pick `A` to explore; discover that it has tag versions `1.1.0` and `1.1.1` available
- Attempt atom `A@1.1.1`
  - It passes validation; select it, and discover dependency `C` with constraint `=2.0.1`
  - Add `C` to the unselected queue
- Pick `C` to explore; discover that it has tag versions `2.0.0` and `2.0.1` available
- Attempt atom `C@2.0.1`
  - That atom meets `A`'s constraint, so it passes validation; select it
- Pick `B` to explore (it's all that's left in the unselected queue); discover that it has tag version `1.0.0`
- Attempt atom `B@1.0.0`
  - Validation fails: the atom has a dependency on `C@=2.0.0`, which is disjoint with `A@1.1.1`'s dependency on `C@=2.0.1`
  - No more versions of `B` to try; begin backtracking
- `C` was last selected; attempt its next version, `2.0.0`
  - Validation fails: `C@2.0.0` also does not meet `A@1.1.1`'s requirement of `C@=2.0.1`
  - No more versions of `C` to try; backtrack again
- `A` was last selected; attempt its next version, `1.1.0`
  - Validation succeeds; select it, and discover dependency `C@=2.0.0`
  - Add `C` to the unselected queue
- Pick `C` to explore; it still has `2.0.0` and `2.0.1` available
- Attempt atom `C@2.0.1`
  - Validation fails: `A@1.1.0` wants `C@=2.0.0`
- Attempt atom `C@2.0.0`
  - Validation succeeds; select `C@2.0.0`
- Pick `B` to explore; it still has just `1.0.0`
- Attempt atom `B@1.0.0`
  - Validation succeeds; `B@1.0.0` wants `C@=2.0.0`, which is what we have already. Select `B@1.0.0`
- Unselected queue is empty - we found a solution!
  - `A@1.1.0`
  - `B@1.0.0`
  - `C@2.0.0`
The important thing to note here is that this is a tightly-constrained iterative algorithm: at each step, we're only checking whether selecting a given atom would result in a valid overall state with respect to what we currently know. We don't go traipsing off along transitive paths, because we know we'll get there eventually. For example, we selected `C@2.0.1` even though `B` couldn't have accepted it, because we hadn't looked at `B` yet. This is slightly wasteful (the real algorithm has some means of minimizing wasted work), but it's also fully general.
There are some crucial aspects of the solver this example overlooks - where all the dependency information comes from in the first place, how lock data factor in, initial solver inputs, package-level vs. project-level analysis, and other non-version types of constraints. Let's tackle each of these in turn.
It's all well and good to talk about dependencies and versions in the abstract, but we need to understand where that information comes from. The implementor docs give an outsider's perspective on this, which is worth reviewing, but contributors need more detail.
First, we have to break down the different types of dependency-related information that gps trafficks in.
- `Manifest`: Manifests are primarily for providing version constraints on the set of imports expressed in packages at or below a `ProjectRoot`.
- `Constraint`: a (version) constraint is a rule that can decide whether a given version is acceptable. Constraints can be combined via their `Intersect` method.
- `Version`: Versions are labels that identify a particular snapshot of time in a code repository. gps has several types - revisions, semver tags, non-semver tags, and branches - which have different properties, and varying equivalency relationships. More on them later, but do note that all `Version`s are also `Constraint`s, but not vice-versa.
- Imports: Good ol' standard Go import paths, discovered by statically analyzing Go code with `go/build` and `go/parser`.
- Identifiers: Identifiers are related to import paths, but slightly different:
  - An import path points to a package. These names point only to repository roots, called `ProjectRoot`s (which may or may not also be a valid Go package).
  - They allow tools to explicitly state the network location from which a local import path should be sourced.
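The relationship between constraints and versions is worth spelling out, since it matters throughout the solver. Roughly, and only as an illustrative paraphrase rather than gps's exact interfaces:

```go
// A Constraint decides whether a given version is acceptable, and can be
// combined with another constraint.
type Constraint interface {
	// Matches reports whether v satisfies this constraint.
	Matches(v Version) bool
	// Intersect combines two constraints into one admitting only versions
	// acceptable to both; the result may admit nothing at all.
	Intersect(Constraint) Constraint
}

// Every Version is also a Constraint - a version admits exactly the versions
// equivalent to itself - but most Constraints are not Versions.
type Version interface {
	Constraint
	String() string
}
```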
Of all of these, gps extracts only imports and versions itself, through the `SourceManager`. Manifests, names, and constraints all come from the `ProjectAnalyzer`, which must be injected into the `SourceManager` by the tool implementing gps. Again, the implementor docs have more details on `ProjectAnalyzer`.
This division of responsibility is intentional. gps deals with unambiguous information: Go's ecosystem is VCS-driven, so versions come from version control; `go/parser` provides standard, clear answers about Go source code, including imports. The other issues are left up to the implementing tool.
More concretely, most of the information the solver operates on comes from one of three `SourceManager` methods:
- `ListVersions(ProjectName) ([]Version, error)`: This method takes a name - an import path corresponding to a repository root - and returns a slice of versions representing the current set of versions - branches and tags - offered by the upstream repository.
- `ListPackages(ProjectName, Version) (PackageTree, error)`: given a name AND a version, this performs static analysis on the tree rooted at that import path, at that version, and returns the result.
- `GetManifestAndLock(ProjectName, Version) (Manifest, Lock, error)`: given a name and a version, this checks the right version out to disk, then asks the injected `ProjectAnalyzer` to analyze that tree and provide us with a manifest and/or lock.
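Gathered in one place, and reusing the `Constraint` and `Version` interfaces sketched earlier, those three methods look roughly like this (the stand-in types below are placeholders invented for the snippet; the real declarations carry much more information, and these three calls are not the `SourceManager`'s whole surface):

```go
// Stand-in types so the snippet hangs together on its own.
type (
	ProjectName string
	PackageTree struct{ ImportRoot string }
	Manifest    struct{ Constraints map[ProjectName]Constraint }
	Lock        struct{ Pins map[ProjectName]Version }
)

// The three methods the solver leans on most, gathered as an interface.
type SourceManager interface {
	ListVersions(ProjectName) ([]Version, error)
	ListPackages(ProjectName, Version) (PackageTree, error)
	GetManifestAndLock(ProjectName, Version) (Manifest, Lock, error)
}
```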
For the user, a lock file is the source of their build's stability and reproducibility: it describes exactly which versions (revisions, ideally) should be used for each dependency. Within the solver, however, locks don't play much of a role. Mostly, we care about `Manifest`s, as that's where constraint information comes from, whereas `Lock`s are quite literally the solution we're looking to produce.
But there's one exception. We do care about the root project's lock file; unless explicitly instructed otherwise, we want to produce a solution that's as close to what's in that lock as possible.
If it's not already clear, ordering is exceedingly important to the solver. The solution we end up picking is hugely dependent on the order in which we attempt the various available versions for each of our dependencies. If the first atom we visit works, then we'll never need to try another one. That basic fact is exactly what we exploit in order to retain the version specified in a lock file, if at all possible.
When we create a version queue for a project that's being visited for the first time, we check to see if the root's lock file (if present) names an atom for that project. If it does, and the provided version is acceptable with respect to current constraints, then we push it onto the front of the version queue. Doing so is sufficient to guarantee that the locked version will be used unless something else directly conflicts with it.
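That queue-building step is small enough to sketch. The helper below is hypothetical (the real code deals in version queues and `Constraint` types, not plain strings), but it captures the rule: the locked version, if still acceptable, goes to the front.

```go
// newVersionQueue builds the list of versions to attempt for a project being
// visited for the first time. If the root lock pins a version for it and that
// version is still acceptable under current constraints, it goes to the front,
// so it is tried - and, absent conflicts, kept - before anything else.
func newVersionQueue(project string, upstream []string, rootLock map[string]string,
	acceptable func(project, version string) bool) []string {

	queue := make([]string, 0, len(upstream)+1)
	locked, haveLocked := rootLock[project]
	preferLocked := haveLocked && acceptable(project, locked)
	if preferLocked {
		queue = append(queue, locked)
	}
	for _, v := range upstream {
		if preferLocked && v == locked {
			continue // already placed at the front of the queue
		}
		queue = append(queue, v)
	}
	return queue
}
```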
...almost sufficient, anyway. It's possible that, in search of a solution, the solver might backtrack to the locked atom and search for a new solution by changing it. Even if that ends up successfully finding a solution, there may have been another possible solution that we could have found by working through a different, unlocked project instead.
To prevent this from happening, we make sure that projects present in the root lock are attempted before almost any other project. If they're attempted earlier, then they're selected earlier, and because backtracking moves backwards through the stack of selected versions, we can know that we won't make it back to a locked project until after we've exhausted the other options. This guarantees that if some locked versions changed in pursuit of a solution, then no solution was possible without at least one locked version changing. Once again, ordering is exceedingly important to the solver.
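That ordering rule can be sketched the same way, again with a hypothetical helper over plain strings:

```go
// orderUnselected returns the unselected queue with projects named in the root
// lock moved to the front, preserving relative order within each group. Locked
// projects therefore get selected early, and - since backtracking walks the
// selection stack from the most recent choice backwards - they are revisited
// only after every later, unlocked choice has been exhausted.
func orderUnselected(unselected []string, inRootLock map[string]bool) []string {
	ordered := make([]string, 0, len(unselected))
	for _, p := range unselected {
		if inRootLock[p] {
			ordered = append(ordered, p)
		}
	}
	for _, p := range unselected {
		if !inRootLock[p] {
			ordered = append(ordered, p)
		}
	}
	return ordered
}
```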