Skip to content
This repository has been archived by the owner on Feb 3, 2018. It is now read-only.

gps for Contributors

sam boyer edited this page Jul 22, 2016 · 4 revisions

Hi! We're thrilled you're thinking about contributing to gps. Feedback on these docs, or the library itself, is always welcome! 🎉 🎉

These docs provide higher-level insight into how gps works, with the goal of making it easier to wrap your head around it all. CONTRIBUTING.md has more of the logistical info. Please also note that gps abides by a code of conduct, and is MIT-licensed.

WARNING: these docs still need a lot of love - we're focusing more on the documentation for implementors right now. Familiarity with those is mostly assumed anyway in these documents, as it's hard to make sense of why gps's internals work as they do without understanding its intended inputs and outputs.

High-level Architecture

Constraint solving as an approach to dependency management is one of those rare things that's nice in theory, and can be even nicer in practice. Constraint solvers have a general form that suggest a natural architecture for the software: there's the solver itself, which is basically solving a math problem, versus the parts responsible for learning about the world and turning it into the math-y symbols the solver expects. This separation of responsibilities helps ensure the tail doesn't end up wagging the dog, even as we add progressively more complex checks to the solver. And indeed, this is gps's basic architecture:

Mostest Basic-est Arch

The solver explores the possible solutions (i.e., which dependencies to use at which versions), ensuring at each step that all known constraints are met. As it explores, it asks a SourceManager for information about the world (e.g., what versions of a dep exist, what deps and constraints does that dep have, etc.) that it needs to continue exploration. solver itself never touches disk or network. It works entirely on abstractions - names, versions, package names and import paths as strings, and others.

basicest-division

As a high-level representation of gps, this picture is mostly complete. It's missing one thing, though - gps also uses an adapter between solver and SourceManager, called the bridge. In fact, solver never actually talks to SourceManager directly; it only does so through the bridge.

basic-arch

For the most part, the bridge just forwards calls from solver to SourceManager. The difference is that the bridge is focused solely on this particular solver run, with its current inputs - the active project, at some path on disk, with whatever happens to be in the vendor/ directory, whether or not the working copy has changes in it. SourceManager, on the other hand, deals strictly with unmodified, upstream code, originally sourced from some remote repository.

basic-sources

The bridge transparently fields calls made from solver that deal with the current/root project, while passing along other requests to the SourceManager. This division not only helps keep the SourceManager's implementation cleaner, but it also paves the way for two bigger-picture goals:

  1. The expensive metadata computed by SourceManager can be aggressively cached, reused across runs, and potentially even across different tool implementations.
  2. Because the SourceManager deals only with upstream information, we could eventually swap in an implementation that uses a central server.

Together, solver, bridge, and SourceManager are gps's three main components; everything that the engine does can be understood in relation to them.

The Solving Algorithm

The solver is an incremental algorithm. The formal definition doesn't really matter; just keep two things in mind:

  • the solver focuses completely on one node (project or package set) at a time
  • its current state always holds a viable solution

Probably the easiest way to illustrate the solving algorithm's behavior is by working backwards from the constraints it's trying to satisfy. To illustrate that, we need a simple depgraph:

There's our Root project's two dependencies, A, B, and a dependency C that's shared between A and B. (We'll return later to what exactly this information is, and where it comes from). The solver's goal is to pick a version for each of the dependencies that's agreeable for every constraint producer in the depgraph.

Now, while that image suggests we "know" that A, B, and C are the overall set of requirements, the solver doesn't actually know that when it starts. It only knows about direct dependencies, A and B; C is discovered through exploration. And exploration, in fact, is the solving process. Beginning from only the root project, the solve loop goes roughly like this:

  1. Pick a dependency that is required, but for which a version has not yet been selected
  2. Pick a version of that dependency (we call ident + version an atom) from a queue of available versions
  3. Validate that that atom meets all current constraints
    • This certainly means passing constraints set by other selected atoms, but can also include things like ensuring no import cycles would be created
  4. If validation succeeds, select that atom, and proceed to the next required, but not-yet-selected dep
  5. If it does not, try validation with the next version in the version queue
  6. If we're out of versions to try, backtrack to a previous failure and try new versions of that, in hopes it might open new possibilities
  7. If backtracking fails, then solving failed

That's not the easiest thing to follow in the abstract, so here's an example. (At some point, we hope to replace this text with a fancy iterative visualization!)

  1. Start with dependencies on A and B in unselected queue
  2. Pick A to explore; discover that it has tag versions 1.1.0 and 1.1.1 available
  3. Attempt atom A@1.1.1
  4. It passes validation; select it, and discover dependency C with constraint =2.0.1
  5. Add C to the unselected queue
  6. Pick C to explore; discover that it has tag versions 2.0.0 and 2.0.1 available
  7. Attempt atom C@2.0.1
  8. That atom meets A's constraint, so it passes validation; select it
  9. Pick B to explore (it's all that's left in the unselected queue); discover that it has tag versions 1.0.0
  10. Attempt atom B@1.0.0
  11. Validation fails: atom has a dependency on C@=2.0.0, which is disjoint with A@1.1.1's dependency on C@=2.0.1
  12. No more versions of B to try; begin backtracking
  13. C was last selected; attempt its next version, 2.0.0
  14. Validation fails: C@2.0.0 also does not meet A@1.1.1's requirement of C@=2.0.1
  15. No more versions of C to try; backtrack again
  16. A was last selected; attempts its next version, 1.1.0
  17. Validation succeeds; select it, and discover dependency C@=2.0.0
  18. Add C to the unselected queue
  19. Pick C to explore; it still has 2.0.0 and 2.0.1 available
  20. Attempt atom C@2.0.1
  21. Validation fails: A@1.1.0 wants C@=2.0.0
  22. Attempt atom C@2.0.0
  23. Validation succeeds; select C@2.0.0
  24. Pick B to explore; it still has just 1.0.0
  25. Attempt atom B@1.0.0
  26. Validation succeeds; B@1.0.0 wants C@=2.0.0, which is what we have already. Select B@1.0.0
  27. Unselected queue is empty - we found a solution!
    • A@1.1.0
    • B@1.0.0
    • C@2.0.0

The important thing to note here is that this is a tightly-constrained iterative algorithm: at each step, we're only checking selecting a given atom would result in a valid overall state with respect to what we currently know. We don't go traipsing off along transitive paths, because we know we'll get there eventually. For example, we selected C@2.0.1 even though B couldn't have accepted it, because we hadn't looked at B yet. This is slightly wasteful (the real algorithm has some means for minimizing wasted work), but it's also fully general.

There are some crucial aspects of the solver this example overlooks - where all the dependency information comes from in the first place, how lock data factor in, initial solver inputs, package-level vs. project-level analysis, and other non-version types of constraints. Let's tackle each of these in turn.

Whence Solver Data?

It's all well and good to talk about dependencies and versions in the abstract, but we need to understand where that information comes from. The implementor docs give an outsider's perspective on this, which is worth reviewing, but contributors need more detail.

First, we have to break down the different types of dependency-related information that gps trafficks in.

  • Manifest: Manifests are primarily for providing version constraints on the set of imports expressed in packages at or below a ProjectRoot.
  • Constraint: a (version) constraint is a rule that can decide whether a given version is acceptable. Constraints can be combined via their Intersect method.
  • Version: Versions are labels that identify a particular snapshot of time in a code repository. gps has several types - revisions, semver tags, non-semver tags, and branches - which have different properties, and varying equivalency relationships. More on them later, but do note that all Versions are also Constraints, but not vice-versa.
  • Imports: Good ol' standard Go import paths, discovered by statically analyzing Go code with go/build and go/parser.
  • Identifiers: Identifiers are related to import paths, but slightly different:
    • An import path points to a package. These names point only to repository roots, called ProjectRoots (which may or may not also be a valid Go package).
    • They allow tools to explicitly state the network location from which a local import path should be sourced.

Of all of these, gps extracts only imports and versions itself, through the SourceManager. Manifests, names, and constraints all come from the ProjectAnalyzer, which must be injected into the SourceManager by the tool implementing gps. Again, the implementor docs have more details on ProjectAnalyzer.

This division of responsibility is intentional. gps deals with unambiguous information : Go's ecosystem is VCS-driven, so versions come from version control; go/parser provides standard, clear answers about Go source code, including imports. The other issues are left up to the implementing tool.

More concretely, most of the information the solver operates on comes from one of three SourceManager methods:

  • ListVersions(ProjectName) ([]Version, error): This method takes a name - an import path corresponding to a repository root - and returns a slice of versions representing the current set of versions - branches and tags - offered by the upstream repository.
  • ListPackages(ProjectName, Version) (PackageTree, error): given a name AND a version, and performs static analysis on the tree rooted at that import path, at that version, and returns the result.
  • GetManifestAndLock(ProjectName, Version) (Manifest, Lock, error): given a name and a version, this checks the right version out to disk, then asks the injected ProjectAnalyzer to analyze that tree and provide us with a manifest and/or lock.

Version queues and (transitive) stability

For the user, a lock file is the source of their build's stability and reproducibility: it describes exactly which versions (revisions, ideally) should be used for each dependency. Within the solver, however, locks don't play much of a role. Mostly, we care about Manifests, as that's where constraint information comes from, whereas Locks are quite literally the solution we're looking to produce produce.

But there's one exception. We do care about the root project's lock file; unless explicitly instructed otherwise, we want to produce a solution that's as close to what's in that lock as possible.

If it's not already clear, ordering is exceedingly important to the solver. The solution we end up picking is hugely dependent on the order in which we attempt the various available versions for each of our dependencies. If the first atom we visit works, then we'll never need to try another one. That basic fact is exactly what we exploit in order to retain the version specified in a lock file, if at all possible.

When we create a version queue for a project that's being visited for the first time, we check to see if the root's lock file (if present) names an atom for that project. If it does, and the provided version is acceptable with respect to current constraints, then we push it onto the front of the version queue. Doing so is sufficient to guarantee that the locked version will be used unless something else directly conflicts with it.

...almost sufficient, anyway. It's possible that, in search of a solution, the solver might backtrack to the locked atom and search for a new solution by changing it. Even if that ends up successfully finding a solution, there may have been another another possible solution that we could have found by working through a different, unlocked project instead.

To prevent this from happening, we make sure that projects present in the root lock are attempted before almost any other project. If they're attempted earlier, then they're selected earlier, and because backtracking moves backwards through the stack of selected versions, we can know that we won't make it back to a locked project until after we've exhausted the other options. This guarantees that if some locked versions changed in pursuit of a solution, then no solution was possible without at least one locked version changing. Once again, ordering is exceedingly important to the solver.