RFC: reproducibility #22

Closed

@dherman dherman commented Sep 14, 2018

From the RFC:

This RFC proposes a toolchain management model for Notion that’s built around a primary goal of reproducibility: that is, giving users end-to-end control over toolchains that, once set up, always exhibit identical behavior. The resulting user experience should be “install and forget”: after setting up a toolchain, a user or team never has to worry about the many forms of tool bitrot (hidden state changes such as the addition or removal of global packages, or subtle differences in tool versions) on an existing machine, between project collaborators’ machines, or when setting up a new machine. In short, reproducibility means Notion’s declarative configuration completely determines a Node environment’s behavior.

Rendered


In the [Node release process](https://github.com/nodejs/Release), there are generally two most recent even-numbered major versions of Node that are the most commonly used (corresponding to either one *Active LTS* and one *Current*, or two Active LTS, depending on the time of year), and one most recent odd-numbered major version. All of these versions are actively supported and may acquire new point-releases at any time.

Based on some initial experiments, after running `git gc`, a repository containing all versions (for one platform, e.g., darwin-x64) of a single major version of Node compresses to roughly 50MB; a repo consisting of all versions of 2 major versions of Node compresses to roughly 100-150MB; and a repo consisting of all versions of 3 major versions compresses to roughly 200MB.
Review comment (Contributor):

Regarding platform packs, these numbers seem to indicate that space requirements increase faster than major versions are added (50MB for one, 50-75MB each for two, ~67MB each for three).

If that's the case, what do we gain by bundling multiple majors together? While it's easy to imagine a developer needing all three, it seems to me equally easy to imagine them not needing all three, in which case the ballooning space requirements are doubly punishing.
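The per-major figures in the comment above can be reproduced with a few lines of arithmetic. This is only a sketch: the totals are the rough numbers quoted from the RFC, and the names `reported_sizes_mb` and `per_major_mb` are made up for illustration.

```python
# Compressed repository sizes quoted in the RFC, in MB, keyed by the number
# of major versions bundled together. Each value is a (low, high) estimate;
# only the two-major figure was given as a range.
reported_sizes_mb = {1: (50, 50), 2: (100, 150), 3: (200, 200)}


def per_major_mb(total_mb, majors):
    """Average compressed size attributable to each bundled major version."""
    return total_mb / majors


# Average size per major version for each bundle size.
per_major = {
    majors: (per_major_mb(low, majors), per_major_mb(high, majors))
    for majors, (low, high) in reported_sizes_mb.items()
}

for majors, (low, high) in per_major.items():
    print(f"{majors} major version(s): {low:.0f}-{high:.0f} MB each")
```

With these inputs the per-major average never drops below the single-major figure, which is the observation the comment is making.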

...

If the lockfile pins `"typescript"` at version 3.0.3, then entering the project and running:
Review comment (Contributor):

How does Notion know to create the tsc shim in this case? Perhaps I'm out of the loop, but I seem to recall that we're not looking at hooking npm install or yarn add anymore?


Based on some initial experiments, after running `git gc`, a repository containing all versions (for one platform, e.g., darwin-x64) of a single major version of Node compresses to roughly 50MB; a repo consisting of all versions of 2 major versions of Node compresses to roughly 100-150MB; and a repo consisting of all versions of 3 major versions compresses to roughly 200MB.

This suggests that during the Notion installation process, we could fetch a current **platform pack** containing every release of the two most recent even-numbered major versions and the most recent odd-numbered major version of Node, in a compressed git directory. (Less commonly needed versions of Node could be provided in separate repositories, perhaps one major version per repository.) This would take a bit of up-front setup time, but as a result, users would get the following benefits:
Review comment (Contributor):

> a compressed git directory

Would this be an actual archive file, or will the repository be git cloned? Do we know which would typically be faster?
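For reference, the selection rule in the quoted RFC text (the two most recent even-numbered majors plus the most recent odd-numbered major) might be sketched as follows. The function name `select_pack_majors` and the example list of majors are hypothetical; nothing here reflects how Notion actually builds a pack.

```python
def select_pack_majors(available_majors):
    """Pick the majors for a default platform pack.

    Rule taken from the RFC text: the two most recent even-numbered major
    versions plus the single most recent odd-numbered major version.
    """
    unique = set(available_majors)
    evens = sorted((m for m in unique if m % 2 == 0), reverse=True)
    odds = sorted((m for m in unique if m % 2 == 1), reverse=True)
    # Slicing tolerates short lists, so fewer than three majors is fine.
    return sorted(evens[:2] + odds[:1])


# Hypothetical set of major versions available at install time.
pack = select_pack_majors([6, 8, 9, 10, 11])
print(pack)
```

Whether the result is shipped as an archive or cloned, the set of majors it contains would be the same; only the transport differs.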

### Limited connectivity mode
We may eventually want to offer a mode that users can opt into to use smaller platform packs, for situations where limited connectivity makes it hard to download the default packs. This should not be the default, and the UX should not *encourage* this mode, because it’s too tempting to avoid up-front downloads and then end up with less compression, more aggregate downloads and disk usage, and generally slower behavior. Also, users might be confused into thinking that the initial download is wasteful and spread misinformation about disk usage. Generally, the bet is that a multi-version git repository will lead to lower disk usage, so we should not encourage the use of more minimal packs. But as long as the UX is framed in terms of the use case of *limited connectivity* (for example, `install.sh --limited-connectivity`), this should help avoid misconceptions.

Moreover, it should still be possible to upgrade a limited-connectivity install to a normal install, if the user finds themselves in better network conditions later on, without having to reinstall Notion from scratch.
Review comment (Contributor):

It may be worth considering a downgrade as well, for users switching from WiFi to cellular-tethered or some such.

- Global package binaries (via `npm i -g` or `yarn global add`) are a convenient and popular deployment model for tools—not only project build tools like `babel` and `tsc`, for which `npx` is a popular solution, but also for use cases that aren’t associated with a pre-existing project, such as `surge`, `ember new`, or `svgo`.
- Global package binaries make projects sensitive to the global state of a user’s machine, making project builds brittle, unportable, and unreliable. This is such a pervasive problem that many developers recommend against the use of global installations entirely.

A separate but related motivation is the desire to be able to install user tools on a local machine one and not worry about “drift” over time—that is, a developer should be able to install, say, `surge` or `svgo` and not worry that those tools might stop working because of changes to the currently-installed version of Node.
Review comment (Contributor):

"on a local machine and not worry"? I think there's an extra word in there


## Scenarios

It’s helpful to think about a few different kinds of scenarios that should support, and how reproducibility fits in.
Review comment (Contributor):

"scenarios that Notion should support"?


It’s natural to question whether this implementation strategy is worth the effort: an alternative approach would be to allow projects to specify less precise version requirements (such as the ranges typically expressed in the `"engines"` field of `package.json`) and assume most differences will be benign. However, behavioral divergences between versions of Node do happen and are some of the trickiest bugs to nail down. Putting in extra work up front to ensure that these divergences *cannot happen, by construction* will eventually pay for itself when scaled across the Node ecosystem.
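A small illustration of why a range leaves room for divergence while an exact pin does not. This is a sketch under simplifying assumptions: it handles only plain `major.minor.patch` strings with a half-open range, it is not npm's semver range grammar, and the machine names and version numbers are invented.

```python
def parse(version):
    """Turn 'major.minor.patch' into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def satisfies_range(version, lower, upper):
    """True if lower <= version < upper (a simplified 'engines'-style range)."""
    return parse(lower) <= parse(version) < parse(upper)


def satisfies_pin(version, pinned):
    """True only if the version is exactly the pinned one."""
    return parse(version) == parse(pinned)


# Hypothetical Node versions on two collaborators' machines.
machines = {"alice": "8.11.4", "bob": "10.9.0"}

# Both machines satisfy the range, even though they run different Nodes.
range_ok = {name: satisfies_range(v, "8.0.0", "11.0.0") for name, v in machines.items()}

# Only the machine running the exact pinned version satisfies the pin.
pin_ok = {name: satisfies_pin(v, "10.9.0") for name, v in machines.items()}

print(range_ok)
print(pin_ok)
```

Both machines pass the range check while running different major versions, which is exactly the situation in which a behavioral divergence can go unnoticed.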

Another reasonable criticism is that pinning the Node version for tools in the user toolchain means that users will not automatically benefit from performance and security improvements in Node. There are a couple of reasons this is outweighed by the benefits of pinning. First, as users upgrade the tool version itself, they will automatically get re-pinned to the newer version of Node specified by newer versions of the tool’s `package.json`. Second, we should at least allow users the option to override the tool’s specified Node version. But by letting the tool choose its platform version by default, users have a stronger guarantee that the tool will work and continue to work consistently based on how it was tested.
Review comment (Contributor):

This RFC is missing a discussion about installing user tools that do not have a pinned node version. Currently there are 0 tools that specify a "toolchain", and even if Notion is wildly popular there will be a transition period where some tools have a pinned node version and some do not. Actually there may always be some tools that do not specify a "toolchain", for whatever reason.

Notion should be able to handle this situation and preserve reproducibility. That could be using the version of node in the user's config, or if that is not specified, using the user's default version of node, and pinning that somehow. In these cases Notion could modify the installed tool's package.json to include a "toolchain" section if it is not already present.
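One way to picture the fallback order suggested above is the sketch below. Everything in it is an assumption made for illustration: the function names, the idea that the pin lives under `"toolchain"` → `"node"`, and the version numbers. It is not Notion's implementation.

```python
import copy


def resolve_tool_node_version(manifest, user_config_node, user_default_node):
    """Return (version, source) for an installed tool.

    Order: the tool's own pin, then the Node version in the user's config,
    then the user's default Node version.
    """
    pinned = manifest.get("toolchain", {}).get("node")
    if pinned:
        return pinned, "tool"
    if user_config_node:
        return user_config_node, "user-config"
    return user_default_node, "user-default"


def pin_manifest(manifest, version):
    """Return a copy of the manifest with a 'toolchain' pin added if missing."""
    pinned = copy.deepcopy(manifest)
    # setdefault leaves any pin the tool already declares untouched.
    pinned.setdefault("toolchain", {}).setdefault("node", version)
    return pinned


# A tool whose package.json has no "toolchain" section (hypothetical data).
tool_manifest = {"name": "svgo", "version": "1.1.1"}

version, source = resolve_tool_node_version(
    tool_manifest, user_config_node=None, user_default_node="10.9.0"
)
pinned_manifest = pin_manifest(tool_manifest, version)
print(source, pinned_manifest["toolchain"])
```

Recording the resolved version at install time means later changes to the user's default Node would not silently change what the tool runs on.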

dherman commented Dec 6, 2018

I think what I need to do next is rip out the discussion of platform packs and git representations, which should be separately investigated as an optimization technique down the road. That way this RFC can stay focused on the core ideas of reproducibility, which are mostly orthogonal to that implementation approach.

@dherman dherman mentioned this pull request Dec 8, 2018
dherman commented Feb 1, 2019

Closing since this is superseded by #27.

@dherman dherman closed this Feb 1, 2019