Skip to content

Commit

Permalink
rfc: Don't add metadata to package.json
Browse files Browse the repository at this point in the history
  • Loading branch information
isaacs committed Aug 15, 2019
1 parent 0b4df98 commit d9615e5
Showing 1 changed file with 291 additions and 0 deletions.
291 changes: 291 additions & 0 deletions accepted/0000-no-package-json-_fields.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,291 @@
# Do Not Put Meta Information In `_`-Prefixed `package.json` Fields

## Summary

Currently, npm stashes various information in the `package.json` files that
it creates.

We will stop putting metadata in `package.json` files in `node_modules`.
The fields that serve a purpose will be relocated to
`node_modules/.arborist-resolutions.json`. When this metadata does not
exist, npm will make a best effort to gather metadata from
`package-lock.json` or `yarn.lock`.

## Motivation

These fields are only relevant to npm, and as of version 6, npm relies on
the presence of these fields for most things it does with packages on disk.

This creates problems when teams use both npm and yarn or pnpm, because
the fields are missing.

In particular, `npm ls` does not function properly if these fields are not
set, treating packages without the `_fields` as if they were not present in
the tree at all. Running `npm install` will cause npm to behave in
confusing ways as well.

Writing back to `package.json` adds work to the installer's process, which
`npm ci` attempts to streamline. However, the mechanism of doing this
means that `npm ci` and `npm install` write _different_ metadata to
installed `package.json` files, further adding to the confusion.

All of this would be justifiable if these fields were necessary, or at
least provide some significant benefit. However, this is not the case.
Several of them are irrelevant or unnecessary, and rather than aid in
debugging, they merely add confusion.

## Detailed Explanation

The default in npm has traditionally been to write back the `package.json`
file stashing whatever metadata npm was already using for its own purpose.
However, some of them are no longer necessary, and can be removed with no
problem. Others need to find a new home.

A real-life example:

```json
{
"_args": [
[
"[email protected]",
"/Users/isaacs/dev/js/tap"
]
],
"_from": "[email protected]",
"_id": "[email protected]",
"_inBundle": false,
"_integrity": "sha512-HxJdYWq1MTIQbJ3nw0cqssHoTNU267KlrDuGZ1WYlxDStUtKUhOaJmh112/TZmHxxUfuJqPXSOm7tDyas0OSIQ==",
"_location": "/which",
"_phantomChildren": {},
"_requested": {
"type": "version",
"registry": true,
"raw": "[email protected]",
"name": "which",
"escapedName": "which",
"rawSpec": "1.3.1",
"saveSpec": null,
"fetchSpec": "1.3.1"
},
"_requiredBy": [
"/",
"/cross-spawn",
"/execa/cross-spawn",
"/istanbul-lib-processinfo/cross-spawn",
"/spawn-wrap"
],
"_resolved": "https://registry.npmjs.org/which/-/which-1.3.1.tgz",
"_spec": "1.3.1",
"_where": "/Users/isaacs/dev/js/tap",
"...": "the rest of package.json"
}
```

Most of this information is duplicative of information that is stored in
`package-lock.json`, or generated by the installer when building the tree
and not relevant afterwards.

Much of it can be omitted without giving up much, especially since it is
already stored elsewhere.

### Fields

- `_args` - the set of arguments to the install methods that resolved to
this dependency. (This is a 2-d array, and will have multiple entries if
a package is deduped more than once to a given location.)

As far as I can tell, this is fine to remove, because it is not
load-bearing anywhere in the system.

- `_inBundle` - Whether the package or any of its parents are a
`bundleDependency`.

This information can be deduced from the package.json files on disk.

- `_where` - Excessive. If you can find the `package.json` file to
read it, then you already know where it is.

Furthermore, since this captures the actual path on disk when the package
is _installed_, it can unexpectedly leak misleading information
in the case of `bundleDependencies`. Installing `[email protected]` will
result in a file at `node_modules/tap/node_modules/ink/package.json`
with a `_from` field of `[["[email protected]","/Users/isaacs/dev/js/tap"]]`.

- `_integrity` - The SRI integrity field for this package. This is
important for the `package-lock.json` to be able to ensure that we are
receiving the same bits as the lockfile expects.

- `_shasum` - Legacy sha1 integrity check. Converted to a `sha1` SRI
integrity field if present.

- `_requested` - The [npa](http://npm.im/npm-package-arg)-parsed specifier
that was requested.

If a package is required multiple times, `_requested` is the earliest
request that led to it being fetched.

This is load bearing throughout the system, but most of the cases where
it's being used will be modified or replaced in the Arborist refactor.
Most places where `node.package._requested` is being used, it would be
more appropriate to store the information on `node.requested` instead.

Arborist Node objects have a `edgesIn` Set that captures all dependents
on a package, with the spec that requires them, so this need
is filled elsewhere.

Conclusion: can be removed.

- `_resolved` - The specifier that was ultimately resolved. For registry
dependencies, this is the tarball url that was ultimately fetched. For
git dependencies, this is the requested git url with `#<commit-hash>`
attached to the end. Again, this metadata is important because we must
ensure that we get the same package tree between runs and on different
machines.

- `_from` - Legacy form of `_requested`. Passed to npa to turn into a
parsed object.

- `_spec` - Just an copy of `_requested.rawSpec`. Unnecessary.

- `_id` - `${name}@${version}`. Unnecessary.

- `_requiredBy` - This is load bearing currently, to create the package
trees shown in `npm ls`. However, Arborist adds a `node.edgesIn` field
which lists all the dependencies on a given node along with their types.

Still, having that information in-place is useful when attempting to
determine why a given package in the file tree is being included.
Using `npm ls`, this can be difficult to determine, since `npm ls` does
not provide a facility for filtering by the _location_ of a node, only
by its _name_.

A subsequent RFC will discuss adding `npm ls <location>`, which will
address this use case in a more user-friendly way than looking at
implementation details directly.

Conclusion: can be removed.

- `_shrinkwrap` - This is a complete copy of the `npm-shrinkwrap.json` file
if it exists in the package. Since it's a duplicate of data that already
exists (in roughly the same location), it can be safely omitted.

While a `remote` or `git` dependency might satisfy a version dependency
(based on the same version number), but the reverse is not true.
Arguably, if we have a dependency on a git or remote dep, then a
non-git or non-remote dep is not valid.

### Proposal

Add a file at `node_modules/.arborist-resolutions.json` which stores the
only two pieces of metadata that npm really needs. This file has the
following format:

```json
{
"/package-name": {
"resolved": "${current _resolved field}",
"integrity": "${current _integrity field}"
},
"/package-name/nested-dep": {
"resolved": "${url}",
"integrity": "${sri}"
},
"..."
}
```

The key is the `Node.location` field. Metadata that would have been
stored in `_integrity` and `_resolved` fields in
`node_modules/a/node_modules/b/package.json` will be in the
`.arborist-resolutions.json` under the key `/a/b`.

This information will _not_ be tracked for `bundleDependencies`, as they
are never fetched directly.

If the `node_modules/.arborist-resolutions.json` file is not present, then
attempt to get the reso

## Rationale and Alternatives

This is closest to how pnpm stores its lock file in
`node_modules/.pnpm-lock.yaml`. However, rather than storing the entire
lock file in `node_modules`, npm will store only the information that
cannot be read from the reified `node_modules` package hierarchy.

### Storing in the Cache

Another potential option is to stash this information in npm's cache, like
Yarn does. Both the `~/.npm` and `node_modules` folders are to some extent
transient, but it would be a step away from letting the npm cache be
focused on providing a content-addressible cache fully managed by `cacache`
in a way that the rest of the system can mostly ignore.

It does make a kind of intuitive sense to store integrity data in
the cache, and `cacache` does already map a request key to an integrity
guarantee. However, this information is more closely tied to the
installation in a given project tree. Ie, it's not just answering the
question "what is the integrity for the tarball at $url", but rather, "In
this project, when requesting `<spec>`, what url and integrity resolution
should I expect?"

### Storing Many Files Instead of One

Another option might be to create a folder at `node_modules/.arborist`
which can store anything we might find useful.

A single file for resolutions and integrity does limit our capabilities for
parallel/concurrent installations; either the file gets corrupted (if
written non-atomically), or the last writer wins, or we have to explore
some complex merge logic. This seems like an edge case, and since
resolutions will likely be identical most of the time anyway, it's not
particularly compelling.

One argument against writing out multiple
`node_modules/.arborist/package-metadata/some-package-resolution.json`
files is that it's slightly slower to write a lot of little files than a
single one, especially since it's yet another thing to shuffle around as
we move packages around in the tree. It's easier to just keep the
resolution and integrity on the Node object, and walk the tree in memory to
generate the JSON output.

### Compatibility Requirements

There is a high cost to breaking changes in `package.json` or
`package-lock.json` semantics, because this affects colleagues on a project
and, in the case of package.json, downstream installers.

However, in this case, the only compatibility issue with removing \_fields
from package.json would be if a user _downgraded_ their npm client to a
version that relied on this metadata, which is far more rare.

Improving compatibility with current yarn and pnpm versions is more
valuable than maintaining compatibility (on this feature) with previous npm
clients.

## Implementation

- Arborist.loadActual and Arborist.loadVirtual will read
`node_modules/.arborist-resolutions.json` for expected `_integrity` and
`_resolved` values.
- If the resolutions file is not found, then loadVirtual will look to the
`package-lock.json` or `yarn.lock` file (since loadVirtual requires the
presence of a lock file anyway); loadActual will look in the lock file,
falling back to the `package.json` file in the realized package tree.
- The integrity and resolved values will be stored on the Node objects, but
_not_ written to `package.json`.
- If thet integrity or resolution of a previously installed package cannot
be ascertained, then the package-lock.json will not include this data.
- When installing from a lock file that does not include integrity or
resolution, accept whatever integrity value the registry provides, or
whatever file we download.
- Whenever we make a change to the `node_modules` tree on disk, write the
resolved and integrity values back to
`node_modules/.arborist-resolutions.json`, even if `--save` is false.
This doesn't lock the dependency in place; it only makes it _possible_ to
lock it properly later, if desired.

## Prior Art

`node_modules/.pnpm-lock.yaml`

`~/Library/Caches/Yarn/v4/npm-<package>-<version>-<hash>/node_modules/<package>/.yarn-metadata.json`

0 comments on commit d9615e5

Please sign in to comment.