Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Detect ESM syntax in every ambiguous file #50064

Closed
GeoffreyBooth opened this issue Oct 6, 2023 · 28 comments
Closed

Proposal: Detect ESM syntax in every ambiguous file #50064

GeoffreyBooth opened this issue Oct 6, 2023 · 28 comments
Labels
esm Issues and PRs related to the ECMAScript Modules implementation. module Issues and PRs related to the module subsystem.

Comments

@GeoffreyBooth
Copy link
Member

GeoffreyBooth commented Oct 6, 2023

Building off of #50043 (comment), this is an alternate proposal to #50043. This aims to achieve the same goals of allowing ESM syntax in “loose” files, where there are no package.json files present; and in files whose nearest parent package.json lacks a type field. This would permit ESM syntax without needing to opt in, while avoiding breaking existing scripts and tutorials. Like #50043, this proposal would avoid any breaking changes, and the aim is to eventually make this enabled by default without requiring a flag.

The new behavior would be as follows:

  • Node would attempt to evaluate every ambiguous file (a file with no explicit .mjs or .cjs extension, either no package.json or one that lacks a type field) as CommonJS, as it already does today. If a file throws a SyntaxError for something that is only allowed in ES modules (import or export statement, import.meta, top-level await), Node would try again to evaluate the file as an ES module.
  • This behavior would apply to ambiguous files within node_modules.
  • This would apply to the main entry point and to files referenced via import. It would not apply to files referenced via require.
  • This behavior would be disabled by any marker of explicitness: .mjs or .cjs extension, a package.json type field, --input-type or --experimental-default-type.

Pros, in comparison with #50043:

  • Avoids the application behaving differently based on entry point: you can reference the entry point from a wrapper, such as a CommonJS test runner, and it will still run as ESM.
  • Prevents the “package author published ambiguous ESM” problem. If a package author develops locally in this mode, relying on it to enable ESM for ambiguous files, their package would still work for consumers.
  • This “per file” approach is the same as that taken by many other tools.

Cons:

  • Presumably this is more of a performance impact, as we’re parsing every ambiguous ES module twice. We should probably measure this to see how much of an impact it is. There should be no impact for CommonJS files, and also no impact for ES modules in an explicit package.json module scope.
  • Any files with ambiguous syntax, that could evaluate successfully as either CommonJS or ESM, would be run as CommonJS in this mode. A common example of such a file would be a polyfill that only sets a global variable, without importing or exporting anything.

@nodejs/loaders @nodejs/tsc

@GeoffreyBooth GeoffreyBooth added module Issues and PRs related to the module subsystem. esm Issues and PRs related to the ECMAScript Modules implementation. labels Oct 6, 2023
@bmeck
Copy link
Member

bmeck commented Oct 6, 2023

I'd like to add a Pro:

  • This "double parse" behavior is present in many tools while the alternative is not and generally would not add configuration burden to users of those tools or the tools themselves

Per general discussion:

I'd note if this is considered an error recovery mechanism rather than discrete choice of the module author swapping the default type to be ESM is likely not an issue as it would still fall into the error category. This would also make it sensible to have the ability to warn upon encountering this and allow fixing in an iterative manner.

If this is framed as an explicit design choice for userland authors rather than a recovery, it would be something authors see as reliable and likely would never be getting warnings about. I'd generally like to avoid this framing to allow a swap over and/or removing the ambiguity being seen as a good thing.

I would also note parsing files is DX cheaper than restarting a process and forcing updates to the code before doing so.

Performance impacts are likely something that could also be left to package authors. Generally, authors do like their library to be more performant. Additionally this impact is only once, during startup. The errors from import would generally be near the top of the file bailing out the parse early as well. The impact is further potentially lessened if we can reuse code cache between the parses.

@GeoffreyBooth
Copy link
Member Author

GeoffreyBooth commented Oct 6, 2023

Something we would need to decide is whether we’d want to print a warning on double parse. Something like “Node.js tried to run ./file.js as CommonJS but couldn’t, so it tried again to run it as an ES module. Avoid the performance penalty of Node.js making two attempts by adding "type": "module" to ./package.json.”

I think we wouldn’t want to warn on dependencies, as it would be annoying to see warnings triggered by third-party code. We could use the “under node_modules“ logic to decide not to show the warning. We also shouldn’t warn for files run with no controlling package.json file, as I think a double parse for loose files is an acceptable performance penalty.

The benefit here is that the user is made aware that they’re running their code more slowly than it needs to be run, and how to fix it. The downside is that this is another bit of friction for new users; it makes it seem like they’re doing something wrong (which in a sense, they are) which means that any ESM-first beginner tutorials would need to address this. Also is Node really “ESM-first” if it prints a warning if you don’t opt into ESM.

We should also do a benchmark to see just how much of a penalty this is. If it’s trivial, or if we can find a way to make it trivial, maybe it’s not worth warning about.

@bmeck

This comment was marked as resolved.

@GeoffreyBooth

This comment was marked as resolved.

@bmeck

This comment was marked as resolved.

@mcollina
Copy link
Member

mcollina commented Oct 6, 2023

I think this proposal would slow things down for the average user, as well as printing a warning for everybody, as we all have quite old dependencies that are unlikely to get an update.

@GeoffreyBooth
Copy link
Member Author

GeoffreyBooth commented Oct 6, 2023

I think this proposal would slow things down for the average user, as well as printing a warning for everybody, as we all have quite old dependencies that are unlikely to get an update.

The warning isn’t part of the proposal per se; we don’t necessarily need to print any warnings at all, for anyone. It was just an idea, of something that we might additionally want to consider. I wouldn’t want any warning that was triggered by dependencies; it would need to somehow be scoped to user code.

As for performance, we don’t know until we test. By definition this new mode would do detection only for files that currently error, so no app that works today would get any slower. The question is how slow this “ESM by detection” mode is, compared against the baseline of “explicit ESM” (i.e. "type": "module"). Arguably any level of performance is better than erroring, so by that logic any amount of performance hit doesn’t matter; but as a matter of principle Node should be performant by default, without someone needing to know a secret to squeeze out the optimal performance. I think cjs-module-lexer is a good point of reference, where we have a Wasm dependency doing a parsing pass for every imported CommonJS module, and it’s so fast that it’s not a noticeable performance cost. If per-file detection can happen at that level of efficiency, it might be worth it.

One last thing about performance: module loading time generally only affects initial startup time, for the typical app that isn’t doing dynamic import()s long after initialization. Many users don’t care all that much how long their apps take to start up; if you’re building a web server, that’s intended to run for hours or days, it doesn’t matter all that much if it takes 1 second or 2 seconds to start. I think this is why so few Node users bother to bundle their apps into a single file or with minified code like is common for frontend; for most use cases, startup time just doesn’t matter. And for users who care, besides adding "type" they can also bundle. This isn’t to say that we shouldn’t care about startup time, but just that it’s perhaps not a disqualifying metric.

@bmeck
Copy link
Member

bmeck commented Oct 6, 2023

quite old dependencies that are unlikely to get an update.

These should remain unaffected by this specific proposal since this only affects a well known error condition and doesn't propagate results from one file to others.

@joyeecheung
Copy link
Member

I think cjs-module-lexer is a good point of reference, where we have a Wasm dependency doing a parsing pass for every imported CommonJS module

I don't think we need a WASM module for for the proposed use case, V8 should be fast enough because we are looking for early SyntaxErrors, instead of a particular legal way of property assignment. I doubt a WASM dependency can be faster than V8 in this regard, the V8 parser (or, to be precise, the V8 preparser that's used for compiling top-level code) is already built for looking for early syntax errors quickly. It's also likely that these files would have an import statement very early in the file, so V8 should fail quickly instead of wasting time on parsing the rest of the file.

@Qard
Copy link
Member

Qard commented Oct 6, 2023

I think having slightly slow ambiguous ESM is better than no ambiguous ESM if it's a fallback like this which should not impact CJS performance unless it fails with a SyntaxError. It gives us a path to be able to drop CJS someday in the future, if that possibility ever comes. (I personally have my doubts we will ever be able to drop CJS, but I see no harm in setting ourselves up to be able to if it ever becomes a reasonable possibility.)

@WebReflection
Copy link
Contributor

I expect most of the time the error to be super early as import syntax is usually on top, not sure how much of a concrete slow down it is. I also would expect no warnings unless a flag is set … when such flag is set I’d expect all warnings including node_module folder entries, effectively allowing anyone to actually check possible ambiguous dependencies and either ignore these or deal with these.

This has the same goal of informing users which is not clear if they would ever benefit from such warning for their own entry point of choice, they could benefit as well for their dependency tree, imho.

Other than this, I’d love to see this happening ❤️

@bnoordhuis
Copy link
Member

This would apply to the main entry point and to files referenced via import. It would not apply to files referenced via require.

Why is that? (I think I know but I'd like to see it explained explicitly.)

@targos
Copy link
Member

targos commented Oct 7, 2023

I wonder how this would interact with the customisation hooks. It seems to me that it should happen when two conditions are present:

  • The final format returned by the hooks is "commonjs".
  • The parent module was evaluated as ESM.

@GeoffreyBooth
Copy link
Member Author

Why is that? (I think I know but I'd like to see it explained explicitly.)

Because you can't require an ES module.

@GeoffreyBooth
Copy link
Member Author

I wonder how this would interact with the customisation hooks. It seems to me that it should happen when two conditions are present:

Not if the parent module is ESM; if the module in question was imported. A CommonJS module can import() an ambiguous file.

I think realistically we should maybe have a new format value, like detect, and only do the detection if that is returned for a module. Otherwise how could a hook author opt out of this? Or maybe there's no reason to want to opt out?

@WebReflection
Copy link
Contributor

The explicit hints via extension or type in package json are the opt out already, right?

@GeoffreyBooth
Copy link
Member Author

The explicit hints via extension or type in package json are the opt out already, right?

I meant how could the hook author force detection to happen or not happen, not how could the application or library author opt out. Generally customization hooks should be able to do whatever overrides or custom behaviors the hook author desires.

@WebReflection
Copy link
Contributor

My point is that if a dependency uses dynamic import you can’t add a new format down that road … I think this feature as enabled by default with a flag that shows warnings if desired at all levels would both solve concerns and help developers moving forward. If there is a default, that acts differently in node modules, with a new format to add at runtime, the branching becomes rather unbearable and it will be harder to understand anything happening behind the scene, imho.

@GeoffreyBooth
Copy link
Member Author

GeoffreyBooth commented Oct 8, 2023

if a dependency uses dynamic import you can’t add a new format down that road

I’m not sure what this means. The “fallback to evaluate as ESM” only occurs on syntaxes that currently error in CommonJS. Since import() doesn’t error, it’s not part of this.

with a flag that shows warnings if desired at all levels would both solve concerns and help developers moving forward

This sounds like a separate feature that could be implemented as a follow-up. It’s also probably achievable via customization hooks.

@GeoffreyBooth
Copy link
Member Author

@isaacs:

It should auto detect for all files whose package.json resolves to the same package.json of the entry point module.

Not for anything in other packages or deps, but for everything directly in the entry point’s local package.

The reason this proposal includes dependencies is to avoid the hazard where a library works for its author, because the author is working in “ambiguous detection” mode, but then the library wouldn’t work for any consumers who install it if the detection stops at the node_modules line. Such libraries still wouldn’t work for older versions of Node, but at least they would work for any versions of Node starting with the version that includes detection of ambiguous files.

@isaacs
Copy link
Contributor

isaacs commented Oct 8, 2023

tl;dr - I'm in favor of this proposal pretty much exactly how @GeoffreyBooth has written it here, and rejecting the suggestion I made above.


hazard where a library works for its author, because the author is working in “ambiguous detection” mode, but then the library wouldn’t work for any consumers who install it

Good point, I was just thinking the same thing not long after posting that 😅

I think a better way to refine the suggestion (which I'm thinking I'm actually not advocating, but it's worth exploring to see why it's a bad idea) is that the setting of default type should be package-specific rather than module-specific. Ie:

  • If the entry point argv[1] module is .js and autodetected to ESM, then act as if the package.json in the local folder had "type": "module"
  • If a module within a package is autodetected, treat its package.json as if it contained "type": "module"
  • these should be the same within a given package.json's context, to prevent "works on my machine"

More poking at this:

What happens in this case?

index.js

// esm
import b from './bar.js'

bar.js

// commonjs
module.exports = import('./foo.js')

foo.js

// esm
export * from './baz.js'

baz.js

// commonjs
console.log(__dirname)

Would the autodetection happen at the entry point, then evaluate all the other modules as ESM? Or is it going to autodetect each module, and ultimately treat them as if they were named index.mjs, bar.cjs, foo.mjs, and baz.cjs?

It seems like if the state of the entry point sets the default .js parsing mode universally (or even universally within a given package) to "type":"module" then the commonjs files in the mix would fail if an ESM module is loaded first, and vice versa, so it'd be better to detect ESM syntax in any module anywhere that is implicitly being loaded as CJS (ie, it's named *.js and its nearest package.json file is missing or does not contain a "type" field.)

If it was package-specific, then it would at least behave the same way locally as when installed. But, for example in the case above, node baz.js would treat baz.js as cjs, and node foo.js would treat it as ESM (and fail).

@GeoffreyBooth
Copy link
Member Author

GeoffreyBooth commented Oct 8, 2023

I was initially leaning against this proposal in favor of #50043 because I was assuming that a) there’s no way this version could be performant, and b) there surely must be edge cases that would make this approach nonviable. I’ve since been persuaded that maybe this could be performant, or at least there’s no obvious reason why it couldn’t be, so it’s probably worth the effort of attempting an implementation to measure how fast or slow it is. And I keep trying to think of edge cases, and start writing out scenarios like “but what about dep.js that’s console.log(3) and can evaluate successfully as either CommonJS or ESM,” and I keep winding up with the same result—that the algorithm works. I can’t come up with an edge case where it breaks down, even including dependencies. Of course, this is everyone’s invitation to prove me wrong, so feel free 😄

The closest I can come to a case that breaks is the ambiguous-syntax file with no import or export and no CommonJS wrapper variables, that in effect runs as sloppy mode via detection or as ESM in a "type": "module" scope. But that doesn’t really break; it’s vanishingly rare for the lack of 'use strict' to make a difference in real world code, and the user can prevent the ambiguity by simply adding either 'use strict' or "type": "module". I noted this case under “Cons” in the top post, but I don’t think it should be disqualifying. It’s no more of a footgun than someone writing CommonJS without putting 'use strict' at the top of every file: if that’s important to you, you need to either enforce it via linting or by writing ESM (and marking your ESM explicitly, such as via "type": "module"). I don’t think the general userbase should lack this feature because a few people write code that actually benefits from strict mode protection and they neglect to opt into strict mode one way or another. Such files run as sloppy mode today and they would continue to run in sloppy mode under this proposal.

@ljharb
Copy link
Member

ljharb commented Oct 8, 2023

It's not nearly as rare as you'd think; there's a ton of heavily used packages from over a decade ago that only work when in sloppy mode, many of which have authors who are dead or burnt out and retired, and thus will never be updated. If one of these files is ran in strict mode, it will break and not be fixable.

@GeoffreyBooth GeoffreyBooth added the tsc-agenda Issues and PRs to discuss during the meetings of the TSC. label Oct 8, 2023
@GeoffreyBooth
Copy link
Member Author

I’ve started a branch for this at https://github.com/GeoffreyBooth/node/tree/ambiguous-detection. I’m not sure whether this or #50043 is the more viable approach but I think it’s probably worth trying to implement this one first, to answer the question of how significant the performance impact is when running detection on every ambiguous file. If you’d like to help me implement this please ping on https://openjs-foundation.slack.com/archives/C053UCCP940

@andrewbranch
Copy link

andrewbranch commented Oct 13, 2023

This would apply to the main entry point and to files referenced via import. It would not apply to files referenced via require

I assume you’re referring to package.json "exports" conditions here? A single file can be referenced both by "import" and "require", and if that file is in CommonJS format, that setup can work without error. For example, one way of avoiding dual-package hazard in Node.js might give import.node a CommonJS file:

{
  "name": "pkg",
  "exports": {
    ".": {
      "import": {
        "node": "./main.js",
        "default": "./main.mjs"
      },
      "require": "./main.js"
    }
  }
}

In this case, would main.js be statically determined to be referenced both by import and require (and if so, what is its format?), or do you mean that the detection algorithm would depend on the path module resolution took? Supposing that main.js is a script with no imports/exports/requires/CJS-specific usage, but performs some side effects, would it have two different module identities and evaluate twice if one file imports it and one file requires it?

import "pkg" // exports["."].import.node -> detects as ESM
require("pkg") // exports["."].require -> detects as CJS -> executes again?

@GeoffreyBooth
Copy link
Member Author

I assume you’re referring to package.json "exports" conditions here?

No, I’m referring to require calls and import statements/import() expressions in code. It’s not possible to require an ES module, so it doesn’t make sense to attempt to run detection on files referenced by require. They’re just always treated as CommonJS.

@andrewbranch
Copy link

Ok, very glad I misinterpreted that 😅

@GeoffreyBooth
Copy link
Member Author

Landed in #50096.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
esm Issues and PRs related to the ECMAScript Modules implementation. module Issues and PRs related to the module subsystem.
Projects
None yet
Development

No branches or pull requests