Experimental new parser for Nushell, with the aim of eventually replacing the old parser (including related data structures, such as `EngineState`).
To test it, run the parser on some file, e.g., `cargo run -- spam.nu`.
The goals of the new parser are:
- Readability, maintainability (making it easier for contributors to grasp and improve)
- Performance (we need faster parsing so that, for example, parsing the standard library on Nushell startup is effectively instantaneous)
- Correctness (thanks to the first point, it should be easier to iron out the quirks of the old parser)
Unlike the old parser, the new parser processes the source file in three stages (with a fourth planned):
- Parsing: Splitting the source file into tokens (lexing) and arranging them into AST nodes.
- Resolving: Binding symbol names to definitions such as commands or variables and making sure these names are visible in the scopes where they are supposed to be.
- Typechecking: Resolving / inferring types of values and making sure they match their expected types.
- (TBD) Codegen: Emitting Nushell's IR.
This stage-based approach clearly separates what is being done at each step and makes it easier to contribute. For example, one can implement syntax parsing for a feature inside the parsing stage without needing to worry about type checking just yet; that can be added later as an improvement. In the old parser, all these stages were intertwined.
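To make the separation concrete, here is a purely illustrative sketch of the staged flow; all of the names below are made up for illustration and do not necessarily match this crate's actual API:

```rust
// Hypothetical, self-contained sketch of the staged flow. All names are
// illustrative and do not necessarily match this crate's actual API.
#[derive(Default)]
struct Compiler; // would hold the flat AST, name bindings, types, ...

fn parse(_compiler: &mut Compiler, _source: &[u8]) {} // stage 1: lexing + building AST nodes
fn resolve_names(_compiler: &mut Compiler) {}         // stage 2: bind names to commands/variables per scope
fn typecheck(_compiler: &mut Compiler) {}             // stage 3: infer and check types

fn compile(source: &[u8]) -> Compiler {
    let mut compiler = Compiler::default();
    parse(&mut compiler, source);
    resolve_names(&mut compiler);
    typecheck(&mut compiler);
    // codegen(&mut compiler);                        // stage 4 (TBD): emit Nushell IR
    compiler
}

fn main() {
    let _ = compile(b"def greet [] { 'hello' }");
}
```

The point is that each stage can be worked on (and reviewed) largely in isolation.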
The `Compiler` data structure holds the AST, similar to `EngineState` in the current Nushell (the `Compiler` may well morph into `EngineState` over time). Unlike the old parser, where the AST was arranged as a tree, the new AST is a flat vector of nodes indexed by their position (`NodeId`) in the vector. This flat structure should bring better performance and makes the AST easier to inspect.
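As a rough sketch of what "flat AST indexed by `NodeId`" means (the type and field names here are illustrative, not necessarily the ones used in this crate):

```rust
// Illustrative flat AST: nodes live in one vector, and parents refer to
// children by index (NodeId) instead of owning them.
#[derive(Clone, Copy, Debug)]
struct NodeId(usize);

#[derive(Debug)]
enum AstNode {
    Int,
    Plus,
    BinaryOp { lhs: NodeId, op: NodeId, rhs: NodeId },
}

#[derive(Debug, Default)]
struct Compiler {
    ast_nodes: Vec<AstNode>,                // NodeId(i) refers to ast_nodes[i]
    spans: Vec<std::ops::Range<usize>>,     // source span per node, same indexing
}

impl Compiler {
    fn push_node(&mut self, node: AstNode, span: std::ops::Range<usize>) -> NodeId {
        self.ast_nodes.push(node);
        self.spans.push(span);
        NodeId(self.ast_nodes.len() - 1)
    }
}

fn main() {
    // "1 + 2": the parent BinaryOp stores the indices of its children.
    let mut compiler = Compiler::default();
    let lhs = compiler.push_node(AstNode::Int, 0..1);
    let op = compiler.push_node(AstNode::Plus, 2..3);
    let rhs = compiler.push_node(AstNode::Int, 4..5);
    let expr = compiler.push_node(AstNode::BinaryOp { lhs, op, rhs }, 0..5);
    println!("{:?} @ {:?}", compiler.ast_nodes[expr.0], compiler.spans[expr.0]);
}
```

Because children are referenced by index rather than owned, the whole AST can be dumped and iterated as a plain vector, which is what makes it easy to inspect.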
Some other notable differences vs. the old parser:
- Syntax of keywords like `def` is embedded directly into the language instead of being a weird hybrid of a regular command with parser side effects.
The parser is still largely incomplete; whole swaths of features are missing. Once we are able to emit some Nushell IR, we should start moving the parser into the main Nushell codebase behind a feature / CLI flag and work on integration with the rest of the engine, etc. (To avoid an engine-q situation as much as possible, for those who remember :-).)
We've had great success using `cargo insta` for snapshot testing. It takes a bit of getting used to working with the snapshots, but they are really useful for reviewing the impact of your changes.
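For illustration, a snapshot test with insta looks roughly like this (assuming `insta` as a dev-dependency; `render_parse_output` is a hypothetical stand-in for running the parser and dumping its output):

```rust
// Illustrative snapshot test using the insta crate. `render_parse_output`
// is a made-up stand-in for "run the parser and dump its state as text".
#[cfg(test)]
mod tests {
    fn render_parse_output(source: &str) -> String {
        format!("parsed: {source}") // placeholder output
    }

    #[test]
    fn snapshot_simple_pipeline() {
        // The first run records a snapshot file; when the output changes,
        // the test fails until the new snapshot is reviewed and accepted.
        insta::assert_snapshot!(render_parse_output("ls | length"));
    }
}
```

Run the tests as usual with `cargo test`; when the output changes, `cargo insta review` lets you inspect and accept the updated snapshots.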
We use tango for running the benchmarks. This allows us to easily compare between different Git revisions.
To run the benchmarks (using `cargo-export`):
- `cargo export target/benchmarks/compiler -- bench` -- builds the benchmark binary in `target/benchmarks/compiler`
- `target/benchmarks/compiler/benchmarks solo -s 100 --warmup true` -- runs the benchmarks (`--help` shows the available CLI parameters)
To compare against another revision, build it in another directory, then run `<another-directory>/benchmarks compare <reference-directory> ...options`.
See this helper in the Nushell repository for a more streamlined experience (the `benchmark-log` helper was written before the `solo` run option was available).
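If you want to add a benchmark, a new tango benchmark roughly follows the pattern from tango-bench's documentation; double-check the exact API against the tango-bench version used in this repository. `parse_source` below is a made-up placeholder:

```rust
// Sketch of a tango benchmark, following the pattern from tango-bench's
// documentation (requires `harness = false` for the [[bench]] target in
// Cargo.toml). `parse_source` is a made-up placeholder for "run the parser".
use tango_bench::{benchmark_fn, tango_benchmarks, tango_main, IntoBenchmarks};

fn parse_source(source: &[u8]) -> usize {
    source.len() // placeholder for the actual parsing work
}

fn parser_benchmarks() -> impl IntoBenchmarks {
    [benchmark_fn("parse_small_def", |b| {
        b.iter(|| parse_source(b"def foo [] { 1 + 2 }"))
    })]
}

tango_benchmarks!(parser_benchmarks());
tango_main!();
```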
You can just start hacking; the code should be more approachable than the old parser, and the stage-based structure makes it possible to split adding a feature into separate steps. It's better to start small and keep the PRs focused.
Apart from coding the parser directly, we could use more tests and better organization of the current ones. Perhaps the test output could be printed more readably? The old parser's tests can also be brought over, and in general, harmonizing the tests with the current Nushell codebase would be a wise idea. The CI could be expanded to include Windows. Benchmarks are also missing (and should be added, since performance is one of our goals!). In other words, even if you don't want to work on the parser directly, there are plenty of other ways to contribute!