decouple parser from PNode #425

haxscramper · 2022-09-01T10:07:10Z

wip implementation of the #423

haxscramper · 2022-09-01T10:38:07Z

@saem this is a first pass on the implementation, which uncovered some implementation questions that I wrote down in changes - look them over and tell me about your thoughts on direction/solution. I'm considering whether I should also include the change to the DOD lexer in this PR or split it into a separate subsequent one.

saem

I think I got them all, the PR is already really starting to shape up.

saem · 2022-09-01T14:10:48Z

compiler/ast/ast_parsed_types.nim

+                     # consistently copies this information around anyway,
+                     # so I will let it stay this way for now.
+    token*: Token # TODO Replace full token value with an index information
+    kind*: TNodeKind # NOTE/QUESTION - for now the same kind of nodes is


Yes, or at least one will be a super set of the other. Ultimately separate enums are likely best.

saem · 2022-09-01T14:31:30Z

compiler/ast/ast_parsed_types.nim

+
+    # HACK explicit flags in order to track down all 'extra' information
+    # that is collected during parsing.
+    isBlockArg*: bool # QUESTION add 'nkStmtListBlockArg' or similar node


You found it too!

OK, so this dumb thing is used to tell the semantic analysis layer about a statement list (possibly also block) that are being passed to a macro, template, whatever. During semantic analysis we collapse/flatten nested statement lists into one level, but we also do statement list unwrapping, so if there is only one node in a statement list we unwrap it except if there is a defer (I think this might be a bug/design flaw).

That first flattening is guarded by this dumb flag. I personally think we should always flatten, no matter what, but never unwrap in semStmtList (it's the last/second last proc in semstmts, I rewrote it recently so it should be readable). Without that guard a macro can receive arbitrary chunks of AST not always wrapped in a statement list and it makes the really annoying/painful to write.

I do believe between fixing the unpacking logic and the AST the parser generates this shouldn't be a problem. My guess at the most correct fix is sematic analysis layer should:

unwrap as determined by the receiving node, so be done on add/index assign based on receiving node kind

flattening should always take place in semantic analysis of a statement list because that's just always right

Then this dumb guard should not be necessary. Separating the AST used by sem, parsing, and backend means we can change add and index assign behavior and it only impacts sem so we can do destination based unwrapping rather than have it live incorrectly in semStmtList. 🎉

Sorry that was a rather lengthy explanation.

saem · 2022-09-01T14:33:23Z

compiler/ast/ast_query.nim

@@ -97,6 +97,8 @@ const
  callableDefs* = nkLambdaKinds + routineDefs

  nkSymChoices* = {nkClosedSymChoice, nkOpenSymChoice}
+  nkFloatKinds* = nkFloatLiterals # QUESTION remove float literals


I don't think so, but maybe I'm missing some key point? Whatchya thinking?

We have node kinds for everything and only float literals. It is a naming question, and it is a set of node kinds, not a set of literals. Also nkLiterals is inconsistent with Atomic kinds in the macros.nim, but that's another question. I think nkLiteralKinds, nkFloatKinds or even nkFloatLiteralKinds make more sense (the last one is the most sensible IMO, but a bit longer)

TBH I always found this enum set definitions a bit weirdly structured and lacking in some places, but it is hard to see the underlying structure immediately from the code, it must be an incremental improvement

nimskull/compiler/ast/ast_query.nim

Line 687 in 837238f

proc `[]`*(node: PNode, pos: NodePosName): PNode =

I think those two must be intrinsically related to each other (the collection of the enum kind sets and the [] named indexing operator), but for now that's just a new idea I got while elaborating on the "literals/kinds" distinction above

For nkFloatLiterals vs nkFloatKinds, the former is literals only and the latter might include more, but I can't think of a single thing not covered by the former. The is a difference in meaning but I'm really hard pressed to think of a case where they would not be equivalent. I would check the sites and consolidate personally.

nimskull/compiler/ast/ast_query.nim

Line 687 in 837238f

proc `[]`*(node: PNode, pos: NodePosName): PNode =

I think those two must be intrinsically related to each other (the collection of the enum kind sets and the [] named indexing operator), but for now that's just a new idea I got while elaborating on the "literals/kinds" distinction above

That's exactly the types of insights that form by looking at this stuff.

I agree with the sentiment, the naming and categorization needs to make sense based on purpose and not solely because a new programmer's first instincts are to lump them together.

So to recap:

naming needs to be more consistent

naming needs to better convey intention

sets need to serve a purpose(s), eg: many sites in the compiler need to handle all literals one way (nkLiterals) or lots of literal handling is weird about floats vs other numbers (nkFloatLiterals)

Doesn't have to be those precise names but I think we're on the same page. For actually changing sets carefully looking through their usage is required because there are unfortunate subtleties.

compiler/ast/parser.nim

zerbina

I answered a question and left some small suggestions regarding code style.

compiler/ast/ast_parsed_types.nim

compiler/ast/parser.nim

compiler/sem/passes.nim

haxscramper · 2022-09-01T16:42:38Z

I think I got them all, the PR is already really starting to shape up.

@saem I think the optimal plan would be to split in following steps (PRs)

Clean up the parser implementation - this PR. It introduces another slowdown of the parsing stage, but not too drastic, and immediately enables decoupling.
Throw away nimpretty hacks - `nimpretty` #113
Transition lexer to use DOD approach - I'm thinking of a global FileIndex -> seq[Token] mapping that is generated by the lexer. Each token includes a line, a column and a tag information. And an extent range.

ParsedNode includes (FileIndex, TokenIndex) pair that is used to get the token back from the main storage
No tokens are discarded from the tokenizer - Non-documentation comments, obviously included. I'm not exactly sure about the INDENT/DEDENT/SAME-INDENT token set, but I think we can also retain them.

Change the ParsedNode to use a DOD structure as well

saem · 2022-09-01T19:34:58Z

I think I got them all, the PR is already really starting to shape up.

@saem I think the optimal plan would be to split in following steps (PRs)

This does look better and I'm sure it'll be revised more as we continue, let's do it. Additional benefit, smaller chunks, easier review, testing, etc.

Clean up the parser implementation - this PR. It introduces another slowdown of the parsing stage, but not too drastic, and immediately enables decoupling.

Cool, that's what I was thinking things are shaping up around.

Throw away nimpretty hacks - nimpretty #113

Yes, less hacks and more Hax. :D

Transition lexer to use DOD approach - I'm thinking of a global FileIndex -> seq[Token] mapping that is generated by the lexer. Each token includes a line, a column and a tag information. And an extent range.

Yup, will need to support:

lex me a string (for reasons we have a fragment of code)
lex me a string which is a fragment of a file

But I'm assuming that's implied. The former is because we might be asked to handle various code fragments like repls and cases where we don't have a file perse. The latter is where we are formatting a subsection of some file. The amount of effort and support needed here might be near zero. Just highlighting two interesting cases besides the majority case of program/file(s) compilation.

ParsedNode includes (FileIndex, TokenIndex) pair that is used to get the token back from the main storage

Yup, then it would only need the left and right index and probably be done.

No tokens are discarded from the tokenizer - Non-documentation comments, obviously included. I'm not exactly sure about the INDENT/DEDENT/SAME-INDENT token set, but I think we can also retain them.

Based on what I could glean from the paper and @alaviss 's remarks I think knowing this is important for these reasons:

need indent level (logical) for preserving meaning
need actual spaces (physical) for spans, spaces for alignment, run length encoding/compressing the stream

Change the ParsedNode to use a DOD structure as well

Yup, I think it'll be easier if the lexer has had a once over already.

haxscramper · 2022-09-02T16:53:44Z

Step #5 is to add reduced parsed node set - based on the TNodeKind, but without sem-related data. I initially forgot about that part - while ParsedNode and PNode are structurally similar, the latter one has a lot more states.

nkError probably should stay if we want to get parser error correction in later.

compiler/ast/ast.nim

compiler/sem/passes.nim

compiler/ast/parser.nim

compiler/ast/ast_parsed_types.nim

compiler/sem/passes.nim

compiler/ast/parser.nim

saem

Did a pass on all the comments.

compiler/ast/ast.nim

compiler/ast/ast_parsed_types.nim

+
+    # HACK explicit flags in order to track down all 'extra' information
+    # that is collected during parsing.
+    isBlockArg*: bool # QUESTION add 'nkStmtListBlockArg' or similar node


compiler/ast/ast_parsed_types.nim

compiler/sem/passes.nim

compiler/sem/pragmas.nim

The main change - introduce a `ParsedNode` type which replaces `PNode` in the parser. This change allows for further work on decoupling `sem` from other parts of the compiler, making it easier to implement improvements in a way that would not rip through the whole codebase and test suite. Right now introduced type closely mimics the `PNode` counterpart, but this is just a temporary measure for the transition period. This commit is a part of multi-step series - full list can be seen in the related issue nim-works#423 * Documentation changes - Add missing documentation for changes in the earlier commit, add more how-tos to the debugging section (I haven't coded in a while, so was especially important to write down explanations for anything I had trouble with) nim-works@602367b * Tangentially related refactoring work - Cleanup the `passes.nim` implementation a bit - despite common (at least seemingly shared by many of the previous authors of the codebase) misconception longer variable names actually *do* increase readability. Also infamous recommendations for the "structured programming" also do not really mesh with proliferation of `break` statements in the code. Add todo/bug comment for the main processing loop bug related to the phase ordering in `compiler/sem/passes.nim:234` * Debugging tools improvements - Implement `astrepr.nim` support for the `ParsedNode` and `PIdent` - `debug` and `treeRepr` procedures. - Allow skipping repeated symbol in the `(open|closed)SymChoice` node kinds in the `astrepr` - Restructure imports of the `astepr` and move it closer to the 'primitive' modules - type definitions and trivial data queries. The most important change is removal of the `ast.nim` and `renderer.nim` imports, which opens these modules for debugging as well. - Consider possibility of a nil `owner` in the symbol owner chain representation calculations in `astrepr` - Semantic tracer debug output file rotation now uses location of the first `.define(` call as a file name base instead of integer-based ones. Added basic logging information about created files - now a developer can see what is going on and what gets written. For example, running with `--define=nimCompilerDebugTraceDir=/tmp` and seveal `define(...)` sections produces the following output: ``` comparisons.nim(269, 8): opening /tmp/comparisons_nim_0 trace comparisons.nim(274, 7): closing trace, wrote 44 records comparisons.nim(276, 8): opening /tmp/comparisons_nim_1 trace comparisons.nim(285, 7): closing trace, wrote 329 records ``` - Simplify implementation of the `reportInst` handling in the debug utils tracer - now each toplevel tracer template must submit the location by itself - this solution avoids unintuitive and fragile `instLoc(-5)` call which might break with more templates introduced. Also updated documentation on the `reportInst` and `reportFrom` in the reports file. - compiler/front/options.nim:693 :: Unconditionally output debugging traces if they are requested, regardless of the surrounding hooks and filters. Introduce the `bypassWriteHookForTrace` flag in the debugging hack controller which makes it possible to bypass the `writeln` hook. * Further work - compiler/ast/parser.nim:744 :: introduce two tokens in order to handle custom literals. There is no real need to mash together everything in a single chunk of text that would have to be split apart down the line.

saem · 2022-09-04T00:19:20Z

bors r+

saem · 2022-09-04T00:21:22Z

Merging now, @haxscramper have a look at the unresolved comment related suggestions.

It might be easiest to simply use the same branch, apply the suggestions, and then do an rebase onto devel.

bors · 2022-09-04T01:19:59Z

Build succeeded:

follow-up PR for review comments from the nim-works#425

428: parser refactor follow-up code style fixes r=haxscramper a=haxscramper follow-up PR for review comments from the #425 Co-authored-by: haxscramper <[email protected]>

haxscramper marked this pull request as draft September 1, 2022 10:07

haxscramper force-pushed the parsed-node-refactor branch from 0ca4f69 to 8bb510d Compare September 1, 2022 10:31

haxscramper marked this pull request as ready for review September 1, 2022 10:31

haxscramper force-pushed the parsed-node-refactor branch from 8bb510d to dbe2edd Compare September 1, 2022 10:36

haxscramper requested a review from saem September 1, 2022 10:36

saem reviewed Sep 1, 2022

View reviewed changes

zerbina reviewed Sep 1, 2022

View reviewed changes

compiler/ast/ast_parsed_types.nim Show resolved Hide resolved

compiler/ast/parser.nim Outdated Show resolved Hide resolved

compiler/sem/passes.nim Outdated Show resolved Hide resolved

haxscramper force-pushed the parsed-node-refactor branch 4 times, most recently from d05b275 to 601b71c Compare September 2, 2022 20:50

haxscramper mentioned this pull request Sep 2, 2022

Decouple parser from PNode with DOD approach #423

Open

5 tasks

haxscramper force-pushed the parsed-node-refactor branch from 601b71c to bb16d0a Compare September 2, 2022 21:12

saem reviewed Sep 2, 2022

View reviewed changes

compiler/ast/ast.nim Show resolved Hide resolved

haxscramper force-pushed the parsed-node-refactor branch 6 times, most recently from c9f7f87 to 50a7ab1 Compare September 3, 2022 15:17