Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decouple parser from PNode #425

Merged
merged 1 commit into from
Sep 4, 2022
Merged

Conversation

haxscramper
Copy link
Collaborator

wip implementation of the #423

@haxscramper haxscramper marked this pull request as draft September 1, 2022 10:07
@haxscramper haxscramper marked this pull request as ready for review September 1, 2022 10:31
@haxscramper haxscramper requested a review from saem September 1, 2022 10:36
@haxscramper
Copy link
Collaborator Author

@saem this is a first pass on the implementation, which uncovered some implementation questions that I wrote down in changes - look them over and tell me about your thoughts on direction/solution. I'm considering whether I should also include the change to the DOD lexer in this PR or split it into a separate subsequent one.

Copy link
Collaborator

@saem saem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got them all, the PR is already really starting to shape up.

# consistently copies this information around anyway,
# so I will let it stay this way for now.
token*: Token # TODO Replace full token value with an index information
kind*: TNodeKind # NOTE/QUESTION - for now the same kind of nodes is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, or at least one will be a super set of the other. Ultimately separate enums are likely best.


# HACK explicit flags in order to track down all 'extra' information
# that is collected during parsing.
isBlockArg*: bool # QUESTION add 'nkStmtListBlockArg' or similar node
Copy link
Collaborator

@saem saem Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You found it too!

OK, so this dumb thing is used to tell the semantic analysis layer about a statement list (possibly also block) that are being passed to a macro, template, whatever. During semantic analysis we collapse/flatten nested statement lists into one level, but we also do statement list unwrapping, so if there is only one node in a statement list we unwrap it except if there is a defer (I think this might be a bug/design flaw).

That first flattening is guarded by this dumb flag. I personally think we should always flatten, no matter what, but never unwrap in semStmtList (it's the last/second last proc in semstmts, I rewrote it recently so it should be readable). Without that guard a macro can receive arbitrary chunks of AST not always wrapped in a statement list and it makes the really annoying/painful to write.

I do believe between fixing the unpacking logic and the AST the parser generates this shouldn't be a problem. My guess at the most correct fix is sematic analysis layer should:

  • unwrap as determined by the receiving node, so be done on add/index assign based on receiving node kind
  • flattening should always take place in semantic analysis of a statement list because that's just always right

Then this dumb guard should not be necessary. Separating the AST used by sem, parsing, and backend means we can change add and index assign behavior and it only impacts sem so we can do destination based unwrapping rather than have it live incorrectly in semStmtList. 🎉

Sorry that was a rather lengthy explanation.

This comment was marked as resolved.

@@ -97,6 +97,8 @@ const
callableDefs* = nkLambdaKinds + routineDefs

nkSymChoices* = {nkClosedSymChoice, nkOpenSymChoice}
nkFloatKinds* = nkFloatLiterals # QUESTION remove float literals
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, but maybe I'm missing some key point? Whatchya thinking?

Copy link
Collaborator Author

@haxscramper haxscramper Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have node kinds for everything and only float literals. It is a naming question, and it is a set of node kinds, not a set of literals. Also nkLiterals is inconsistent with Atomic kinds in the macros.nim, but that's another question. I think nkLiteralKinds, nkFloatKinds or even nkFloatLiteralKinds make more sense (the last one is the most sensible IMO, but a bit longer)

TBH I always found this enum set definitions a bit weirdly structured and lacking in some places, but it is hard to see the underlying structure immediately from the code, it must be an incremental improvement

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proc `[]`*(node: PNode, pos: NodePosName): PNode =
I think those two must be intrinsically related to each other (the collection of the enum kind sets and the [] named indexing operator), but for now that's just a new idea I got while elaborating on the "literals/kinds" distinction above

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nkFloatLiterals vs nkFloatKinds, the former is literals only and the latter might include more, but I can't think of a single thing not covered by the former. The is a difference in meaning but I'm really hard pressed to think of a case where they would not be equivalent. I would check the sites and consolidate personally.


proc `[]`*(node: PNode, pos: NodePosName): PNode =

I think those two must be intrinsically related to each other (the collection of the enum kind sets and the [] named indexing operator), but for now that's just a new idea I got while elaborating on the "literals/kinds" distinction above

That's exactly the types of insights that form by looking at this stuff.

I agree with the sentiment, the naming and categorization needs to make sense based on purpose and not solely because a new programmer's first instincts are to lump them together.

So to recap:

  • naming needs to be more consistent
  • naming needs to better convey intention
  • sets need to serve a purpose(s), eg: many sites in the compiler need to handle all literals one way (nkLiterals) or lots of literal handling is weird about floats vs other numbers (nkFloatLiterals)

Doesn't have to be those precise names but I think we're on the same page. For actually changing sets carefully looking through their usage is required because there are unfortunate subtleties.

compiler/ast/parser.nim Show resolved Hide resolved
Copy link
Collaborator

@zerbina zerbina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I answered a question and left some small suggestions regarding code style.

compiler/ast/ast_parsed_types.nim Show resolved Hide resolved
compiler/ast/parser.nim Outdated Show resolved Hide resolved
compiler/sem/passes.nim Outdated Show resolved Hide resolved
@haxscramper
Copy link
Collaborator Author

haxscramper commented Sep 1, 2022

I think I got them all, the PR is already really starting to shape up.

@saem I think the optimal plan would be to split in following steps (PRs)

  1. Clean up the parser implementation - this PR. It introduces another slowdown of the parsing stage, but not too drastic, and immediately enables decoupling.
  2. Throw away nimpretty hacks - `nimpretty` #113
  3. Transition lexer to use DOD approach - I'm thinking of a global FileIndex -> seq[Token] mapping that is generated by the lexer. Each token includes a line, a column and a tag information. And an extent range.
  • ParsedNode includes (FileIndex, TokenIndex) pair that is used to get the token back from the main storage
  • No tokens are discarded from the tokenizer - Non-documentation comments, obviously included. I'm not exactly sure about the INDENT/DEDENT/SAME-INDENT token set, but I think we can also retain them.
  1. Change the ParsedNode to use a DOD structure as well

@saem
Copy link
Collaborator

saem commented Sep 1, 2022

I think I got them all, the PR is already really starting to shape up.

@saem I think the optimal plan would be to split in following steps (PRs)

This does look better and I'm sure it'll be revised more as we continue, let's do it. Additional benefit, smaller chunks, easier review, testing, etc.

  1. Clean up the parser implementation - this PR. It introduces another slowdown of the parsing stage, but not too drastic, and immediately enables decoupling.

Cool, that's what I was thinking things are shaping up around.

  1. Throw away nimpretty hacks - nimpretty #113

Yes, less hacks and more Hax. :D

  1. Transition lexer to use DOD approach - I'm thinking of a global FileIndex -> seq[Token] mapping that is generated by the lexer. Each token includes a line, a column and a tag information. And an extent range.

Yup, will need to support:

  • lex me a string (for reasons we have a fragment of code)
  • lex me a string which is a fragment of a file

But I'm assuming that's implied. The former is because we might be asked to handle various code fragments like repls and cases where we don't have a file perse. The latter is where we are formatting a subsection of some file. The amount of effort and support needed here might be near zero. Just highlighting two interesting cases besides the majority case of program/file(s) compilation.

  • ParsedNode includes (FileIndex, TokenIndex) pair that is used to get the token back from the main storage

Yup, then it would only need the left and right index and probably be done.

  • No tokens are discarded from the tokenizer - Non-documentation comments, obviously included. I'm not exactly sure about the INDENT/DEDENT/SAME-INDENT token set, but I think we can also retain them.

Based on what I could glean from the paper and @alaviss 's remarks I think knowing this is important for these reasons:

  • need indent level (logical) for preserving meaning
  • need actual spaces (physical) for spans, spaces for alignment, run length encoding/compressing the stream
  1. Change the ParsedNode to use a DOD structure as well

Yup, I think it'll be easier if the lexer has had a once over already.

@haxscramper
Copy link
Collaborator Author

haxscramper commented Sep 2, 2022

Step #5 is to add reduced parsed node set - based on the TNodeKind, but without sem-related data. I initially forgot about that part - while ParsedNode and PNode are structurally similar, the latter one has a lot more states.

nkError probably should stay if we want to get parser error correction in later.

@haxscramper haxscramper force-pushed the parsed-node-refactor branch 4 times, most recently from d05b275 to 601b71c Compare September 2, 2022 20:50
@haxscramper haxscramper force-pushed the parsed-node-refactor branch 6 times, most recently from c9f7f87 to 50a7ab1 Compare September 3, 2022 15:17
compiler/ast/parser.nim Outdated Show resolved Hide resolved
@haxscramper haxscramper force-pushed the parsed-node-refactor branch 4 times, most recently from 7969466 to 98d807a Compare September 3, 2022 22:04
compiler/ast/parser.nim Outdated Show resolved Hide resolved
Copy link
Collaborator

@saem saem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a pass on all the comments.

compiler/ast/ast.nim Show resolved Hide resolved
compiler/ast/ast_parsed_types.nim Show resolved Hide resolved
compiler/ast/ast_parsed_types.nim Show resolved Hide resolved

# HACK explicit flags in order to track down all 'extra' information
# that is collected during parsing.
isBlockArg*: bool # QUESTION add 'nkStmtListBlockArg' or similar node

This comment was marked as resolved.

compiler/ast/ast_parsed_types.nim Show resolved Hide resolved
compiler/sem/passes.nim Show resolved Hide resolved
compiler/sem/passes.nim Show resolved Hide resolved
compiler/sem/passes.nim Show resolved Hide resolved
compiler/sem/passes.nim Show resolved Hide resolved
compiler/sem/pragmas.nim Show resolved Hide resolved
The main change - introduce a `ParsedNode` type which replaces `PNode`
in the parser. This change allows for further work on decoupling `sem`
from other parts of the compiler, making it easier to implement
improvements in a way that would not rip through the whole codebase and
test suite. Right now introduced type closely mimics the `PNode`
counterpart, but this is just a temporary measure for the transition
period. This commit is a part of multi-step series - full list can be
seen in the related issue nim-works#423

* Documentation changes

- Add missing documentation for changes in the earlier commit, add more
  how-tos to the debugging section (I haven't coded in a while, so was
  especially important to write down explanations for anything I had
  trouble with)
  nim-works@602367b

* Tangentially related refactoring work

- Cleanup the `passes.nim` implementation a bit - despite common (at
  least seemingly shared by many of the previous authors of the
  codebase) misconception longer variable names actually *do* increase
  readability. Also infamous recommendations for the "structured
  programming" also do not really mesh with proliferation of `break`
  statements in the code.

  Add todo/bug comment for the main processing loop bug related to the
  phase ordering in `compiler/sem/passes.nim:234`

* Debugging tools improvements

- Implement `astrepr.nim` support for the `ParsedNode` and `PIdent` -
  `debug` and `treeRepr` procedures.
- Allow skipping repeated symbol in the `(open|closed)SymChoice` node
  kinds in the `astrepr`
- Restructure imports of the `astepr` and move it closer to the
  'primitive' modules - type definitions and trivial data queries. The
  most important change is removal of the `ast.nim` and `renderer.nim`
  imports, which opens these modules for debugging as well.
- Consider possibility of a nil `owner` in the symbol owner chain
  representation calculations in `astrepr`
- Semantic tracer debug output file rotation now uses location of the
  first `.define(` call as a file name base instead of integer-based
  ones. Added basic logging information about created files - now a
  developer can see what is going on and what gets written.

  For example, running with `--define=nimCompilerDebugTraceDir=/tmp` and
  seveal `define(...)` sections produces the following output:

  ```
  comparisons.nim(269, 8): opening /tmp/comparisons_nim_0 trace
  comparisons.nim(274, 7): closing trace, wrote 44 records
  comparisons.nim(276, 8): opening /tmp/comparisons_nim_1 trace
  comparisons.nim(285, 7): closing trace, wrote 329 records
  ```
- Simplify implementation of the `reportInst` handling in the debug
  utils tracer - now each toplevel tracer template must submit the
  location by itself - this solution avoids unintuitive and fragile
  `instLoc(-5)` call which might break with more templates introduced.
  Also updated documentation on the `reportInst` and `reportFrom` in the
  reports file.

- compiler/front/options.nim:693 :: Unconditionally output debugging
  traces if they are requested, regardless of the surrounding hooks and
  filters. Introduce the `bypassWriteHookForTrace` flag in the debugging
  hack controller which makes it possible to bypass the `writeln` hook.

* Further work

- compiler/ast/parser.nim:744 :: introduce two tokens in order to handle
  custom literals. There is no real need to mash together everything in
  a single chunk of text that would have to be split apart down the
  line.
@saem
Copy link
Collaborator

saem commented Sep 4, 2022

bors r+

@saem
Copy link
Collaborator

saem commented Sep 4, 2022

Merging now, @haxscramper have a look at the unresolved comment related suggestions.

It might be easiest to simply use the same branch, apply the suggestions, and then do an rebase onto devel.

@bors
Copy link
Contributor

bors bot commented Sep 4, 2022

Build succeeded:

@bors bors bot merged commit f4b8323 into nim-works:devel Sep 4, 2022
haxscramper added a commit to haxscramper/nimskull that referenced this pull request Sep 4, 2022
follow-up PR for review comments from the
nim-works#425
bors bot added a commit that referenced this pull request Sep 4, 2022
428: parser refactor follow-up code style fixes r=haxscramper a=haxscramper

follow-up PR for review comments from the
#425



Co-authored-by: haxscramper <[email protected]>
@haxscramper haxscramper added this to the Parser phase refactoring milestone Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants