While reviewing and testing out the new formatting functionality (#291) I ran into a number of issues related to comments. There are many places where comments in the input VCL are lost during parsing and, as a consequence, are missing from the formatted output. While this is mostly relevant to the formatting efforts, I'm making it a separate issue since at its root this is a parser problem.
To use an over the top example, for the following VCL input:
In this case only comments 1 and 6 are captured during parsing and a print of this node would yield
/* 1 */
declare local var.s STRING; /* 6 */
There are many other valid comment positions that get lost during parsing: before semicolons, trailing comments at the end of files, etc. Some of the loss is only in the printing: the comments are attached to nodes, but the printing logic isn't emitting them. Others are lost completely, when tokens are read without a node being created for them.
TL;DR
We're currently losing comments but our existing comment attachment points should be flexible enough to correctly capture all possible comments. To do so will require tweaks to the parser to ensure all of them are correctly attached to their relevant node. Will submit a PR with a starting point implementation for discussion.
Comment attachment point analysis
Currently our AST node meta struct can handle a maximum of 3 comment attachment points (leading, infix, trailing), so it is important to confirm that this is enough to handle all cases. The following is an attempt to cover the VCL syntax, verify that our current model is sufficient to represent all comment positions, and identify which node and which position each comment location should be attached to.
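To make the three-point model concrete, here is a minimal sketch in Go. The type and field names (`Meta`, `Comments`, `Ident`) are assumptions for illustration and are not copied from the project's actual `ast` package.

```go
package main

import (
	"fmt"
	"strings"
)

// Comments is a list of raw comment texts, e.g. "/* 1 */".
type Comments []string

// Meta is a hypothetical stand-in for the AST node meta struct.
// It has exactly three comment attachment points.
type Meta struct {
	Leading  Comments // comments before the node
	Infix    Comments // comments inside the node
	Trailing Comments // comments after the node
}

// Ident is a minimal node that carries only a name and its Meta.
type Ident struct {
	Meta
	Value string
}

// String prints the node together with its leading and trailing
// comments. An ident has no interior, so Infix is not printed here.
func (i *Ident) String() string {
	parts := []string{}
	parts = append(parts, i.Leading...)
	parts = append(parts, i.Value)
	parts = append(parts, i.Trailing...)
	return strings.Join(parts, " ")
}

func main() {
	id := &Ident{Value: "var.s"}
	id.Leading = append(id.Leading, "/* a */")
	id.Trailing = append(id.Trailing, "/* b */")
	fmt.Println(id.String()) // prints: /* a */ var.s /* b */
}
```

The point of the sketch is that a comment survives formatting only if the parser puts it in one of these slots and the printer emits that slot.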
Position markers:
<start> - comment immediately preceding the node
<end> - comment immediately following the node
<after_keyword> - comment immediately following a keyword
<after_bracket> - comment immediately following an opening bracket: ( or {
The function expression AST node stores its arguments in a slice, so when the expression has no arguments there is nowhere to attach comments within the argument list. In this case the comment could be attached to the function expression itself.
<x> foo <x> (<after_bracket>) <end>
attachment points: 2
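A rough sketch of that fallback, in Go. The `Call` and `Expr` types and the choice of an `Infix` slot on the call node are assumptions made for illustration; the issue only says the comment could be attached to the function expression, not which attachment point would be used.

```go
package main

import "fmt"

// Expr is a minimal argument expression with leading comments.
type Expr struct {
	Leading []string
	Text    string
}

// Call is a hypothetical function call node. Args is a slice,
// so an empty argument list has no child node to hold a comment.
type Call struct {
	Name  string
	Args  []*Expr
	Infix []string // assumed slot for comments inside an empty "()"
}

// attachAfterBracket places comments found right after "(".
// With arguments, they become leading comments of the first argument.
// Without arguments, they fall back to the call node itself.
func attachAfterBracket(c *Call, comments []string) {
	if len(comments) == 0 {
		return
	}
	if len(c.Args) == 0 {
		c.Infix = append(c.Infix, comments...)
		return
	}
	first := c.Args[0]
	merged := append([]string{}, comments...) // copy to avoid aliasing
	first.Leading = append(merged, first.Leading...)
}

func main() {
	empty := &Call{Name: "foo"}
	attachAfterBracket(empty, []string{"/* c */"})
	fmt.Println(empty.Infix)

	withArg := &Call{Name: "foo", Args: []*Expr{{Text: "req.url"}}}
	attachAfterBracket(withArg, []string{"/* c */"})
	fmt.Println(withArg.Args[0].Leading)
}
```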
Others:
acl entries
<x> ip <x> ; <end>
<start> ! <x> ip <x> ; <end>
<x> ip <x> / <x> mask <x> ; <end>
<start> ! <x> ip <x> / <x> mask <x> ; <end>
table entries
<x> key <x> : <x> value <x>
<x> key <x> : <x> value <x> , <end>
In some cases there is ambiguity around which node a comment should be attached to.
Comment could be a trailing comment for 200 or a leading comment for "OK":
error 200 <x> "OK";
Comment could be an <after_keyword> comment for the error node or a leading comment for the status code:
error <x> 200;
For a single comment between a token and a value, as in the case above, I think it makes more sense to attach the comment to the value. That way the comment attachment logic for the statement doesn't need to worry about the comment, and it can be handled by the ident parsing. For the same reason, in situations like declare local var.s <x> STRING; attaching the comment as a trailing comment for var.s makes the most sense. There is no need to special-case the ident comment attachment logic, since no formatting operation would be affected by which node the comment is attached to.
When there are multiple comments between tokens, should they all be comments for 200, all be comments for "OK", or be split between them?
error 200 <x> <y> <z> "OK";
For single-line statements like this, I again think there is little value in trying to assign meaning to the attachment of each comment between idents, as there are no formatting rules that would be affected by it.
For a comment between two statements, should it be a trailing comment for the first statement or a leading comment for the next one?
log "foo";
<x>
log "bar";
Typically, comments on their own line are associated with the statement following them rather than the statement preceding them. So in situations like this the comment should be attached as a leading comment for the second log statement.
The same applies when multiple comments sit between two statements:
log "foo";
<x>
<y>
<z>
log "bar";
Again, for this case I would say the comments should be attached as leading comments for the second log statement.
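The rule for own-line comments between statements can be sketched as a small buffering loop. This is a simplified illustration in Go, not the parser's actual code: `Item`, `Stmt`, and `attachLeading` are hypothetical names, and the input is assumed to be already split into comments and complete statements.

```go
package main

import "fmt"

// Item is a hypothetical, simplified parser input:
// either a comment or a complete statement.
type Item struct {
	IsComment bool
	Text      string
}

// Stmt is a statement together with its leading comments.
type Stmt struct {
	Leading []string
	Text    string
}

// attachLeading buffers comments and attaches them as leading
// comments of the next statement. Comments with no following
// statement are returned separately as unbound.
func attachLeading(items []Item) (stmts []*Stmt, unbound []string) {
	var pending []string
	for _, it := range items {
		if it.IsComment {
			pending = append(pending, it.Text)
			continue
		}
		stmts = append(stmts, &Stmt{Leading: pending, Text: it.Text})
		pending = nil
	}
	return stmts, pending
}

func main() {
	stmts, unbound := attachLeading([]Item{
		{Text: `log "foo";`},
		{IsComment: true, Text: "# x"},
		{IsComment: true, Text: "# y"},
		{Text: `log "bar";`},
		{IsComment: true, Text: "# tail"},
	})
	for _, s := range stmts {
		fmt.Println(s.Leading, s.Text)
	}
	fmt.Println("unbound:", unbound)
}
```

In this sketch both `# x` and `# y` end up on the second log statement, and `# tail` is reported as unbound because no statement follows it.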
Some uncertainty does come up when dealing with comments at the declaration level:
sub foo () {}
<x>
<y>
<z>
sub bar () {}
The simplest solution would be to treat the comments as leading comments for the second function declaration.
Note: If declarations do not check for trailing comments, special-case handling for EOF would be needed to attach any remaining unbound comments to the last declaration node found in the source.
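One possible shape for that EOF special case, sketched in Go. `Decl` and `flushAtEOF` are hypothetical names, and attaching the leftovers as trailing comments is an assumption; the note above only says they should be attached to the last declaration.

```go
package main

import "fmt"

// Decl is a minimal declaration node with two comment slots.
type Decl struct {
	Name     string
	Leading  []string
	Trailing []string
}

// flushAtEOF attaches comments that are still unbound at EOF to the
// last declaration, as trailing comments. It reports false when there
// are comments but no declaration to hold them (a comment-only file).
func flushAtEOF(decls []*Decl, unbound []string) bool {
	if len(unbound) == 0 {
		return true
	}
	if len(decls) == 0 {
		return false
	}
	last := decls[len(decls)-1]
	last.Trailing = append(last.Trailing, unbound...)
	return true
}

func main() {
	decls := []*Decl{{Name: "foo"}, {Name: "bar"}}
	ok := flushAtEOF(decls, []string{"# end of file"})
	fmt.Println(ok, decls[1].Trailing)
}
```

A file that contains only comments has no declaration to attach to, which is why the sketch returns a boolean instead of silently dropping them.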
Conclusions
So far I have not found any syntax structure that cannot be represented with our current model of 3 comment attachment points.
This is a lot to get through and much of this is notes for implementing a fix.
I realized that the current comment parsing won't be enough.
Fortunately, the lexer already lexes every comment; it is the parser that drops them.
So I will change how comments are saved in the ast.Meta struct to map[position]ast.Comments, so that comments can be placed at any of the positions you describe.
The // d comment will be attached to the first statement inside the block statement (the same as biomejs).
The comment-position names can be arbitrary, chosen to suit each statement.
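As a rough illustration of the map-based idea, here is a minimal Go sketch. Only the `map[position]ast.Comments` shape comes from the comment above; the `Position` type, the position constants, and the `Attach` and `At` helpers are assumptions.

```go
package main

import "fmt"

// Position names a place where a comment can sit relative to a node.
// These names are hypothetical; each statement could define its own.
type Position string

const (
	Leading      Position = "leading"
	AfterKeyword Position = "after_keyword"
	Trailing     Position = "trailing"
)

// Comments is a list of raw comment texts.
type Comments []string

// Meta stores comments keyed by position instead of in three
// fixed slots, so a node can have any number of attachment points.
type Meta struct {
	Comments map[Position]Comments
}

// Attach records a comment at the given position,
// creating the map lazily on first use.
func (m *Meta) Attach(p Position, c string) {
	if m.Comments == nil {
		m.Comments = make(map[Position]Comments)
	}
	m.Comments[p] = append(m.Comments[p], c)
}

// At returns the comments at a position, or nil when there are none.
func (m *Meta) At(p Position) Comments {
	return m.Comments[p]
}

func main() {
	var m Meta
	m.Attach(Leading, "/* 1 */")
	m.Attach(AfterKeyword, "/* 2 */")
	m.Attach(AfterKeyword, "/* 3 */")
	fmt.Println(m.At(Leading), m.At(AfterKeyword), m.At(Trailing))
}
```

Compared with three fixed fields, a map trades compile-time checking of position names for the freedom to add positions per node type.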
Comment attachment point analysis
Position markers:
<start> - comment immediately preceding the node
<end> - comment immediately following the node
<after_keyword> - comment immediately following a keyword
<after_bracket> - comment immediately following an opening bracket: ( or {
<x> - comment is attached to a child node
Declarations:
Untyped declarations: (acl, table (implicit string), backend, ...)
attachment points: 3
Typed declarations: (table, sub)
attachment points: 3
Statements:
Single token statements: return (bare), break, fallthrough, esi, ...
attachment points: 3
Two token statements: return (value/state), log, unset ...
attachment points: 2
Three token statements: error
attachment points: 2
Four token statements: declare, set
attachment points: 3
attachment points: 2
Block statements:
attachment points: 1
attachment points: 3
Labels:
single token labels: (goto label)
attachment points: 1
keyword label: (default:)
case labels:
attachment points: 2
Expressions:
Ident:
attachment points: 2
Unary operator:
attachment points: 1
Binary operator:
attachment points: 0
Grouped expressions:
attachment points: 2
Function call expression:
attachment points: 1
Function expression AST node stores arguments in a slice so when the expression has no arguments there is nowhere to attach the comments within the argument list. In this case the comment could be attached to the function expression.
attachment points: 2
Others:
acl entries
table entries
backend / director properties