Optionally add nodes for whitespace #41
Comments
I think having whitespace information available would be awesome for a lot of applications. But I can't add whitespace nodes into the current AST structure. To preserve all whitespace, comments and formatting one would need a concrete syntax tree, which would basically be a tree that tags stuff on the token stream. For "<?php\n echo 1 + 2;" it would be something like this:
This is a completely different structure and would probably require a lot of work to create. (I've wanted to look into this for a while already, but never got around to doing it :/ ) Another interesting idea is what @schmittjoh implemented in https://github.com/schmittjoh/php-manipulator: It has the token stream and the AST as separate structures with links between them. I'll look a bit closer at that and see if it can be integrated with the parser. |
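As a rough illustration of the token stream such a concrete syntax tree would be layered on, the following sketch uses PHP's built-in `token_get_all()` on the same example string. It is not the structure proposed in the comment above; it only shows that the tokenizer keeps whitespace as `T_WHITESPACE` tokens, so joining the token texts reproduces the source exactly. The `CHAR` label for single-character tokens is just a display name chosen for this sketch.

```php
<?php
// Tokenize the example from the comment above; PHP's tokenizer keeps whitespace.
$code = "<?php\n echo 1 + 2;";
$tokens = token_get_all($code);

$rebuilt = '';
foreach ($tokens as $token) {
    // Array tokens are [id, text, line]; single-character tokens are plain strings.
    $name = is_array($token) ? token_name($token[0]) : 'CHAR';
    $text = is_array($token) ? $token[1] : $token;
    printf("%-14s %s\n", $name, json_encode($text));
    $rebuilt .= $text;
}

// Because nothing is dropped, joining the token texts gives back the source.
var_dump($rebuilt === $code);
```

A tree that wants to preserve formatting only has to record which slice of this stream each node covers.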
Correct me if I'm totally wrong here, but isn't the main problem that the whitespace information just doesn't get incorporated into the AST structure? I think, for example, fabpot's use case is to correct indentation etc. But for this, isn't the only requirement that the whitespace information is attached to the current elements we have in the AST? like:
The whitespace information could be attached in a way that makes "most sense". I admit that this is the big part of the work. I used the convention of attaching the inline whitespace to the node after it rather than the node before it (this is more consistent with indentation counting), but the last whitespace would be lost. |
Why not store whitespace information in the existing nodes as attributes? |
that is what i wanted to say :) |
@pscheit @schmittjoh Storing whitespace in the node attributes would come with the same problems that currently exist for comments, just worse. For comments I currently simply store all comments that occur before the node. That works in most cases, but not always (e.g. issue #36 and #37). For whitespace it's a lot more problematic, because unlike comments whitespace usually also occurs within the node and not just before it. So for |
You would need to create a custom format per node, but a separate structure makes all code that is written for the current structure incompatible which would be a huge drawback from my pov at least. |
if whitespace before and whitespace after and line breaks would be captured for every node, the indentation could be computed. This would store the same whitespace twice in a node. phptag: after-whitespace: eol etc |
Well, extending the lexer to handle tokenization of whitespace and comments isn't too hard. |
@schmittjoh @pscheit Having a custom format for every node would cause a lot of work both in the implementation and in later use, because every node would require special handling. That's why I currently lean towards the approach where just the offset into the token stream is saved in the nodes. So every node would have a startOffset and endOffset attribute, which could be used to look up the tokens it is composed of in the token stream. This is something that can be done automatically and the mechanisms for it are already in place. I quickly tried this out and it basically seems to work, though it looks like I sometimes get an off-by-one error for the end token (probably some bug in the handling for the end attributes). I think this approach is very nice because it cleanly separates the abstract syntax tree and its concrete formatting. What will be a bit tricky about this is to figure out how to properly pretty print a partially changed AST. One somehow has to figure out which parts were changed and pretty print only those. And there too one would probably try to keep as much of their content in the original form. |
sounds good, at least something i can live with :) because my current use case would really be just a partly changed AST (and i know exactly the parts, i have changed). I'm looking forward to this! |
This sounds very similar to the approach that I follow in php-manipulator. The usage there is basically to jump to an AST node that you are interested in rewriting, at which point you would switch to the token stream, rewrite the node, and then switch back to the AST stream. If you can provide some mechanisms for mapping AST nodes to tokens, that would help me to simplify that code. |
@schmittjoh I just fixed a small bug that was getting in the way of properly doing the mapping (cdbad02), but it should be as easy as using a custom lexer that specifies the necessary attributes:
<?php
class LexerWithTokenOffsets extends PHPParser_Lexer {
    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);
        // Record the lexer's current token position as both the start and end offset.
        $startAttributes['startOffset'] = $endAttributes['endOffset'] = $this->pos;
        return $tokenId;
    }
}
From a few quick tests this works fairly well, though there will be some rare cases where this will be off (those can be fixed without much effort). |
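To make the use of such offsets concrete, here is a minimal sketch of looking up a node's original text from the token stream. It assumes the offsets are inclusive indices into the array returned by `token_get_all()`; `textForTokenRange()` is a hypothetical helper, and the offsets 4 and 8 are hand-picked for the example rather than read from real node attributes.

```php
<?php
/**
 * Return the original source text covered by a node, given the token stream
 * and the node's inclusive start/end token offsets (hypothetical helper).
 */
function textForTokenRange(array $tokens, int $startOffset, int $endOffset): string
{
    $text = '';
    for ($i = $startOffset; $i <= $endOffset; $i++) {
        // Array tokens carry their text at index 1; other tokens are plain strings.
        $text .= is_array($tokens[$i]) ? $tokens[$i][1] : $tokens[$i];
    }
    return $text;
}

$tokens = token_get_all("<?php\n echo 1 + 2;");
// Offsets 4..8 are hand-picked to cover the tokens of `1 + 2`;
// in practice they would come from the node's attributes.
$original = textForTokenRange($tokens, 4, 8);
echo $original, "\n";
```

Since whitespace tokens sit inside the range, the spacing around `+` comes back exactly as it was written.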
This would be indeed very helpful for all sorts of transformations that can’t use Generation Gap. |
My initial attempt at a pretty printer that tries to preserve the formatting: https://gist.github.com/4365484 It turned out to be really hard to do this (the main issue being indentation) and the current version still doesn't do a particularly good job. @lstrojny What is Generation Gap? |
Generation Gap (using subclasses as a means to split generated code from non-generated code): http://martinfowler.com/dslCatalog/generationGap.html My point was: if I can’t do that (e.g. https://github.com/InterNations/ExceptionBundle) because I need to edit existing code, I need to preserve formatting and whitespace and everything else. Otherwise we’ll have unreadable changesets after applying code transformations. |
I attempted to write a whitespace preserving prettyPrinter as well. I did it the other way round: The pretty printer replaces only the offsets for tokens that are changed in this array. After that the array is imploded to php code again. That helps a lot while finding gaps in the token stream and is easier to debug. |
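The replace-then-implode idea described above can be sketched roughly as follows. This is not the commenter's implementation: `replaceTokenRange()` is a hypothetical helper, the token range 4..8 is hand-picked, and the replacement text `3` stands in for whatever a pretty printer would emit for the changed node.

```php
<?php
/**
 * Replace the tokens in [$start, $end] (inclusive) with $newText and leave
 * every other token, including whitespace and comments, untouched.
 */
function replaceTokenRange(array $texts, int $start, int $end, string $newText): array
{
    $texts[$start] = $newText;
    for ($i = $start + 1; $i <= $end; $i++) {
        $texts[$i] = '';
    }
    return $texts;
}

$code = "<?php\n echo 1 + 2; // sum\n";
// Flatten the token stream to an array of token texts.
$texts = array_map(
    function ($t) { return is_array($t) ? $t[1] : $t; },
    token_get_all($code)
);
// Tokens 4..8 are `1 + 2`; swap them for the newly printed expression.
$texts = replaceTokenRange($texts, 4, 8, '3');
$newCode = implode('', $texts);
echo $newCode;
```

Blanking the rest of the range instead of removing entries keeps all other token indices stable, which is what makes several replacements in one pass easy to reason about.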
Is there any progress on this issue? I would be really interested in partially modifying ASTs and writing them back to files. |
+1 |
👍 |
To pretty print this code you need to go through all nodes and join their values.
This code will be represented as:
echo 1
+count( $items);
Whitespace goes inside the node. |
Hey, I'd also be very interested in a solution for this issue. I understand the actual structure is not made to support this, I just want you to know at least one more person is interested. :-) Have a great day, and thanks for your work! |
Any news for this issue? I wonder, will it be possible to use column+line information from each node instead of introducing a Whitespace node? Then printer will fill empty spaces, for example, for So, currently available attributes |
@lisachenko What I'm doing nowadays is use the AST to do analysis and figure out where things are, but do modifications on the source code (or the tokens) directly. You need some way to queue modifications (e.g. https://github.com/nikic/TypeUtil/blob/master/src/MutableString.php) and then this works pretty well for doing smaller changes based on startFilePos and endFilePos. Obviously this becomes a bit more tricky for larger changes. As to making use of file/token offsets in the pretty printer, I gave that a try some time ago (https://gist.github.com/anonymous/4365484 linked above). I remember that the implementation was nowhere near robust enough. With enough effort one can probably get it to work (but I won't be working on it). |
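The "queue modifications, then apply them to the source" approach can be sketched as below. This is a hypothetical `SourceEdits` class written for illustration, not the API of the linked MutableString. It assumes inclusive byte positions (in the spirit of startFilePos and endFilePos) and non-overlapping edits; the positions in the usage example were counted by hand.

```php
<?php
/**
 * Hypothetical edit queue: collects replacements by inclusive byte
 * positions in the ORIGINAL string and applies them all at once.
 */
final class SourceEdits
{
    private $source;
    private $edits = [];

    public function __construct(string $source)
    {
        $this->source = $source;
    }

    public function replace(int $startPos, int $endPos, string $newText): void
    {
        $this->edits[] = [$startPos, $endPos, $newText];
    }

    public function apply(): string
    {
        $edits = $this->edits;
        // Apply from the end of the file backwards so earlier positions stay valid.
        usort($edits, function ($a, $b) { return $b[0] <=> $a[0]; });
        $result = $this->source;
        foreach ($edits as [$start, $end, $text]) {
            $result = substr_replace($result, $text, $start, $end - $start + 1);
        }
        return $result;
    }
}

$edits = new SourceEdits("<?php\n echo 1 + 2;");
$edits->replace(12, 16, '3');     // `1 + 2` occupies bytes 12..16
$edits->replace(7, 10, 'print');  // `echo` occupies bytes 7..10
$result = $edits->apply();
echo $result;
```

Everything outside the replaced ranges is copied byte for byte, so formatting is preserved without the pretty printer being involved at all.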
@nikic thank you for sharing your thoughts, I'll give your pretty printer a try, it may suit my needs. Actually, I don't need exact formatting, only to preserve line numbers, otherwise the IDE shows crazy break points in the docblocks, statements and empty spaces ;) |
@nikic Maybe you should close this issue then and add some docs about how to solve this :) |
For our soft mocks project we implemented a hack for the stock pretty-printer that allowed us to mostly preserve all line numbers, except for all kinds of braces, because there literally is no line number information about these tokens. The first part of the code consists of the following and it should work with current versions without major issues:
The other part is removing all "\n" from pSmth() functions so that we manually control all whitespace characters. The problem, obviously, is with this "other part", because it requires writing fragile code that is just a copy of all parent functions. We could not just make calls to parent methods and remove all "\n"'s because the functions are recursive. So I would suggest adding some code to PHP-Parser that would allow some kind of line-preserving pretty-printing by adding the possibility to get rid of "\n"'s everywhere (e.g. move "\n" to a public property that could be redeclared to be "") |
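The general idea of line-preserving output can be illustrated with a small sketch. This is not the soft mocks code, which is not reproduced in this thread; `printPreservingLines()` is a hypothetical helper that takes already printed statements (with no "\n" of their own, as suggested above) together with the line each one started on in the original file.

```php
<?php
/**
 * Hypothetical line-preserving join: each statement is an already printed
 * snippet plus the line it started on in the original file.
 */
function printPreservingLines(array $stmts): string
{
    $out = '<?php';
    $currentLine = 1;
    foreach ($stmts as [$startLine, $code]) {
        if ($currentLine < $startLine) {
            // Pad with newlines until we reach the statement's original line.
            $out .= str_repeat("\n", $startLine - $currentLine);
            $currentLine = $startLine;
        } else {
            // Already at or past the target line: keep going on the same line.
            $out .= ' ';
        }
        $out .= $code;
        // Account for any newlines the snippet itself contains.
        $currentLine += substr_count($code, "\n");
    }
    return $out;
}

$printed = printPreservingLines([
    [2, 'echo 1;'],
    [5, 'echo 2;'],
]);
$lines = explode("\n", $printed);
echo $printed;
```

The printer controls every newline itself here, which is why the nested print functions must not emit their own.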
@nikic @YuriyNasretdinov All Java tools (for example, IDEA) have a special node type for storing information about whitespace, so it would be nice to follow this. But this should be an optional feature of the parser: to capture whitespace as AST nodes. This will increase the complexity of AST walking, but will give control over file positions for different elements. |
Just wondering if this issue has been addressed in the 3.x version, since all comments seem to be older than 6 months? |
@jails Nope, no changes here. |
So, I've given this problem (format-preserving pretty prints) another shot and I have a viable prototype now: https://github.com/nikic/PHP-Parser/compare/formatPreservingPrint (based on #322) The implementation does not add whitespace nodes in the AST, instead we try to reconstruct the original formatting based on token offset information. Here is a usage example:
use PhpParser\{Lexer, NodeTraverser, NodeVisitor, Parser, PrettyPrinter};

// The token positions are what the printer uses to find the original formatting.
$lexer = new Lexer\Emulative([
    'usedAttributes' => [
        'comments',
        'startLine', 'endLine',
        'startTokenPos', 'endTokenPos',
    ],
]);
$parser = new Parser\Php7($lexer, [
    'useIdentifierNodes' => true,
    'useConsistentVariableNodes' => true,
    'useExpressionStatements' => true,
    'useNopStatements' => false,
]);
// Clone the AST before modifying it, so the original stays available as a reference.
$traverser = new NodeTraverser();
$traverser->addVisitor(new NodeVisitor\CloningVisitor());
$printer = new PrettyPrinter\Standard();
// $code holds the PHP source to transform.
$oldStmts = $parser->parse($code);
$oldTokens = $lexer->getTokens();
$newStmts = $traverser->traverse($oldStmts);
// MODIFY $newStmts HERE
$newCode = $printer->printFormatPreserving($newStmts, $oldStmts, $oldTokens);
The important bits here are a) that you need to specify a bunch of non-standard options to avoid BC breaks until PHP-Parser 4.0, b) a CloningVisitor is run before any changes, so we retain the original AST as a reference and c) we also need the old tokens from the lexer. The whole thing isn't well tested yet, so there are likely going to be many issues depending on which part of the AST you try to modify. There are a couple of limitations as to where we can preserve formatting and where we can't:
|
Here is a complete example: https://gist.github.com/nikic/3229644ada5576622d7d538f6bff2098 |
Closing this in favor of #344, which contains remaining TODOs for this. |
Instead of create a new $lexer = new PhpParser\Lexer([ 'usedAttributes' => [ 'whitespaces' ]]);
|
I don't know if that's part of the scope of PHP-Parser, but it would be useful (at least for me as it would greatly simplify the code for my PHP CS fixer -- http://cs.sensiolabs.org/) to be able to have nodes for whitespace. This should probably be an option as most of the time, you don't care about whitespace, and also because it would add many nodes.