Optionally add nodes for whitespace #41

fabpot · 2012-11-05T16:48:07Z

I don't know if that's part of the scope of PHP-Parser, but it would be useful (at least for me as it would greatly simplify the code for my PHP CS fixer -- http://cs.sensiolabs.org/) to be able to have nodes for whitespace. This should probably be an option as most of the time, you don't care about whitespace, and also because it would add many nodes.

nikic · 2012-11-11T22:10:21Z

I think having whitespace information available would be awesome for a lot of applications.. But I can't add whitespaces nodes into the current AST structure. To preserve all whitespace, comments and formatting one would need a concrete syntax tree that would basically a tree that tags stuff on the token stream. For "<?php\n echo 1 + 2;" it would something like this:

PHP[
    T_OPEN_TAG(<?php\n)
    T_WHITESPACE(     )
    Echo[
        T_ECHO(echo)
        T_WHITESPACE( )
        Plus[
            LNumber[
                T_LNUMBER(1)
            ]
            T_WHITESPACE( )
            char(+)
            T_WHITESPACE( )
            LNumber[
                T_LNUMBER(2)
            ]
        ]
        char(;)
    ]
]

This is a completely different structure and would probably require a lot of work to create. (I wanted to look into this far a while already, but never got around to doing it :/ )

Another interesting idea is what @schmittjoh implemented in https://github.com/schmittjoh/php-manipulator: It has the token stream and the AST as separate structures with links between them. I'll look a bit more closer at that and see if it can be integrated with the parser.

pscheit · 2012-11-12T06:45:58Z

correct me if i'm totally wrong here, but isn't the main problem, that the information of whitespace just don't get incorporated into the AST structure?
This is nearly the same problem as with the comments, isn't it?

I think for example fabpot's usecase is to correct indentation etc. But isn't for this the only need that the whitespace information is attached to the current elements we have in the AST?

like:

T_OPEN_Tag { indentation: 0, newline: true }
Echo { indentation: 4, inline-whitespace: 0 }
  LNumber { inline-whitespace: 1}
  char(+) { inline-whitespace: 1}
  LNumber { inline-whitespace: 1}
  char(;} {}

the whitespace information could be attached in a way that makes "most sense". I admit that this is the big part of the work. I used the convention to attach the inline-whitespace to the node after before this is more consistent with identation counting), but last whitespace would be lost.
This additional code-identation-information would allow to write indentation fixers and code-re-compilers

schmittjoh · 2012-11-12T07:18:39Z

Why not storing whitespace information in the existing nodes as attributes?

pscheit · 2012-11-12T07:19:33Z

that is what i wanted to say :)

nikic · 2012-11-12T07:42:17Z

@pscheit @schmittjoh Storing whitespace in the node attributes would come with the same problems that currently exist for comments, just worse. For comments I currently simply store all comments that occur before the node. That works in most cases, but not always (e.g. issue #36 and #37).

For whitespace it's a lot more problematic, because unlike comments whitespace usually also occurs within the node and not just before it. So for echo 1 + 2;, where does the whitespace go? It isn't really helpful if I add a whitespace = [' ', ' ', ' '] attribute to the node.

schmittjoh · 2012-11-12T07:54:27Z

You would need to create a custom format per node, but a separate structure makes all code that is written for the current structure incompatible which would be a huge drawback from my pov at least.

pscheit · 2012-11-12T08:01:39Z

if whitespace before and whitespace after and line breaks would be captured for every node, the indentation could be computed. This would store the same whitespace twice in a node.
Otherwise just the whitespace before could be captured or the whitespace coming after every node. It just has to be consistent..

phptag: after-whitespace: eol
echo: before-whitespace eol + 4 after-whitespace: 1
LNumber(1): before-whitespace: 1 after-whitespace: 1

etc
of course this is hugely verbose, but thats the point with indentation

jakoch · 2012-11-12T10:49:35Z

Well, extending the lexer to handle tokenization of whitespace and comments isn't too hard.
This information must then be passed on to the parser. If you need to avoid getting the current structure incompatible, then you might add a step between lexer and parser, something like a ParserFilter or ParserPrefilter, just to handle a seperate AST with whitespace and comment tokens next to the current AST.
Storing them as attributes in one AST is not good for tree walking, because you would need to walk the whole tree with subtrees, to adjust the values. When you have two ASTs the next step would be merging them.
I would add a rule to the first AST, so that every token is followed by a whitespace and then ask the other tree for the rest of the informations (this saves lots of tree merges in inline stmts, think of the 1+2 example).

nikic · 2012-11-17T00:14:03Z

@schmittjoh @pscheit Having a custom format for every node would cause a lot of work both in the implementation and the use lateron, because every node would require special handling.

That's why I currently lean towards the approach where just the offset into the token stream is saved in the nodes. So every node would have a startOffset and endOffset attribute, which could be used to look up the tokens it is composed off in the token stream. This is something that can be done automatically and the mechanisms for it are already in place. I quickly tried this out and it basically seems to work, though it looks like I sometimes get an off-by-one error for the end token (probably some bug in the handling for the end attributes).

I think this approach is very nice because it cleanly separates the abstract syntax tree and it's concrete formatting.

What will be a bit tricky about this is to figure out how to properly pretty print a partially changed AST. One somehow has to figure out which parts were changed and pretty print only those. And there too one would probably try to keep as much of their content in the original form.

pscheit · 2012-11-17T13:49:00Z

sounds good, at least something i can live with :)

because my current use case would really be just a partly changed AST (and i know exactly the parts, i have changed). I'm looking forward to this!

schmittjoh · 2012-11-20T13:20:38Z

This sounds very similar to the approach that I follow in php-manipulator.

The usage there is basically to jump to an AST node that you are interested in re-writing, at which point you would which to the token stream, re-write the node, and then switch back to the AST stream. If you can provide some mechanisms to make the mapping of AST node to tokens that would help me to simplify that code.

nikic · 2012-11-20T15:18:09Z

@schmittjoh I just fixed a small bug that was getting in the way of properly doing the mapping (cdbad02), but it should be as easy as using a custom lexer that specifies the necessary attributes:

<?php

class LexerWithTokenOffsets extends PHPParser_Lexer {
    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);
        $startAttributes['startOffset'] = $endAttributes['endOffset'] = $this->pos;
        return $tokenId;
    }
}

From a few quick tests this works fairly well, though there will be some rare cases where this will be off (those can be fixed without much effort).

lstrojny · 2012-12-17T23:26:13Z

This would be indeed very helpful for all sorts of transformations that can’t use Generation Gap.

nikic · 2012-12-23T19:19:45Z

My initial attempt at a pretty printer that tries to preserve the formatting: https://gist.github.com/4365484 It turned out really hard to do this (the main issue being indentation) and the current version still doesn't do a particularly good job.

@lstrojny What is Generation Gap?

lstrojny · 2012-12-24T16:42:40Z

Generation Gap (using subclasses as a means to split generated code from non-generated code): http://martinfowler.com/dslCatalog/generationGap.html

My point was: if I can’t do that (e.g. https://github.com/InterNations/ExceptionBundle) because I need to edit existing code, I need to preserve formatting and whitespace and everything else. Otherwise we’ll have unreadable changesets after applying code transformations.

pscheit · 2013-02-19T08:17:28Z

I attempted to write a whitespace preserving prettyPrinter as well. I did it the other way round:
I copied the token_get_all stream from the original PHP into an array indexed by offsets.

The pretty printer replaces only the offsets for tokens that are changed in this array. After that the array is imploded to php code again.

That helps a lot while finding gaps in the token stream and is easier to debug.

webmozart · 2014-01-09T12:47:41Z

Is there any progress on this issue? I would be really interested in partially modifying ASTs and writing them back to files.

joshuaspence · 2015-05-16T01:26:33Z

+1

funivan · 2015-06-08T06:09:20Z

👍

funivan · 2015-10-30T12:55:00Z

echo 1 + 1

EchoStmNode
  WhiteSpaceNode
  Number
  WhiteSpaceNode
  Plus
  WhiteSpaceNode
  Number

To pretty print this code you need go through all nodes and join their value.
Lets go further:

EchoStmNode  => echo
  WhiteSpaceNode => '  '
  Number => 1
  WhiteSpaceNode => '\n\n'
  Plus => +
  FunctionCall => count
    Symbol => (
    WhiteSpaceNode => '   '
    Variable => $items
    Symbol => )
  Semicolon => ;

This code will be represented as:

echo  1

+count(  $items);

White spaces goes inside node.
All modern ides use AST tree. So we can watch for their representation of Php code in AST. (Eclipse, PhpStorm, etc ..)

romm · 2015-11-29T12:13:10Z

Hey, I'd also be very interested in a solution for this issue.

I understand the actual structure is not made to support this, I just want you to know at least one more person is interested. :-)

Have a great day, and thanks for your work!

lisachenko · 2016-01-03T13:17:30Z

Any news for this issue? I wonder, will it be possible to use column+line information from each node instead of introducing a Whitespace node? Then printer will fill empty spaces, for example, for echo 1 + 2 it can check that 1 has position 7, so instead of single whitespace, it will produce two spaces, then checks next token + and add one more space, etc...

So, currently available attributes startTokenPos, endTokenPos, startFilePos, endFilePos can give all information to reconstruct the original source code with formatting and spacing. Not sure about tab character...

nikic · 2016-01-03T13:30:50Z

@lisachenko What I'm doing nowadays is use the AST to do analysis and figure out where things are, but do modifications on the source code (or the tokens) directly. You need some way to queue modifications (e.g. https://github.com/nikic/TypeUtil/blob/master/src/MutableString.php) and then this works pretty well for doing smaller changes based on startFilePos and endFilePos. Obviously this becomes a bit more tricky for larger changes.

As to making use of file/token offsets in the pretty printer, I gave that a try some time ago (https://gist.github.com/anonymous/4365484 linked above). I remember that the implementation was nowhere near robust enough. With enough effort one can probably get it to work (but I won't be working on it).

lisachenko · 2016-01-03T13:48:02Z

@nikic thank you for sharing your thoughts, I'll give a try for your pretty printer, it may suite my needs. Actually, I don't need an exact formatting, only to preserve line numbers, otherwise IDE shows crazy break points in the docblocks, statements and empty spaces ;)
Just combined AST, PHP stream wrapper and stream filter with node visitors to transform PHP source code on the fly. AST hook on userland side looks pretty good.

webmozart · 2016-01-03T17:52:37Z

@nikic Maybe you should close this issue then and add some docs about how to solve this :)

YuriyNasretdinov · 2016-04-04T10:42:59Z

For our soft mocks project we implemented a hack for the stock pretty-printer that allowed to mostly preserve all line numbers except for all kinds of braces because there literally is no line number information about these tokens.

The first part of the code consists of the following and it should work with current versions without major issues:

<?php
class SoftMocksPrinter extends \PhpParser\PrettyPrinter\Standard
{
    private $cur_ln;
    /**
     * Pretty prints an array of nodes (statements) and indents them optionally.
     *
     * @param \PhpParser\Node[] $nodes  Array of nodes
     * @param bool   $indent Whether to indent the printed nodes
     *
     * @return string Pretty printed statements
     */
    protected function pStmts(array $nodes, $indent = true)
    {
        $result = '';
        foreach ($nodes as $node) {
            $row = "";
            $cur_ln = $this->cur_ln;
            $comments = $this->pComments($node->getAttribute('comments', array()));
            $this->cur_ln += substr_count($comments, "\n");
            if ($node->getLine() > $this->cur_ln) {
                $row .= str_repeat("\n", $node->getLine() - $this->cur_ln);
                $this->cur_ln += substr_count($row, "\n");
            }
            $row .= $comments
                . $this->p($node)
                . ($node instanceof \PhpParser\Node\Expr ? ';' : '');
            $this->cur_ln = $cur_ln + substr_count($row, "\n"); // get rid of cur_ln modifications in deeper context
            $result .= $row;
        }
        if ($indent) {
            return preg_replace('~\n(?!$|' . $this->noIndentToken . ')~', "\n    ", $result);
        } else {
            return $result;
        }
    }
    public function prettyPrintFile(array $stmts)
    {
        $this->cur_ln = 1;
        $this->preprocessNodes($stmts);
        return "<?php " . str_replace("\n" . $this->noIndentToken, "\n", $this->pStmts($stmts, false));
    }
    protected function p(\PhpParser\Node $node)
    {
        $prefix = '';
        if ($node->getLine() > $this->cur_ln) {
            $prefix = str_repeat("\n", $node->getLine() - $this->cur_ln);
            $this->cur_ln = $node->getLine();
        }
        return $prefix . $this->{'p' . $node->getType()}($node);
    }

The other part is removing all "\n" from pSmth() functions so that we manually control all whitespace characters.

The problem, obviously, is with this "other part" because it requires writing fragile code that is just a copy of all parent functions. We could not just make calls to parent methods and remove all "\n"'s because functions are recursive.

So I would suggest adding some code into PHP-Parser that would allow some kind of line-preserving pretty-printing by adding possibility to get rid of "\n"'s everywhere (e.g. move "\n" to a public propertly that could be redeclared to be "")

lisachenko · 2016-04-04T11:13:32Z

@nikic @YuriyNasretdinov All Java tools (for example, IDEA) has a special node-type for storing an information about whitespaces, so it would be nice to follow this. But this should be an optional feature for Parser - to capture whitespaces as an AST-nodes. This will increase the complexity of AST-walking, but will give a control over file positions for different elements.

jails · 2016-09-17T17:13:47Z

Just wondering if this is issue has been addressed in the 3.x version since all comments seems to be older than 6 months ?

nikic · 2016-09-17T18:30:59Z

@jails Nope, no changes here.

nikic · 2016-12-26T17:53:10Z

So, I've given this problem (format-preserving pretty prints) another shot and I have a viable prototype now: https://github.com/nikic/PHP-Parser/compare/formatPreservingPrint (based on #322)

The implementation does not add whitespace nodes in the AST, instead we try to reconstruct the original formatting based on token offset information.

Here is a usage example:

$lexer = new Lexer\Emulative([
    'usedAttributes' => [
        'comments',
        'startLine', 'endLine',
        'startTokenPos', 'endTokenPos',
    ],
]);
$parser = new Parser\Php7($lexer, [
    'useIdentifierNodes' => true,
    'useConsistentVariableNodes' => true,
    'useExpressionStatements' => true,
    'useNopStatements' => false,
]);

$traverser = new NodeTraverser();
$traverser->addVisitor(new NodeVisitor\CloningVisitor());

$printer = new PrettyPrinter\Standard();

$oldStmts = $parser->parse($code);
$oldTokens = $lexer->getTokens();

$newStmts = $traverser->traverse($oldStmts);
// MODIFY $newStmts HERE

$newCode = $printer->printFormatPreserving($newStmts, $oldStmts, $oldTokens);

The important bits here are a) that you need to specify a bunch of non-standard options to avoid BC breaks until PHP-Parser 4.0, b) a CloningVisitor is run before any changes, so we retain the original AST as a reference and c) we also need the old tokens from the lexer.

The whole thing isn't well tested yet, so likely going to be many issues depending on which part of the AST you try to modify. There's a couple of limitations as to where we can preserve formatting and where we can't:

Changes to non-Node subnodes results in formatting being lost. The reason is that we don't have offset information for non-Node subnodes. Some of the parser options specified above (e.g. useIdentifierNodes) make use of Nodes where previously strings were used. For example, this allows changing the name of a function while preserving formatting. On the other hand changing a function to return by ref (this is a boolean subnode) will drop formatting.
We can preserve formatting if an item in an array subnode is replaced, but not if an item is removed or added. We can do better here, but this will require a good bit of work (would probably need something like an array differ).
Formatting for anon classes is lost, because the attribute assignment in this case is too weird right now.

nikic · 2016-12-26T20:59:15Z

Here a complete example: https://gist.github.com/nikic/3229644ada5576622d7d538f6bff2098
This is basically https://github.com/Roave/FunctionFQNReplacer, but preserving formatting.

nikic · 2017-01-24T12:07:23Z

Closing this in favor of #344, which contains remaining TODOs for this.

rentalhost · 2019-04-24T18:52:48Z

Instead of create a new T_WHITESPACE, it is not possible add it directly to the next node like "comments" does? Something like:

$lexer = new PhpParser\Lexer([ 'usedAttributes' => [ 'whitespaces' ]]);

Expr_FuncCall(
    ...
    attributes: {
        whitespaces: "\s\s\s\s\s\s\r\n\t\t\s\s"
    }
)

nikic mentioned this issue Nov 21, 2012

Node/Scalar/String.php's parseEscapeSequences() breaks binary strings for (xml) serialization #38

Closed

nikic mentioned this issue Aug 23, 2013

PSR-2 #70

Closed

nikic mentioned this issue Sep 6, 2013

Prettyprinter renders Embedded \n in " strings is into a newline #71

Closed

This was referenced Sep 27, 2014

How distinct method with and without public visibility ? #136

Closed

Php parser array and comments trubble #37

Closed

rr- mentioned this issue Sep 30, 2014

Warning about "unused code after RETURN." PHPCheckstyle/phpcheckstyle#40

Open

nicolas-grekas mentioned this issue Nov 13, 2014

[Info] Working on parsers in PHP #142

Closed

nikic mentioned this issue Mar 2, 2015

Request: generated PHP identical to parsed PHP #179

Closed

keradus mentioned this issue Mar 3, 2015

Added php7 to travis build PHP-CS-Fixer/PHP-CS-Fixer#1036

Closed

popstikle2 mentioned this issue Aug 19, 2015

HEREDOC Pretty printing #216

Closed

nikic mentioned this issue Jun 9, 2016

Parse, change and generate file #282

Closed

This was referenced Jan 18, 2017

Name nodes for property names etc #322

Closed

Empty lines will get ignored #305

Closed

nikic added this to the 4.0 milestone Jan 23, 2017

nikic mentioned this issue Jan 24, 2017

Format-preserving AST transformations TODO #344

Closed

26 tasks

nikic closed this as completed Jan 24, 2017

TomasVotruba mentioned this issue Aug 7, 2017

Made improvements to pretty printing #116

Closed

TomasVotruba referenced this issue in rectorphp/rector Oct 16, 2018

[CodeQuality] Add SimplifyFalseComparisonToNegateRector

a7afe1f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optionally add nodes for whitespace #41

Optionally add nodes for whitespace #41

fabpot commented Nov 5, 2012

nikic commented Nov 11, 2012

pscheit commented Nov 12, 2012

schmittjoh commented Nov 12, 2012

pscheit commented Nov 12, 2012

nikic commented Nov 12, 2012

schmittjoh commented Nov 12, 2012

pscheit commented Nov 12, 2012

jakoch commented Nov 12, 2012

nikic commented Nov 17, 2012

pscheit commented Nov 17, 2012

schmittjoh commented Nov 20, 2012

nikic commented Nov 20, 2012

lstrojny commented Dec 17, 2012

nikic commented Dec 23, 2012

lstrojny commented Dec 24, 2012

pscheit commented Feb 19, 2013

webmozart commented Jan 9, 2014

joshuaspence commented May 16, 2015

funivan commented Jun 8, 2015

funivan commented Oct 30, 2015

romm commented Nov 29, 2015

lisachenko commented Jan 3, 2016

nikic commented Jan 3, 2016

lisachenko commented Jan 3, 2016

webmozart commented Jan 3, 2016

YuriyNasretdinov commented Apr 4, 2016

lisachenko commented Apr 4, 2016

jails commented Sep 17, 2016

nikic commented Sep 17, 2016

nikic commented Dec 26, 2016

nikic commented Dec 26, 2016

nikic commented Jan 24, 2017

rentalhost commented Apr 24, 2019