Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally add nodes for whitespace #41

Closed
fabpot opened this issue Nov 5, 2012 · 33 comments
Closed

Optionally add nodes for whitespace #41

fabpot opened this issue Nov 5, 2012 · 33 comments
Milestone

Comments

@fabpot
Copy link
Contributor

fabpot commented Nov 5, 2012

I don't know if that's part of the scope of PHP-Parser, but it would be useful (at least for me as it would greatly simplify the code for my PHP CS fixer -- http://cs.sensiolabs.org/) to be able to have nodes for whitespace. This should probably be an option as most of the time, you don't care about whitespace, and also because it would add many nodes.

@nikic
Copy link
Owner

nikic commented Nov 11, 2012

I think having whitespace information available would be awesome for a lot of applications.. But I can't add whitespaces nodes into the current AST structure. To preserve all whitespace, comments and formatting one would need a concrete syntax tree that would basically a tree that tags stuff on the token stream. For "<?php\n echo 1 + 2;" it would something like this:

PHP[
    T_OPEN_TAG(<?php\n)
    T_WHITESPACE(     )
    Echo[
        T_ECHO(echo)
        T_WHITESPACE( )
        Plus[
            LNumber[
                T_LNUMBER(1)
            ]
            T_WHITESPACE( )
            char(+)
            T_WHITESPACE( )
            LNumber[
                T_LNUMBER(2)
            ]
        ]
        char(;)
    ]
]

This is a completely different structure and would probably require a lot of work to create. (I wanted to look into this far a while already, but never got around to doing it :/ )

Another interesting idea is what @schmittjoh implemented in https://github.com/schmittjoh/php-manipulator: It has the token stream and the AST as separate structures with links between them. I'll look a bit more closer at that and see if it can be integrated with the parser.

@pscheit
Copy link

pscheit commented Nov 12, 2012

correct me if i'm totally wrong here, but isn't the main problem, that the information of whitespace just don't get incorporated into the AST structure?
This is nearly the same problem as with the comments, isn't it?

I think for example fabpot's usecase is to correct indentation etc. But isn't for this the only need that the whitespace information is attached to the current elements we have in the AST?

like:

T_OPEN_Tag { indentation: 0, newline: true }
Echo { indentation: 4, inline-whitespace: 0 }
  LNumber { inline-whitespace: 1}
  char(+) { inline-whitespace: 1}
  LNumber { inline-whitespace: 1}
  char(;} {}

the whitespace information could be attached in a way that makes "most sense". I admit that this is the big part of the work. I used the convention to attach the inline-whitespace to the node after before this is more consistent with identation counting), but last whitespace would be lost.
This additional code-identation-information would allow to write indentation fixers and code-re-compilers

@schmittjoh
Copy link
Contributor

Why not storing whitespace information in the existing nodes as attributes?

@pscheit
Copy link

pscheit commented Nov 12, 2012

that is what i wanted to say :)

@nikic
Copy link
Owner

nikic commented Nov 12, 2012

@pscheit @schmittjoh Storing whitespace in the node attributes would come with the same problems that currently exist for comments, just worse. For comments I currently simply store all comments that occur before the node. That works in most cases, but not always (e.g. issue #36 and #37).

For whitespace it's a lot more problematic, because unlike comments whitespace usually also occurs within the node and not just before it. So for echo 1 + 2;, where does the whitespace go? It isn't really helpful if I add a whitespace = [' ', ' ', ' '] attribute to the node.

@schmittjoh
Copy link
Contributor

You would need to create a custom format per node, but a separate structure makes all code that is written for the current structure incompatible which would be a huge drawback from my pov at least.

@pscheit
Copy link

pscheit commented Nov 12, 2012

if whitespace before and whitespace after and line breaks would be captured for every node, the indentation could be computed. This would store the same whitespace twice in a node.
Otherwise just the whitespace before could be captured or the whitespace coming after every node. It just has to be consistent..

phptag: after-whitespace: eol
echo: before-whitespace eol + 4 after-whitespace: 1
LNumber(1): before-whitespace: 1 after-whitespace: 1

etc
of course this is hugely verbose, but thats the point with indentation

@jakoch
Copy link

jakoch commented Nov 12, 2012

Well, extending the lexer to handle tokenization of whitespace and comments isn't too hard.
This information must then be passed on to the parser. If you need to avoid getting the current structure incompatible, then you might add a step between lexer and parser, something like a ParserFilter or ParserPrefilter, just to handle a seperate AST with whitespace and comment tokens next to the current AST.
Storing them as attributes in one AST is not good for tree walking, because you would need to walk the whole tree with subtrees, to adjust the values. When you have two ASTs the next step would be merging them.
I would add a rule to the first AST, so that every token is followed by a whitespace and then ask the other tree for the rest of the informations (this saves lots of tree merges in inline stmts, think of the 1+2 example).

@nikic
Copy link
Owner

nikic commented Nov 17, 2012

@schmittjoh @pscheit Having a custom format for every node would cause a lot of work both in the implementation and the use lateron, because every node would require special handling.

That's why I currently lean towards the approach where just the offset into the token stream is saved in the nodes. So every node would have a startOffset and endOffset attribute, which could be used to look up the tokens it is composed off in the token stream. This is something that can be done automatically and the mechanisms for it are already in place. I quickly tried this out and it basically seems to work, though it looks like I sometimes get an off-by-one error for the end token (probably some bug in the handling for the end attributes).

I think this approach is very nice because it cleanly separates the abstract syntax tree and it's concrete formatting.

What will be a bit tricky about this is to figure out how to properly pretty print a partially changed AST. One somehow has to figure out which parts were changed and pretty print only those. And there too one would probably try to keep as much of their content in the original form.

@pscheit
Copy link

pscheit commented Nov 17, 2012

sounds good, at least something i can live with :)

because my current use case would really be just a partly changed AST (and i know exactly the parts, i have changed). I'm looking forward to this!

@schmittjoh
Copy link
Contributor

This sounds very similar to the approach that I follow in php-manipulator.

The usage there is basically to jump to an AST node that you are interested in re-writing, at which point you would which to the token stream, re-write the node, and then switch back to the AST stream. If you can provide some mechanisms to make the mapping of AST node to tokens that would help me to simplify that code.

@nikic
Copy link
Owner

nikic commented Nov 20, 2012

@schmittjoh I just fixed a small bug that was getting in the way of properly doing the mapping (cdbad02), but it should be as easy as using a custom lexer that specifies the necessary attributes:

<?php

class LexerWithTokenOffsets extends PHPParser_Lexer {
    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);
        $startAttributes['startOffset'] = $endAttributes['endOffset'] = $this->pos;
        return $tokenId;
    }
}

From a few quick tests this works fairly well, though there will be some rare cases where this will be off (those can be fixed without much effort).

@lstrojny
Copy link

This would be indeed very helpful for all sorts of transformations that can’t use Generation Gap.

@nikic
Copy link
Owner

nikic commented Dec 23, 2012

My initial attempt at a pretty printer that tries to preserve the formatting: https://gist.github.com/4365484 It turned out really hard to do this (the main issue being indentation) and the current version still doesn't do a particularly good job.

@lstrojny What is Generation Gap?

@lstrojny
Copy link

Generation Gap (using subclasses as a means to split generated code from non-generated code): http://martinfowler.com/dslCatalog/generationGap.html

My point was: if I can’t do that (e.g. https://github.com/InterNations/ExceptionBundle) because I need to edit existing code, I need to preserve formatting and whitespace and everything else. Otherwise we’ll have unreadable changesets after applying code transformations.

@pscheit
Copy link

pscheit commented Feb 19, 2013

I attempted to write a whitespace preserving prettyPrinter as well. I did it the other way round:
I copied the token_get_all stream from the original PHP into an array indexed by offsets.

The pretty printer replaces only the offsets for tokens that are changed in this array. After that the array is imploded to php code again.

That helps a lot while finding gaps in the token stream and is easier to debug.

@webmozart
Copy link

Is there any progress on this issue? I would be really interested in partially modifying ASTs and writing them back to files.

@joshuaspence
Copy link

+1

1 similar comment
@funivan
Copy link

funivan commented Jun 8, 2015

👍

@funivan
Copy link

funivan commented Oct 30, 2015

echo 1 + 1

EchoStmNode
  WhiteSpaceNode
  Number
  WhiteSpaceNode
  Plus
  WhiteSpaceNode
  Number

To pretty print this code you need go through all nodes and join their value.
Lets go further:

EchoStmNode  => echo
  WhiteSpaceNode => '  '
  Number => 1
  WhiteSpaceNode => '\n\n'
  Plus => +
  FunctionCall => count
    Symbol => (
    WhiteSpaceNode => '   '
    Variable => $items
    Symbol => )
  Semicolon => ;

This code will be represented as:

echo  1

+count(  $items);

White spaces goes inside node.
All modern ides use AST tree. So we can watch for their representation of Php code in AST. (Eclipse, PhpStorm, etc ..)

@romm
Copy link
Contributor

romm commented Nov 29, 2015

Hey, I'd also be very interested in a solution for this issue.

I understand the actual structure is not made to support this, I just want you to know at least one more person is interested. :-)

Have a great day, and thanks for your work!

@lisachenko
Copy link
Contributor

Any news for this issue? I wonder, will it be possible to use column+line information from each node instead of introducing a Whitespace node? Then printer will fill empty spaces, for example, for echo 1 + 2 it can check that 1 has position 7, so instead of single whitespace, it will produce two spaces, then checks next token + and add one more space, etc...

So, currently available attributes startTokenPos, endTokenPos, startFilePos, endFilePos can give all information to reconstruct the original source code with formatting and spacing. Not sure about tab character...

@nikic
Copy link
Owner

nikic commented Jan 3, 2016

@lisachenko What I'm doing nowadays is use the AST to do analysis and figure out where things are, but do modifications on the source code (or the tokens) directly. You need some way to queue modifications (e.g. https://github.com/nikic/TypeUtil/blob/master/src/MutableString.php) and then this works pretty well for doing smaller changes based on startFilePos and endFilePos. Obviously this becomes a bit more tricky for larger changes.

As to making use of file/token offsets in the pretty printer, I gave that a try some time ago (https://gist.github.com/anonymous/4365484 linked above). I remember that the implementation was nowhere near robust enough. With enough effort one can probably get it to work (but I won't be working on it).

@lisachenko
Copy link
Contributor

@nikic thank you for sharing your thoughts, I'll give a try for your pretty printer, it may suite my needs. Actually, I don't need an exact formatting, only to preserve line numbers, otherwise IDE shows crazy break points in the docblocks, statements and empty spaces ;)
Just combined AST, PHP stream wrapper and stream filter with node visitors to transform PHP source code on the fly. AST hook on userland side looks pretty good.

@webmozart
Copy link

@nikic Maybe you should close this issue then and add some docs about how to solve this :)

@YuriyNasretdinov
Copy link

For our soft mocks project we implemented a hack for the stock pretty-printer that allowed to mostly preserve all line numbers except for all kinds of braces because there literally is no line number information about these tokens.

The first part of the code consists of the following and it should work with current versions without major issues:

<?php
class SoftMocksPrinter extends \PhpParser\PrettyPrinter\Standard
{
    private $cur_ln;
    /**
     * Pretty prints an array of nodes (statements) and indents them optionally.
     *
     * @param \PhpParser\Node[] $nodes  Array of nodes
     * @param bool   $indent Whether to indent the printed nodes
     *
     * @return string Pretty printed statements
     */
    protected function pStmts(array $nodes, $indent = true)
    {
        $result = '';
        foreach ($nodes as $node) {
            $row = "";
            $cur_ln = $this->cur_ln;
            $comments = $this->pComments($node->getAttribute('comments', array()));
            $this->cur_ln += substr_count($comments, "\n");
            if ($node->getLine() > $this->cur_ln) {
                $row .= str_repeat("\n", $node->getLine() - $this->cur_ln);
                $this->cur_ln += substr_count($row, "\n");
            }
            $row .= $comments
                . $this->p($node)
                . ($node instanceof \PhpParser\Node\Expr ? ';' : '');
            $this->cur_ln = $cur_ln + substr_count($row, "\n"); // get rid of cur_ln modifications in deeper context
            $result .= $row;
        }
        if ($indent) {
            return preg_replace('~\n(?!$|' . $this->noIndentToken . ')~', "\n    ", $result);
        } else {
            return $result;
        }
    }
    public function prettyPrintFile(array $stmts)
    {
        $this->cur_ln = 1;
        $this->preprocessNodes($stmts);
        return "<?php " . str_replace("\n" . $this->noIndentToken, "\n", $this->pStmts($stmts, false));
    }
    protected function p(\PhpParser\Node $node)
    {
        $prefix = '';
        if ($node->getLine() > $this->cur_ln) {
            $prefix = str_repeat("\n", $node->getLine() - $this->cur_ln);
            $this->cur_ln = $node->getLine();
        }
        return $prefix . $this->{'p' . $node->getType()}($node);
    }

The other part is removing all "\n" from pSmth() functions so that we manually control all whitespace characters.

The problem, obviously, is with this "other part" because it requires writing fragile code that is just a copy of all parent functions. We could not just make calls to parent methods and remove all "\n"'s because functions are recursive.

So I would suggest adding some code into PHP-Parser that would allow some kind of line-preserving pretty-printing by adding possibility to get rid of "\n"'s everywhere (e.g. move "\n" to a public propertly that could be redeclared to be "")

@lisachenko
Copy link
Contributor

@nikic @YuriyNasretdinov All Java tools (for example, IDEA) has a special node-type for storing an information about whitespaces, so it would be nice to follow this. But this should be an optional feature for Parser - to capture whitespaces as an AST-nodes. This will increase the complexity of AST-walking, but will give a control over file positions for different elements.

@jails
Copy link

jails commented Sep 17, 2016

Just wondering if this is issue has been addressed in the 3.x version since all comments seems to be older than 6 months ?

@nikic
Copy link
Owner

nikic commented Sep 17, 2016

@jails Nope, no changes here.

@nikic
Copy link
Owner

nikic commented Dec 26, 2016

So, I've given this problem (format-preserving pretty prints) another shot and I have a viable prototype now: https://github.com/nikic/PHP-Parser/compare/formatPreservingPrint (based on #322)

The implementation does not add whitespace nodes in the AST, instead we try to reconstruct the original formatting based on token offset information.

Here is a usage example:

$lexer = new Lexer\Emulative([
    'usedAttributes' => [
        'comments',
        'startLine', 'endLine',
        'startTokenPos', 'endTokenPos',
    ],
]);
$parser = new Parser\Php7($lexer, [
    'useIdentifierNodes' => true,
    'useConsistentVariableNodes' => true,
    'useExpressionStatements' => true,
    'useNopStatements' => false,
]);

$traverser = new NodeTraverser();
$traverser->addVisitor(new NodeVisitor\CloningVisitor());

$printer = new PrettyPrinter\Standard();

$oldStmts = $parser->parse($code);
$oldTokens = $lexer->getTokens();

$newStmts = $traverser->traverse($oldStmts);
// MODIFY $newStmts HERE

$newCode = $printer->printFormatPreserving($newStmts, $oldStmts, $oldTokens);

The important bits here are a) that you need to specify a bunch of non-standard options to avoid BC breaks until PHP-Parser 4.0, b) a CloningVisitor is run before any changes, so we retain the original AST as a reference and c) we also need the old tokens from the lexer.

The whole thing isn't well tested yet, so likely going to be many issues depending on which part of the AST you try to modify. There's a couple of limitations as to where we can preserve formatting and where we can't:

  • Changes to non-Node subnodes results in formatting being lost. The reason is that we don't have offset information for non-Node subnodes. Some of the parser options specified above (e.g. useIdentifierNodes) make use of Nodes where previously strings were used. For example, this allows changing the name of a function while preserving formatting. On the other hand changing a function to return by ref (this is a boolean subnode) will drop formatting.
  • We can preserve formatting if an item in an array subnode is replaced, but not if an item is removed or added. We can do better here, but this will require a good bit of work (would probably need something like an array differ).
  • Formatting for anon classes is lost, because the attribute assignment in this case is too weird right now.

@nikic
Copy link
Owner

nikic commented Dec 26, 2016

Here a complete example: https://gist.github.com/nikic/3229644ada5576622d7d538f6bff2098
This is basically https://github.com/Roave/FunctionFQNReplacer, but preserving formatting.

@nikic nikic added this to the 4.0 milestone Jan 23, 2017
@nikic
Copy link
Owner

nikic commented Jan 24, 2017

Closing this in favor of #344, which contains remaining TODOs for this.

@rentalhost
Copy link

Instead of create a new T_WHITESPACE, it is not possible add it directly to the next node like "comments" does? Something like:

$lexer = new PhpParser\Lexer([ 'usedAttributes' => [ 'whitespaces' ]]);
Expr_FuncCall(
    ...
    attributes: {
        whitespaces: "\s\s\s\s\s\s\r\n\t\t\s\s"
    }
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests