Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924

Conversation

LeeTibbert
Copy link
Contributor

@LeeTibbert LeeTibbert commented Apr 28, 2020

  • This PR was motivated by recent discussions related to Scalafmt
    PRs FormatWriter: use \h for horizontal whitespace #1916, FormatWriter bugfix: add margin in the middle only #1917, and those mentioned in those PRs.

  • The primary purpose of this PR is to provide a concrete example
    of how the regular expressions for leadingAsteriskSpace and
    compileStripMarginPattern be simplified.

    Discussions about readability are always subjective. I believe
    that removing the positive and negative lookaheads makes it
    more likely that the next maintainer will be able to read and
    comprehend the code.

    Discussions about runtime performance improvements without
    demonstrable data are worth than useless. lookaheads are
    known to have a high runtime cost. It is believable, but not
    asserted by this PR for lack of evidence, that the code of this
    PR will have better performance characteristics.

  • Whilst simplifying the two regex expressions, I noticed what I
    believe to be a bug in compileStripMarginPattern. Scala allows
    any character other than NULL to be used as a pipe character.
    Java requires that Dollar ($) and backslash () characters
    in the replacement string be quoted (e.g. in a string: "\$").
    The code prior to this PR did not do this.

  • The handling of compileStripMarginPattern is a bit tricky. Every/any regex expression should
    be sealed with the Curse of the Mummy, but that one is particularly deserving.

    The active pipe character is not in scope where the replaceAll() is done.
    Using a capture group allows the pipe character from where the pattern is compiled to be
    used in replaceAll(),

    Agreed, a code smell, but, as a guest, I wanted/needed to minimize changes to the
    existing code. One capture group is not all that expensive. The "[^*]" character class
    in leadingAsteriskSpace probably takes more memory.

  • The code of this PR passes all current scalafmt tests:
    sbt> tests/test

[info] ScalaTest
[info] Run completed in 1 minute, 23 seconds.
[info] Total number of tests run: 2406
[info] Suites: completed 18, aborted 0
[info] Tests: succeeded 2406, failed 0, canceled 0, ignored 8, pending 0
[info] All tests passed.
[info] Passed: Total 2406, Failed 0, Errors 0, Passed 2406, Ignored 8
[success] Total time: 90 s (01:30), completed Apr 28, 2020, 2:41:12 PM
  • Three cases were added to ./scalafmt-tests/src/test/resources/test/StripMargin.stat to
    ensure that potentially troublesome stripMargin() pipe characters, such as Dollar Sign,
    Backslash, and Right Parenthesis work as intended.

    Test cases for Unicode characters in leadingAsteriskSpace for trailing whitespace
    remain as an exercise for the reader. The lack of such tests is a blemish on my
    engineering.

  • I did not test it but the \n approach should work on the/most Windows operating system.
    It is what has been in the code for years.

  • The regular expressions in this PR should work with Scala Native and make an eventual
    port easier.

Documentation:

  • None required. The changes in this PR should cause no user visible
    change. I do not know if the practice of scalafmt development is to create
    a change log entry for user invisible changes or if the git history is enough.

Testing:

  • Built from a clean slate and manually tested ("test/test")
    using sbt 1.3.10 & Java 11 on X86_64 only. All tests pass.

  * This PR was motivated by recent discussions related to Scalafmt
    PRs scalameta#1916, scalameta#1917, and those mentioned in those PRs.

  * The primary purpose of this PR is to provide a concrete example
    of how the regular expressions for `leadingAsteriskSpace` and
    `compileStripMarginPattern` be simplified.

    Discussions about readability are always subjective. I believe
    that removing the positive and negative lookaheads makes it
    more likely that the next maintainer will be able to read and
    comprehend the code.

    Discussions about runtime performance improvements without
    demonstrable data are worth than useless.  lookaheads are
    known to have a high runtime cost. It is believable, but not
    asserted by this PR for lack of evidence, that the code of this
    PR will have better performance characteristics.

  * Whilst simplifying the two regex expressions, I noticed what I
    believe to be a bug in `compileStripMarginPattern`. Scala allows
    any character other than NULL to be used as a pipe character.
    Java requires that Dollar ($) and backslash (\) characters
    in the replacement string be quoted (e.g. in a string: "\\$").
    The code prior to this PR did not do this.

  * The code of this PR passes all current scalafmt tests:
    `sbt> tests/test`
    ```
[info] ScalaTest
[info] Run completed in 1 minute, 23 seconds.
[info] Total number of tests run: 2406
[info] Suites: completed 18, aborted 0
[info] Tests: succeeded 2406, failed 0, canceled 0, ignored 8, pending 0
[info] All tests passed.
[info] Passed: Total 2406, Failed 0, Errors 0, Passed 2406, Ignored 8
[success] Total time: 90 s (01:30), completed Apr 28, 2020, 2:41:12 PM
    ```

  * Because of other pressing demands on my time, I did not craft a
    unit test before fixing it. If the basic changes of this PR are
    acceptable, I can study
    `./scalafmt-tests/src/test/resources/test/StripMargin.stat` to see
    if I can add cases for Dollar and Backslash.

Documentation:

  * None required. The changes in this PR should cause no user visible
    change

Testing:

  * Built from a clean slate and manually tested ("test/test")
    using sbt 1.3.10 & Java 11 on X86_64 only. All tests pass.
@LeeTibbert LeeTibbert changed the title WIP: internal/FormatWriter simplify two regexen fix pipe char bug WIP: internal/FormatWriter simplify two regexen; fix pipe char bug Apr 28, 2020
@@ -735,7 +737,7 @@ object FormatWriter {
}

private val leadingAsteriskSpace =
Pattern.compile("(?<=\\n)\\h*(?=[*][^*])")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why allowing match on multiple asterisks is equivalent? I tried yesterday, there was a test which failed but looks like for you it didn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the requirement that " *X" gets replaced and " **" (double asterisk) does not as
"Do not even think about breaking this" lore. The match in the capture group
should match anything but what would be a second asterisk [^*].

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, was looking at the code on my phone, didn't see the [^*]. never mind.

\n vs \\n, any difference?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\n vs \\n, any difference?

thanks for adding tests. this question above is the last one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: \n vs \\n, any difference?

Pardon me for being concrete, no disrespect intended. The concrete answer
is more of my inability to organize my thoughts above that. I have spent
hours, days, or, more probably weeks, being hosed by exactly this issue.
The recurrence interval between episodes of my being hosed is just slightly
longer than my memory of the solution. Mea culpa!

It is my understanding that for 'normal' strings \n is always used and gets translated
to one standard Java two-octet Character.

In s-interpolated strings (and possibly/probably other interpolations) things get
a bit more confusing, at least for me. I believe that a \n gets translated as for
a simple string. In \\n the first backslash quotes the second backslash, which
then becomes \n which Pattern.compile interprets as though the original
string had been \n.

So, both work in interpolated strings, unless you have evidence or experience to
the contrary. I tend to prefer the single \n because it has fewer characters and transformations to parse.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: \n vs \\n, any difference?

So, both work in interpolated strings, unless you have evidence or experience to
the contrary. I tend to prefer the single \n because it has fewer characters and transformations to parse.

like \\s, \\h or other special escape sequences, i was curious whether \\n is interpreted by Pattern as anything different than simple 0xA. clearly, \n is directly substituted as 0xA so pattern only sees that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: tests

I found StripMargin.stat and added tests for only the align aspect. I did not address
the no align case, which is a weasel, driven by limited time.

I had expected to find tests for leading and trailing spaces. I could not find anything relevant
after a number of trials: using find for filenames and/or grepping for "whitespace" and
combinations.

I did find a Unicode.stat and could slip in a case for leading and trailing horizontal
whitespace containing Unicode there. It would seem more integrated to have
it in a leadingWhitespace.spec if such a thing exists. Would, IMHO, make it easier
for future visiting devos, such as myself, to find.

Off the top of your head, do you have any clues or suggestions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off the top of your head, do you have any clues or suggestions.

I added one tab in the default/Comment test, see at the bottom:

https://github.com/scalameta/scalafmt/pull/1911/files

you can create a dedicated whitespace test but i considered it part of handling comments and strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kitbellew The info you gave helped me to get to the next round. Thank you.

All tests, new & previous, pass on my system. I will check CI later.

The tests are still sensitive to some future devo removing the now invisible whitespace
especially the tabs.

I think the three regex expressions are as simple & fast as the requirements and reasonable
end use allows. I left a comment in FormatWriter.scala explaining why. Figured that might
endure long after the present conversation is forgotten.

I suspect that there are a couple of bugs remaining in scalafmt relating to tabs. I will see in
my mainline daily work if I can trick them from coming out of the dim shadows.

Unit tests prove their worth, much as they are a pain to write. The hard time I had
getting unit tests to accept Ogham marks and unicode \uMumble convinced me
that a simple "space or tab" test is all that is needed. There is nothing wrong with
"\h", just that it will spend a lot of time testing against characters which can not be
in the input. Sorry if I lead you down a loosing path.

If you like the current work, I can update & buff the commit message and change the
title.

Copy link
Collaborator

@kitbellew kitbellew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. when you are ready, perhaps remove the WIP indicator, and let's have @poslegm review as well, he reviewed the previous changes.

@@ -125,7 +125,7 @@ object a {
/*

/**
* comment
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you remove this because you create a separate test case for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. The case seemed to be multiplexed in a non-obvious (to me way).
I de-multiplexed the concerns.

Creating a second test with a title which described its primary (and sole) concerned
seemed to be a gift to future people trying to suss this stuff.
The added complexity of a second test seemed worth the benefit.

val a = 1
val b = 2
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you mind remove the 3 blank lines at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thank you. So you have something against trailing whitespace ;-)
Guess I had expected scalafmt to remove those for me. Must not be set in
the project .scalafmt.conf or I bungled it.

* and tab (\u0009) are horizontal whitespace.
* Save execution time by not checking what the character could not be.
*
* */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be */ or **/ instead of * */?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being a visitor and it not being my usual coding style, for consistency I did a
"monkey see, monkey do" of several other comments in the same file.

Yes, the trailing "* */" is not a variant I have seen in recent memory,

That comment is re-worked in the commit I am about to submit, using my
usual "//" style. If the concept of the comment is good, I can adapt the
comment style to anything you suggest. Small dragons.

* https://github.com/scala/scala/blob/2.11.x/spec/01-lexical-syntax.md
* gives the recognized scala whitespace characters.
* "Whitespace characters. \U0020 | \U0009 | \U000D | \U000A"
* (original text has lowercase u. U substituted here to allow scalafmt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment relevant to the code? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought so, but that section is superseded now. After I commit, let me know if
I can improve the new text. I try to get enough across so that I can remember six months
down the road and so that a colleague can read it without complaining about "War & Peace".

private val trailingSpace = Pattern.compile("\\h+$", Pattern.MULTILINE)
/**
* No Unicode is necessary in the three (3) Pattern.compile() below.
* scalafmt parser will have already have objected.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you saying that nothing other than space and tab are allowed by scalac or scalameta parser that precedes formatting?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be allowed according to specs.

A cursory examination of the scalameta sources led to this. So, it matches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After another long day of creating test cases and trying to be perverse, it turns out
that trailing whitespace with Unicode whitespace is allowed in comments. As cited,
it is rejected in non-comment code.

I am about to do a commit which handles, tests, & succinctly (I hope) documents.

The discussion here helps.

@kitbellew
Copy link
Collaborator

@LeeTibbert will you be able to update this soon? we are planning to release v2.5.0 this week.

@poslegm can you please review? this is a follow-up to #1911, #1916 and #1917.

@kitbellew kitbellew requested a review from poslegm April 30, 2020 15:58
@poslegm
Copy link
Collaborator

poslegm commented Apr 30, 2020

@kitbellew I am ok with it. Also I think no one will be hurt if this goes into the 2.5.1 instead of 2.5.0 :) So no need to rush

@LeeTibbert
Copy link
Contributor Author

If it does not cause unnecessary work at your end, I would be happier getting on the
2.5.1 train. I have appropriate confidence in the code, but always hate & fear
'just one more last minute change'.

Right now I just need to run scalafmt on my code and resolve some conflicts.

@kitbellew
Copy link
Collaborator

If it does not cause unnecessary work at your end, I would be happier getting on the
2.5.1 train. I have appropriate confidence in the code, but always hate & fear
'just one more last minute change'.

Right now I just need to run scalafmt on my code and resolve some conflicts.

Thank you for your work, patience with my (hopefully minor) nitpicking, and let's target 2.5.1.

LeeTibbert added a commit to LeeTibbert/scalafmt that referenced this pull request Apr 30, 2020
  * The primary purpose of this PR is simplify two regular expressions
    (regexen). One used by `leadingAsteriskSpace` and the other by
    `compileStripMarginPattern`.

    Discussions about readability are always subjective. I believe
    that removing the positive and negative lookaheads makes it
    more likely that the next maintainer will be able to read and
    comprehend the code.

    Discussions about runtime performance improvements without
    demonstrable data are worth than useless.  lookaheads are
    known to have a high runtime cost. It is believable, but not
    asserted by this PR for lack of evidence, that the code of this
    PR will have better performance characteristics.

 *  Unicode whitespace handling was added, along with test cases.

 * This PR supersedes the Work In Progress PR scalameta#1924. It has
   benefited greatly from discussions and multiple rounds of review there.
   My thanks and appreciation to the patient reviewers.

  * Whilst simplifying the two regex expressions, I noticed what I
    believe to be a bug in `compileStripMarginPattern`. Scala allows
    any character other than NULL to be used as a pipe character.
    Java requires that Dollar ($) and backslash (\) characters
    in the replacement string be quoted (e.g. in a string: "\\$").
    The code prior to this PR did not do this.

 * Previous regexen precluded Scala Native support. The ones of this
   PR do not. Scala Native does not yet support "\h" but it is
   believable that such support could be added.

 * The handling of `compileStripMarginPattern` is a bit tricky.
    Every/any regex expression should be sealed with the Curse of the Mummy,
    but that one is particularly deserving.

     The active pipe character is not in scope where the replaceAll() is done.
     Using a capture group allows the pipe character from where the pattern
     is compiled to be used in replaceAll(),

 * I did not test it but the `\n` approach should work on the/most Windows
   operating system. It is what has been in the scalafmt code for years.

Documentation:

  * None required. The changes in this PR should cause no user visible
    change. I do not know if the practice of scalafmt development is to create
    a change log entry for user invisible changes or
    if the git history is enough.

Testing:

  * Built from a clean slate and manually tested ("test/test")
    using sbt 1.3.10 & Java 8 on X86_64 only. All tests pass.
@LeeTibbert
Copy link
Contributor Author

@kitbellew I have appreciated your timely & patient review. Regular expressions are
supposed to save time & complexity but I find they really suck up time. This PR has
benefited from the review. I am sorry that it took so many cycles.

@LeeTibbert
Copy link
Contributor Author

Superseded by PR #1936, its higher & purer form.

@LeeTibbert LeeTibbert closed this May 1, 2020
kitbellew pushed a commit that referenced this pull request May 1, 2020
  * The primary purpose of this PR is simplify two regular expressions
    (regexen). One used by `leadingAsteriskSpace` and the other by
    `compileStripMarginPattern`.

    Discussions about readability are always subjective. I believe
    that removing the positive and negative lookaheads makes it
    more likely that the next maintainer will be able to read and
    comprehend the code.

    Discussions about runtime performance improvements without
    demonstrable data are worth than useless.  lookaheads are
    known to have a high runtime cost. It is believable, but not
    asserted by this PR for lack of evidence, that the code of this
    PR will have better performance characteristics.

 *  Unicode whitespace handling was added, along with test cases.

 * This PR supersedes the Work In Progress PR #1924. It has
   benefited greatly from discussions and multiple rounds of review there.
   My thanks and appreciation to the patient reviewers.

  * Whilst simplifying the two regex expressions, I noticed what I
    believe to be a bug in `compileStripMarginPattern`. Scala allows
    any character other than NULL to be used as a pipe character.
    Java requires that Dollar ($) and backslash (\) characters
    in the replacement string be quoted (e.g. in a string: "\\$").
    The code prior to this PR did not do this.

 * Previous regexen precluded Scala Native support. The ones of this
   PR do not. Scala Native does not yet support "\h" but it is
   believable that such support could be added.

 * The handling of `compileStripMarginPattern` is a bit tricky.
    Every/any regex expression should be sealed with the Curse of the Mummy,
    but that one is particularly deserving.

     The active pipe character is not in scope where the replaceAll() is done.
     Using a capture group allows the pipe character from where the pattern
     is compiled to be used in replaceAll(),

 * I did not test it but the `\n` approach should work on the/most Windows
   operating system. It is what has been in the scalafmt code for years.

Documentation:

  * None required. The changes in this PR should cause no user visible
    change. I do not know if the practice of scalafmt development is to create
    a change log entry for user invisible changes or
    if the git history is enough.

Testing:

  * Built from a clean slate and manually tested ("test/test")
    using sbt 1.3.10 & Java 8 on X86_64 only. All tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants