WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924

LeeTibbert · 2020-04-28T19:41:05Z

This PR was motivated by recent discussions related to Scalafmt
PRs FormatWriter: use \h for horizontal whitespace #1916, FormatWriter bugfix: add margin in the middle only #1917, and those mentioned in those PRs.
The primary purpose of this PR is to provide a concrete example
of how the regular expressions for leadingAsteriskSpace and
compileStripMarginPattern be simplified.

Discussions about readability are always subjective. I believe
that removing the positive and negative lookaheads makes it
more likely that the next maintainer will be able to read and
comprehend the code.

Discussions about runtime performance improvements without
demonstrable data are worth than useless. lookaheads are
known to have a high runtime cost. It is believable, but not
asserted by this PR for lack of evidence, that the code of this
PR will have better performance characteristics.
Whilst simplifying the two regex expressions, I noticed what I
believe to be a bug in compileStripMarginPattern. Scala allows
any character other than NULL to be used as a pipe character.
Java requires that Dollar ($) and backslash () characters
in the replacement string be quoted (e.g. in a string: "\$").
The code prior to this PR did not do this.
The handling of compileStripMarginPattern is a bit tricky. Every/any regex expression should
be sealed with the Curse of the Mummy, but that one is particularly deserving.

The active pipe character is not in scope where the replaceAll() is done.
Using a capture group allows the pipe character from where the pattern is compiled to be
used in replaceAll(),

Agreed, a code smell, but, as a guest, I wanted/needed to minimize changes to the
existing code. One capture group is not all that expensive. The "[^*]" character class
in leadingAsteriskSpace probably takes more memory.
The code of this PR passes all current scalafmt tests:
sbt> tests/test

[info] ScalaTest
[info] Run completed in 1 minute, 23 seconds.
[info] Total number of tests run: 2406
[info] Suites: completed 18, aborted 0
[info] Tests: succeeded 2406, failed 0, canceled 0, ignored 8, pending 0
[info] All tests passed.
[info] Passed: Total 2406, Failed 0, Errors 0, Passed 2406, Ignored 8
[success] Total time: 90 s (01:30), completed Apr 28, 2020, 2:41:12 PM

Three cases were added to ./scalafmt-tests/src/test/resources/test/StripMargin.stat to
ensure that potentially troublesome stripMargin() pipe characters, such as Dollar Sign,
Backslash, and Right Parenthesis work as intended.

Test cases for Unicode characters in leadingAsteriskSpace for trailing whitespace
remain as an exercise for the reader. The lack of such tests is a blemish on my
engineering.
I did not test it but the \n approach should work on the/most Windows operating system.
It is what has been in the code for years.
The regular expressions in this PR should work with Scala Native and make an eventual
port easier.

Documentation:

None required. The changes in this PR should cause no user visible
change. I do not know if the practice of scalafmt development is to create
a change log entry for user invisible changes or if the git history is enough.

Testing:

Built from a clean slate and manually tested ("test/test")
using sbt 1.3.10 & Java 11 on X86_64 only. All tests pass.

* This PR was motivated by recent discussions related to Scalafmt PRs scalameta#1916, scalameta#1917, and those mentioned in those PRs. * The primary purpose of this PR is to provide a concrete example of how the regular expressions for `leadingAsteriskSpace` and `compileStripMarginPattern` be simplified. Discussions about readability are always subjective. I believe that removing the positive and negative lookaheads makes it more likely that the next maintainer will be able to read and comprehend the code. Discussions about runtime performance improvements without demonstrable data are worth than useless. lookaheads are known to have a high runtime cost. It is believable, but not asserted by this PR for lack of evidence, that the code of this PR will have better performance characteristics. * Whilst simplifying the two regex expressions, I noticed what I believe to be a bug in `compileStripMarginPattern`. Scala allows any character other than NULL to be used as a pipe character. Java requires that Dollar ($) and backslash (\) characters in the replacement string be quoted (e.g. in a string: "\\$"). The code prior to this PR did not do this. * The code of this PR passes all current scalafmt tests: `sbt> tests/test` ``` [info] ScalaTest [info] Run completed in 1 minute, 23 seconds. [info] Total number of tests run: 2406 [info] Suites: completed 18, aborted 0 [info] Tests: succeeded 2406, failed 0, canceled 0, ignored 8, pending 0 [info] All tests passed. [info] Passed: Total 2406, Failed 0, Errors 0, Passed 2406, Ignored 8 [success] Total time: 90 s (01:30), completed Apr 28, 2020, 2:41:12 PM ``` * Because of other pressing demands on my time, I did not craft a unit test before fixing it. If the basic changes of this PR are acceptable, I can study `./scalafmt-tests/src/test/resources/test/StripMargin.stat` to see if I can add cases for Dollar and Backslash. Documentation: * None required. The changes in this PR should cause no user visible change Testing: * Built from a clean slate and manually tested ("test/test") using sbt 1.3.10 & Java 11 on X86_64 only. All tests pass.

scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala

kitbellew · 2020-04-28T20:00:53Z

scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala

@@ -735,7 +737,7 @@ object FormatWriter {
  }

  private val leadingAsteriskSpace =
-    Pattern.compile("(?<=\\n)\\h*(?=[*][^*])")


Can you explain why allowing match on multiple asterisks is equivalent? I tried yesterday, there was a test which failed but looks like for you it didn't.

I took the requirement that " *X" gets replaced and " **" (double asterisk) does not as
"Do not even think about breaking this" lore. The match in the capture group
should match anything but what would be a second asterisk [^*].

ah, was looking at the code on my phone, didn't see the [^*]. never mind.

\n vs \\n, any difference?

\n vs \\n, any difference?

thanks for adding tests. this question above is the last one.

re: \n vs \\n, any difference?

Pardon me for being concrete, no disrespect intended. The concrete answer
is more of my inability to organize my thoughts above that. I have spent
hours, days, or, more probably weeks, being hosed by exactly this issue.
The recurrence interval between episodes of my being hosed is just slightly
longer than my memory of the solution. Mea culpa!

It is my understanding that for 'normal' strings \n is always used and gets translated
to one standard Java two-octet Character.

In s-interpolated strings (and possibly/probably other interpolations) things get
a bit more confusing, at least for me. I believe that a \n gets translated as for
a simple string. In \\n the first backslash quotes the second backslash, which
then becomes \n which Pattern.compile interprets as though the original
string had been \n.

So, both work in interpolated strings, unless you have evidence or experience to
the contrary. I tend to prefer the single \n because it has fewer characters and transformations to parse.

re: \n vs \\n, any difference?

So, both work in interpolated strings, unless you have evidence or experience to
the contrary. I tend to prefer the single \n because it has fewer characters and transformations to parse.

like \\s, \\h or other special escape sequences, i was curious whether \\n is interpreted by Pattern as anything different than simple 0xA. clearly, \n is directly substituted as 0xA so pattern only sees that.

Re: tests

I found StripMargin.stat and added tests for only the align aspect. I did not address
the no align case, which is a weasel, driven by limited time.

I had expected to find tests for leading and trailing spaces. I could not find anything relevant
after a number of trials: using find for filenames and/or grepping for "whitespace" and
combinations.

I did find a Unicode.stat and could slip in a case for leading and trailing horizontal
whitespace containing Unicode there. It would seem more integrated to have
it in a leadingWhitespace.spec if such a thing exists. Would, IMHO, make it easier
for future visiting devos, such as myself, to find.

Off the top of your head, do you have any clues or suggestions.

Off the top of your head, do you have any clues or suggestions.

I added one tab in the default/Comment test, see at the bottom:

https://github.com/scalameta/scalafmt/pull/1911/files

you can create a dedicated whitespace test but i considered it part of handling comments and strings.

@kitbellew The info you gave helped me to get to the next round. Thank you.

All tests, new & previous, pass on my system. I will check CI later.

The tests are still sensitive to some future devo removing the now invisible whitespace
especially the tabs.

I think the three regex expressions are as simple & fast as the requirements and reasonable
end use allows. I left a comment in FormatWriter.scala explaining why. Figured that might
endure long after the present conversation is forgotten.

I suspect that there are a couple of bugs remaining in scalafmt relating to tabs. I will see in
my mainline daily work if I can trick them from coming out of the dim shadows.

Unit tests prove their worth, much as they are a pain to write. The hard time I had
getting unit tests to accept Ogham marks and unicode \uMumble convinced me
that a simple "space or tab" test is all that is needed. There is nothing wrong with
"\h", just that it will spend a lot of time testing against characters which can not be
in the input. Sorry if I lead you down a loosing path.

If you like the current work, I can update & buff the commit message and change the
title.

scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala

kitbellew

looks good. when you are ready, perhaps remove the WIP indicator, and let's have @poslegm review as well, he reviewed the previous changes.

kitbellew · 2020-04-29T22:08:43Z

scalafmt-tests/src/test/resources/default/Comment.stat

@@ -125,7 +125,7 @@ object a {
  /*

   /**
-     		* comment


did you remove this because you create a separate test case for that?

Exactly. The case seemed to be multiplexed in a non-obvious (to me way).
I de-multiplexed the concerns.

Creating a second test with a title which described its primary (and sole) concerned
seemed to be a gift to future people trying to suss this stuff.
The added complexity of a second test seemed worth the benefit.

kitbellew · 2020-04-29T22:09:02Z

scalafmt-tests/src/test/resources/spaces/TrailingWhiteSpace.stat

+  val a = 1
+  val b = 2
+}
+


would you mind remove the 3 blank lines at the end?

Done, thank you. So you have something against trailing whitespace ;-)
Guess I had expected scalafmt to remove those for me. Must not be set in
the project .scalafmt.conf or I bungled it.

kitbellew · 2020-04-29T22:09:56Z

scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala

+    * and tab (\u0009) are horizontal whitespace.
+    * Save execution time by not checking what the character could not be.
+    *
+    * */


should this be */ or **/ instead of * */?

Being a visitor and it not being my usual coding style, for consistency I did a
"monkey see, monkey do" of several other comments in the same file.

Yes, the trailing "* */" is not a variant I have seen in recent memory,

That comment is re-worked in the commit I am about to submit, using my
usual "//" style. If the concept of the comment is good, I can adapt the
comment style to anything you suggest. Small dragons.

kitbellew · 2020-04-29T22:11:19Z

scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala

+    * https://github.com/scala/scala/blob/2.11.x/spec/01-lexical-syntax.md
+    * gives the recognized scala whitespace characters.
+    * "Whitespace characters. \U0020 | \U0009 | \U000D | \U000A"
+    * (original text has lowercase u. U substituted here to allow scalafmt


is this comment relevant to the code? :)

I had thought so, but that section is superseded now. After I commit, let me know if
I can improve the new text. I try to get enough across so that I can remember six months
down the road and so that a colleague can read it without complaining about "War & Peace".

kitbellew · 2020-04-29T22:12:50Z

scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala

-  private val trailingSpace = Pattern.compile("\\h+$", Pattern.MULTILINE)
+  /**
+    * No Unicode is necessary in the three (3) Pattern.compile() below.
+    * scalafmt parser will have already have objected.


are you saying that nothing other than space and tab are allowed by scalac or scalameta parser that precedes formatting?

It shouldn't be allowed according to specs.

A cursory examination of the scalameta sources led to this. So, it matches.

After another long day of creating test cases and trying to be perverse, it turns out
that trailing whitespace with Unicode whitespace is allowed in comments. As cited,
it is rejected in non-comment code.

I am about to do a commit which handles, tests, & succinctly (I hope) documents.

The discussion here helps.

kitbellew · 2020-04-30T15:58:30Z

@LeeTibbert will you be able to update this soon? we are planning to release v2.5.0 this week.

@poslegm can you please review? this is a follow-up to #1911, #1916 and #1917.

poslegm · 2020-04-30T16:54:34Z

@kitbellew I am ok with it. Also I think no one will be hurt if this goes into the 2.5.1 instead of 2.5.0 :) So no need to rush

LeeTibbert · 2020-04-30T20:34:52Z

If it does not cause unnecessary work at your end, I would be happier getting on the
2.5.1 train. I have appropriate confidence in the code, but always hate & fear
'just one more last minute change'.

Right now I just need to run scalafmt on my code and resolve some conflicts.

kitbellew · 2020-04-30T21:03:29Z

If it does not cause unnecessary work at your end, I would be happier getting on the
2.5.1 train. I have appropriate confidence in the code, but always hate & fear
'just one more last minute change'.

Right now I just need to run scalafmt on my code and resolve some conflicts.

Thank you for your work, patience with my (hopefully minor) nitpicking, and let's target 2.5.1.

* The primary purpose of this PR is simplify two regular expressions (regexen). One used by `leadingAsteriskSpace` and the other by `compileStripMarginPattern`. Discussions about readability are always subjective. I believe that removing the positive and negative lookaheads makes it more likely that the next maintainer will be able to read and comprehend the code. Discussions about runtime performance improvements without demonstrable data are worth than useless. lookaheads are known to have a high runtime cost. It is believable, but not asserted by this PR for lack of evidence, that the code of this PR will have better performance characteristics. * Unicode whitespace handling was added, along with test cases. * This PR supersedes the Work In Progress PR scalameta#1924. It has benefited greatly from discussions and multiple rounds of review there. My thanks and appreciation to the patient reviewers. * Whilst simplifying the two regex expressions, I noticed what I believe to be a bug in `compileStripMarginPattern`. Scala allows any character other than NULL to be used as a pipe character. Java requires that Dollar ($) and backslash (\) characters in the replacement string be quoted (e.g. in a string: "\\$"). The code prior to this PR did not do this. * Previous regexen precluded Scala Native support. The ones of this PR do not. Scala Native does not yet support "\h" but it is believable that such support could be added. * The handling of `compileStripMarginPattern` is a bit tricky. Every/any regex expression should be sealed with the Curse of the Mummy, but that one is particularly deserving. The active pipe character is not in scope where the replaceAll() is done. Using a capture group allows the pipe character from where the pattern is compiled to be used in replaceAll(), * I did not test it but the `\n` approach should work on the/most Windows operating system. It is what has been in the scalafmt code for years. Documentation: * None required. The changes in this PR should cause no user visible change. I do not know if the practice of scalafmt development is to create a change log entry for user invisible changes or if the git history is enough. Testing: * Built from a clean slate and manually tested ("test/test") using sbt 1.3.10 & Java 8 on X86_64 only. All tests pass.

LeeTibbert · 2020-05-01T00:05:35Z

@kitbellew I have appreciated your timely & patient review. Regular expressions are
supposed to save time & complexity but I find they really suck up time. This PR has
benefited from the review. I am sorry that it took so many cycles.

LeeTibbert · 2020-05-01T00:10:27Z

Superseded by PR #1936, its higher & purer form.

* The primary purpose of this PR is simplify two regular expressions (regexen). One used by `leadingAsteriskSpace` and the other by `compileStripMarginPattern`. Discussions about readability are always subjective. I believe that removing the positive and negative lookaheads makes it more likely that the next maintainer will be able to read and comprehend the code. Discussions about runtime performance improvements without demonstrable data are worth than useless. lookaheads are known to have a high runtime cost. It is believable, but not asserted by this PR for lack of evidence, that the code of this PR will have better performance characteristics. * Unicode whitespace handling was added, along with test cases. * This PR supersedes the Work In Progress PR #1924. It has benefited greatly from discussions and multiple rounds of review there. My thanks and appreciation to the patient reviewers. * Whilst simplifying the two regex expressions, I noticed what I believe to be a bug in `compileStripMarginPattern`. Scala allows any character other than NULL to be used as a pipe character. Java requires that Dollar ($) and backslash (\) characters in the replacement string be quoted (e.g. in a string: "\\$"). The code prior to this PR did not do this. * Previous regexen precluded Scala Native support. The ones of this PR do not. Scala Native does not yet support "\h" but it is believable that such support could be added. * The handling of `compileStripMarginPattern` is a bit tricky. Every/any regex expression should be sealed with the Curse of the Mummy, but that one is particularly deserving. The active pipe character is not in scope where the replaceAll() is done. Using a capture group allows the pipe character from where the pattern is compiled to be used in replaceAll(), * I did not test it but the `\n` approach should work on the/most Windows operating system. It is what has been in the scalafmt code for years. Documentation: * None required. The changes in this PR should cause no user visible change. I do not know if the practice of scalafmt development is to create a change log entry for user invisible changes or if the git history is enough. Testing: * Built from a clean slate and manually tested ("test/test") using sbt 1.3.10 & Java 8 on X86_64 only. All tests pass.

LeeTibbert changed the title ~~WIP: internal/FormatWriter simplify two regexen fix pipe char bug~~ WIP: internal/FormatWriter simplify two regexen; fix pipe char bug Apr 28, 2020

kitbellew reviewed Apr 28, 2020

View reviewed changes

Remove 2 unnecessary Pattern.MULTILINE; test for alternate pipe chars.

a2d666c

kitbellew reviewed Apr 29, 2020

View reviewed changes

Simplify regexen even more; work/re-work whitespace tests

4642b97

kitbellew requested changes Apr 29, 2020

View reviewed changes

kitbellew requested a review from poslegm April 30, 2020 15:58

LeeTibbert mentioned this pull request Apr 30, 2020

FormatWriter: simplify two regular expressions; fix pipe char bug #1936

Merged

LeeTibbert closed this May 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924

WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924

LeeTibbert commented Apr 28, 2020 •

edited

Loading

kitbellew Apr 28, 2020

LeeTibbert Apr 28, 2020

kitbellew Apr 28, 2020

kitbellew Apr 29, 2020

LeeTibbert Apr 29, 2020

kitbellew Apr 29, 2020

LeeTibbert Apr 29, 2020

kitbellew Apr 29, 2020

LeeTibbert Apr 29, 2020

kitbellew left a comment

kitbellew Apr 29, 2020

LeeTibbert Apr 30, 2020

kitbellew Apr 29, 2020

LeeTibbert Apr 30, 2020

kitbellew Apr 29, 2020

LeeTibbert Apr 30, 2020

kitbellew Apr 29, 2020

LeeTibbert Apr 30, 2020

kitbellew Apr 29, 2020

poslegm Apr 30, 2020

LeeTibbert Apr 30, 2020

kitbellew commented Apr 30, 2020

poslegm commented Apr 30, 2020

LeeTibbert commented Apr 30, 2020

kitbellew commented Apr 30, 2020

LeeTibbert commented May 1, 2020

LeeTibbert commented May 1, 2020

WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924

WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924

Conversation

LeeTibbert commented Apr 28, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kitbellew left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kitbellew commented Apr 30, 2020

poslegm commented Apr 30, 2020

LeeTibbert commented Apr 30, 2020

kitbellew commented Apr 30, 2020

LeeTibbert commented May 1, 2020

LeeTibbert commented May 1, 2020

LeeTibbert commented Apr 28, 2020 •

edited

Loading