-
Notifications
You must be signed in to change notification settings - Fork 278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924
WIP: internal/FormatWriter simplify two regexen; fix pipe char bug #1924
Conversation
* This PR was motivated by recent discussions related to Scalafmt PRs scalameta#1916, scalameta#1917, and those mentioned in those PRs. * The primary purpose of this PR is to provide a concrete example of how the regular expressions for `leadingAsteriskSpace` and `compileStripMarginPattern` be simplified. Discussions about readability are always subjective. I believe that removing the positive and negative lookaheads makes it more likely that the next maintainer will be able to read and comprehend the code. Discussions about runtime performance improvements without demonstrable data are worth than useless. lookaheads are known to have a high runtime cost. It is believable, but not asserted by this PR for lack of evidence, that the code of this PR will have better performance characteristics. * Whilst simplifying the two regex expressions, I noticed what I believe to be a bug in `compileStripMarginPattern`. Scala allows any character other than NULL to be used as a pipe character. Java requires that Dollar ($) and backslash (\) characters in the replacement string be quoted (e.g. in a string: "\\$"). The code prior to this PR did not do this. * The code of this PR passes all current scalafmt tests: `sbt> tests/test` ``` [info] ScalaTest [info] Run completed in 1 minute, 23 seconds. [info] Total number of tests run: 2406 [info] Suites: completed 18, aborted 0 [info] Tests: succeeded 2406, failed 0, canceled 0, ignored 8, pending 0 [info] All tests passed. [info] Passed: Total 2406, Failed 0, Errors 0, Passed 2406, Ignored 8 [success] Total time: 90 s (01:30), completed Apr 28, 2020, 2:41:12 PM ``` * Because of other pressing demands on my time, I did not craft a unit test before fixing it. If the basic changes of this PR are acceptable, I can study `./scalafmt-tests/src/test/resources/test/StripMargin.stat` to see if I can add cases for Dollar and Backslash. Documentation: * None required. The changes in this PR should cause no user visible change Testing: * Built from a clean slate and manually tested ("test/test") using sbt 1.3.10 & Java 11 on X86_64 only. All tests pass.
scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala
Outdated
Show resolved
Hide resolved
@@ -735,7 +737,7 @@ object FormatWriter { | |||
} | |||
|
|||
private val leadingAsteriskSpace = | |||
Pattern.compile("(?<=\\n)\\h*(?=[*][^*])") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why allowing match on multiple asterisks is equivalent? I tried yesterday, there was a test which failed but looks like for you it didn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the requirement that " *X" gets replaced and " **" (double asterisk) does not as
"Do not even think about breaking this" lore. The match in the capture group
should match anything but what would be a second asterisk [^*].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, was looking at the code on my phone, didn't see the [^*]
. never mind.
\n
vs \\n
, any difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\n
vs\\n
, any difference?
thanks for adding tests. this question above is the last one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: \n vs \\n, any difference?
Pardon me for being concrete, no disrespect intended. The concrete answer
is more of my inability to organize my thoughts above that. I have spent
hours, days, or, more probably weeks, being hosed by exactly this issue.
The recurrence interval between episodes of my being hosed is just slightly
longer than my memory of the solution. Mea culpa!
It is my understanding that for 'normal' strings \n
is always used and gets translated
to one standard Java two-octet Character.
In s-interpolated strings (and possibly/probably other interpolations) things get
a bit more confusing, at least for me. I believe that a \n
gets translated as for
a simple string. In \\n
the first backslash quotes the second backslash, which
then becomes \n
which Pattern.compile interprets as though the original
string had been \n
.
So, both work in interpolated strings, unless you have evidence or experience to
the contrary. I tend to prefer the single \n
because it has fewer characters and transformations to parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re:
\n vs \\n, any difference?
So, both work in interpolated strings, unless you have evidence or experience to
the contrary. I tend to prefer the single\n
because it has fewer characters and transformations to parse.
like \\s
, \\h
or other special escape sequences, i was curious whether \\n
is interpreted by Pattern
as anything different than simple 0xA
. clearly, \n
is directly substituted as 0xA so pattern only sees that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: tests
I found StripMargin.stat and added tests for only the align
aspect. I did not address
the no align
case, which is a weasel, driven by limited time.
I had expected to find tests for leading and trailing spaces. I could not find anything relevant
after a number of trials: using find for filenames and/or grepping for "whitespace" and
combinations.
I did find a Unicode.stat and could slip in a case for leading and trailing horizontal
whitespace containing Unicode there. It would seem more integrated to have
it in a leadingWhitespace.spec if such a thing exists. Would, IMHO, make it easier
for future visiting devos, such as myself, to find.
Off the top of your head, do you have any clues or suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Off the top of your head, do you have any clues or suggestions.
I added one tab in the default/Comment
test, see at the bottom:
https://github.com/scalameta/scalafmt/pull/1911/files
you can create a dedicated whitespace test but i considered it part of handling comments and strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kitbellew The info you gave helped me to get to the next round. Thank you.
All tests, new & previous, pass on my system. I will check CI later.
The tests are still sensitive to some future devo removing the now invisible whitespace
especially the tabs.
I think the three regex expressions are as simple & fast as the requirements and reasonable
end use allows. I left a comment in FormatWriter.scala explaining why. Figured that might
endure long after the present conversation is forgotten.
I suspect that there are a couple of bugs remaining in scalafmt relating to tabs. I will see in
my mainline daily work if I can trick them from coming out of the dim shadows.
Unit tests prove their worth, much as they are a pain to write. The hard time I had
getting unit tests to accept Ogham marks and unicode \uMumble convinced me
that a simple "space or tab" test is all that is needed. There is nothing wrong with
"\h", just that it will spend a lot of time testing against characters which can not be
in the input. Sorry if I lead you down a loosing path.
If you like the current work, I can update & buff the commit message and change the
title.
scalafmt-core/shared/src/main/scala/org/scalafmt/internal/FormatWriter.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. when you are ready, perhaps remove the WIP indicator, and let's have @poslegm review as well, he reviewed the previous changes.
@@ -125,7 +125,7 @@ object a { | |||
/* | |||
|
|||
/** | |||
* comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you remove this because you create a separate test case for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. The case seemed to be multiplexed in a non-obvious (to me way).
I de-multiplexed the concerns.
Creating a second test with a title which described its primary (and sole) concerned
seemed to be a gift to future people trying to suss this stuff.
The added complexity of a second test seemed worth the benefit.
val a = 1 | ||
val b = 2 | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you mind remove the 3 blank lines at the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thank you. So you have something against trailing whitespace ;-)
Guess I had expected scalafmt to remove those for me. Must not be set in
the project .scalafmt.conf or I bungled it.
* and tab (\u0009) are horizontal whitespace. | ||
* Save execution time by not checking what the character could not be. | ||
* | ||
* */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be */
or **/
instead of * */
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being a visitor and it not being my usual coding style, for consistency I did a
"monkey see, monkey do" of several other comments in the same file.
Yes, the trailing "* */" is not a variant I have seen in recent memory,
That comment is re-worked in the commit I am about to submit, using my
usual "//" style. If the concept of the comment is good, I can adapt the
comment style to anything you suggest. Small dragons.
* https://github.com/scala/scala/blob/2.11.x/spec/01-lexical-syntax.md | ||
* gives the recognized scala whitespace characters. | ||
* "Whitespace characters. \U0020 | \U0009 | \U000D | \U000A" | ||
* (original text has lowercase u. U substituted here to allow scalafmt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment relevant to the code? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought so, but that section is superseded now. After I commit, let me know if
I can improve the new text. I try to get enough across so that I can remember six months
down the road and so that a colleague can read it without complaining about "War & Peace".
private val trailingSpace = Pattern.compile("\\h+$", Pattern.MULTILINE) | ||
/** | ||
* No Unicode is necessary in the three (3) Pattern.compile() below. | ||
* scalafmt parser will have already have objected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you saying that nothing other than space and tab are allowed by scalac or scalameta parser that precedes formatting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be allowed according to specs.
A cursory examination of the scalameta sources led to this. So, it matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After another long day of creating test cases and trying to be perverse, it turns out
that trailing whitespace with Unicode whitespace is allowed in comments. As cited,
it is rejected in non-comment code.
I am about to do a commit which handles, tests, & succinctly (I hope) documents.
The discussion here helps.
@LeeTibbert will you be able to update this soon? we are planning to release v2.5.0 this week. @poslegm can you please review? this is a follow-up to #1911, #1916 and #1917. |
@kitbellew I am ok with it. Also I think no one will be hurt if this goes into the 2.5.1 instead of 2.5.0 :) So no need to rush |
If it does not cause unnecessary work at your end, I would be happier getting on the Right now I just need to run scalafmt on my code and resolve some conflicts. |
Thank you for your work, patience with my (hopefully minor) nitpicking, and let's target 2.5.1. |
* The primary purpose of this PR is simplify two regular expressions (regexen). One used by `leadingAsteriskSpace` and the other by `compileStripMarginPattern`. Discussions about readability are always subjective. I believe that removing the positive and negative lookaheads makes it more likely that the next maintainer will be able to read and comprehend the code. Discussions about runtime performance improvements without demonstrable data are worth than useless. lookaheads are known to have a high runtime cost. It is believable, but not asserted by this PR for lack of evidence, that the code of this PR will have better performance characteristics. * Unicode whitespace handling was added, along with test cases. * This PR supersedes the Work In Progress PR scalameta#1924. It has benefited greatly from discussions and multiple rounds of review there. My thanks and appreciation to the patient reviewers. * Whilst simplifying the two regex expressions, I noticed what I believe to be a bug in `compileStripMarginPattern`. Scala allows any character other than NULL to be used as a pipe character. Java requires that Dollar ($) and backslash (\) characters in the replacement string be quoted (e.g. in a string: "\\$"). The code prior to this PR did not do this. * Previous regexen precluded Scala Native support. The ones of this PR do not. Scala Native does not yet support "\h" but it is believable that such support could be added. * The handling of `compileStripMarginPattern` is a bit tricky. Every/any regex expression should be sealed with the Curse of the Mummy, but that one is particularly deserving. The active pipe character is not in scope where the replaceAll() is done. Using a capture group allows the pipe character from where the pattern is compiled to be used in replaceAll(), * I did not test it but the `\n` approach should work on the/most Windows operating system. It is what has been in the scalafmt code for years. Documentation: * None required. The changes in this PR should cause no user visible change. I do not know if the practice of scalafmt development is to create a change log entry for user invisible changes or if the git history is enough. Testing: * Built from a clean slate and manually tested ("test/test") using sbt 1.3.10 & Java 8 on X86_64 only. All tests pass.
@kitbellew I have appreciated your timely & patient review. Regular expressions are |
Superseded by PR #1936, its higher & purer form. |
* The primary purpose of this PR is simplify two regular expressions (regexen). One used by `leadingAsteriskSpace` and the other by `compileStripMarginPattern`. Discussions about readability are always subjective. I believe that removing the positive and negative lookaheads makes it more likely that the next maintainer will be able to read and comprehend the code. Discussions about runtime performance improvements without demonstrable data are worth than useless. lookaheads are known to have a high runtime cost. It is believable, but not asserted by this PR for lack of evidence, that the code of this PR will have better performance characteristics. * Unicode whitespace handling was added, along with test cases. * This PR supersedes the Work In Progress PR #1924. It has benefited greatly from discussions and multiple rounds of review there. My thanks and appreciation to the patient reviewers. * Whilst simplifying the two regex expressions, I noticed what I believe to be a bug in `compileStripMarginPattern`. Scala allows any character other than NULL to be used as a pipe character. Java requires that Dollar ($) and backslash (\) characters in the replacement string be quoted (e.g. in a string: "\\$"). The code prior to this PR did not do this. * Previous regexen precluded Scala Native support. The ones of this PR do not. Scala Native does not yet support "\h" but it is believable that such support could be added. * The handling of `compileStripMarginPattern` is a bit tricky. Every/any regex expression should be sealed with the Curse of the Mummy, but that one is particularly deserving. The active pipe character is not in scope where the replaceAll() is done. Using a capture group allows the pipe character from where the pattern is compiled to be used in replaceAll(), * I did not test it but the `\n` approach should work on the/most Windows operating system. It is what has been in the scalafmt code for years. Documentation: * None required. The changes in this PR should cause no user visible change. I do not know if the practice of scalafmt development is to create a change log entry for user invisible changes or if the git history is enough. Testing: * Built from a clean slate and manually tested ("test/test") using sbt 1.3.10 & Java 8 on X86_64 only. All tests pass.
This PR was motivated by recent discussions related to Scalafmt
PRs FormatWriter: use
\h
for horizontal whitespace #1916, FormatWriter bugfix: add margin in the middle only #1917, and those mentioned in those PRs.The primary purpose of this PR is to provide a concrete example
of how the regular expressions for
leadingAsteriskSpace
andcompileStripMarginPattern
be simplified.Discussions about readability are always subjective. I believe
that removing the positive and negative lookaheads makes it
more likely that the next maintainer will be able to read and
comprehend the code.
Discussions about runtime performance improvements without
demonstrable data are worth than useless. lookaheads are
known to have a high runtime cost. It is believable, but not
asserted by this PR for lack of evidence, that the code of this
PR will have better performance characteristics.
Whilst simplifying the two regex expressions, I noticed what I
believe to be a bug in
compileStripMarginPattern
. Scala allowsany character other than NULL to be used as a pipe character.
Java requires that Dollar ($) and backslash () characters
in the replacement string be quoted (e.g. in a string: "\$").
The code prior to this PR did not do this.
The handling of
compileStripMarginPattern
is a bit tricky. Every/any regex expression shouldbe sealed with the Curse of the Mummy, but that one is particularly deserving.
The active pipe character is not in scope where the replaceAll() is done.
Using a capture group allows the pipe character from where the pattern is compiled to be
used in replaceAll(),
Agreed, a code smell, but, as a guest, I wanted/needed to minimize changes to the
existing code. One capture group is not all that expensive. The "[^*]" character class
in
leadingAsteriskSpace
probably takes more memory.The code of this PR passes all current scalafmt tests:
sbt> tests/test
Three cases were added to
./scalafmt-tests/src/test/resources/test/StripMargin.stat
toensure that potentially troublesome stripMargin() pipe characters, such as Dollar Sign,
Backslash, and Right Parenthesis work as intended.
Test cases for Unicode characters in leadingAsteriskSpace for trailing whitespace
remain as an exercise for the reader. The lack of such tests is a blemish on my
engineering.
I did not test it but the
\n
approach should work on the/most Windows operating system.It is what has been in the code for years.
The regular expressions in this PR should work with Scala Native and make an eventual
port easier.
Documentation:
change. I do not know if the practice of scalafmt development is to create
a change log entry for user invisible changes or if the git history is enough.
Testing:
using sbt 1.3.10 & Java 11 on X86_64 only. All tests pass.