Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed word boundary patterns: ... #100

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mykeul
Copy link
Contributor

@mykeul mykeul commented Dec 5, 2019

... they were ascii-only while java.util.regexp is unicode compliant

@googlebot
Copy link
Collaborator

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@mykeul mykeul force-pushed the unicode-word-boundary branch from cf7871d to c6d8b3a Compare December 5, 2019 15:33
@mykeul
Copy link
Contributor Author

mykeul commented Dec 5, 2019

@googlebot I signed it!

@googlebot
Copy link
Collaborator

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes and removed cla: no labels Dec 5, 2019
@mykeul mykeul force-pushed the unicode-word-boundary branch from c6d8b3a to ba9f187 Compare December 5, 2019 16:45
@coveralls
Copy link

Pull Request Test Coverage Report for Build 281

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 15 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.09%) to 93.957%

Files with Coverage Reduction New Missed Lines %
/home/travis/build/google/re2j/java/com/google/re2j/Regexp.java 1 94.58%
/home/travis/build/google/re2j/java/com/google/re2j/Simplify.java 1 93.75%
/home/travis/build/google/re2j/java/com/google/re2j/Unicode.java 6 84.31%
/home/travis/build/google/re2j/java/com/google/re2j/Parser.java 7 94.51%
Totals Coverage Status
Change from base Build 280: -0.09%
Covered Lines: 2985
Relevant Lines: 3177

💛 - Coveralls

@mykeul
Copy link
Contributor Author

mykeul commented Dec 5, 2019

Sorry, I had to @ignore a large test I don't know how to fix.

@mykeul mykeul force-pushed the unicode-word-boundary branch 2 times, most recently from d33b8c3 to 57a57c0 Compare June 10, 2020 13:35
@sjamesr
Copy link
Contributor

sjamesr commented Jun 22, 2020

RE2/J tries to implement RE2's syntax, which specifies that \b match ASCII word boundaries only. I think this behavior is working as intended.

@sjamesr sjamesr closed this Jun 22, 2020
@mykeul
Copy link
Contributor Author

mykeul commented Jun 15, 2021

I still do not understand why such PR making life easier for non-US devs is refused : this just makes transitions to re2j easier in many more use-cases, by reducing differences with jdk's Regexp.
For people like me who needs both #97 and not just "ascii regexps", here is the hack I'm using for more than one year (just need to call one of the static patch() method depending on the app context but asap, just with javassist in classpath) : https://gist.github.com/mykeul/2d69f72bcb73ff7a72f167a86329d0a8

@sjamesr
Copy link
Contributor

sjamesr commented Jun 15, 2021

I was clear as to why I refused the PR: because it caused RE2J to differ from RE2 in behavior. It has nothing to do with making life more difficult for non-US devs.

The question should be whether RE2J should try sticking to RE2 syntax and semantics, or whether it should favor similarity to java.util.regex when possible. I'm inclined to think that RE2J should behave more like java.util.regex, since that's what Java developers will most likely be comparing it to.

This PR probably breaks tests because the tests are derived from Go's regexp package, which implements RE2's ASCII-only word boundaries. I'll see what is necessary to fix the tests.

@sjamesr sjamesr reopened this Jun 15, 2021
@codecov-commenter
Copy link

codecov-commenter commented Jun 15, 2021

Codecov Report

Merging #100 (e49d00b) into master (b3bce7b) will decrease coverage by 0.22%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #100      +/-   ##
==========================================
- Coverage   89.25%   89.02%   -0.23%     
==========================================
  Files          19       19              
  Lines        3025     3026       +1     
  Branches      607      607              
==========================================
- Hits         2700     2694       -6     
- Misses        187      190       +3     
- Partials      138      142       +4     
Impacted Files Coverage Δ
java/com/google/re2j/Unicode.java 81.08% <100.00%> (+17.19%) ⬆️
java/com/google/re2j/Utils.java 83.33% <100.00%> (ø)
java/com/google/re2j/Simplify.java 84.37% <0.00%> (-6.25%) ⬇️
java/com/google/re2j/Regexp.java 90.64% <0.00%> (-0.99%) ⬇️
java/com/google/re2j/Parser.java 87.21% <0.00%> (-0.89%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3bce7b...e49d00b. Read the comment docs.

@mykeul
Copy link
Contributor Author

mykeul commented Jun 16, 2021

Hi James, many thanks,
(Sorry, I'm french, I'm not an exception to the world wide known "préjugé" : my english is bad, my terms may be inapropriated)
I got the point. Was just trying to share my experiments, my eventual fixes, and potential workarounds for this not accepted PR. Maybe RE2 would benefit from this too ?
I came to re2j for the potential longest matches that seemed impossible with java.util.regex, and proposed #97 and a few accepeted optimisations.
I need this gist in multiple projects now, just tried to share it, but really don't want to fork to an "unicode-re2j".
I'll try this/an evening (UTC+2) to merge current code into the initial PR to make an eventual merge easier.
Best regards,
Mickaël (please note the "ë" ;-) )

@google-cla
Copy link

google-cla bot commented Jun 17, 2021

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no and removed cla: yes labels Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants