Regex is incorrectly handling casing of some ranges #36149
Comments
Tagging subscribers to this area: @eerhardt |
Just adding some notes here: The range I have some questions (@GrabYourPitchforks or @tarekgh perhaps?) before I can put up a fix:
A fix such as checking if ( |
You'll occasionally see symbols and other non-alpha characters sitting in the middle of character blocks. You'll also see some ranges where the uppercase and lowercase variants aren't separated by 32. For example, Ģ (
Best thing to do might be to case-map each code point independently, then see if you can generate any ranges from those maps. For example, consider that you're given the regex "allow 100 - 109, case-insensitive." And the lowercase mappings are as follows:
; examples only
100 -> 132
101 -> 133
102 -> 134
103 -> 103 ; symbol, not an alpha character
104 -> 135
105 -> 136
106 -> 137
107 -> 138
108 -> 139
109 -> 140
Then ideally we'd add the ranges
Also worth pointing out: |
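The "case-map each code point independently, then regenerate ranges" idea from the comment above can be sketched roughly as follows. This is a minimal illustration in Python rather than the C# used by the runtime; the helper names `coalesce` and `case_insensitive_ranges` are hypothetical, and the mapping table is the made-up one from the comment, not real Unicode data.

```python
def coalesce(code_points):
    """Collapse integers into sorted, inclusive (start, end) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            # Extends the previous range by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def case_insensitive_ranges(first, last, to_lower):
    """Map every code point in [first, last] on its own, then rebuild ranges."""
    originals = list(range(first, last + 1))
    lowered = [to_lower(cp) for cp in originals]
    return coalesce(originals + lowered)


# Hypothetical mappings copied from the example table (not real Unicode data).
EXAMPLE_LOWER = {
    100: 132, 101: 133, 102: 134,
    103: 103,  # symbol, maps to itself
    104: 135, 105: 136, 106: 137, 107: 138, 108: 139, 109: 140,
}

example_ranges = case_insensitive_ranges(100, 109, EXAMPLE_LOWER.__getitem__)
print(example_ranges)  # [(100, 109), (132, 140)]
```

Because 103 maps to itself, it contributes nothing new, and the remaining lowercase values happen to be contiguous, so they collapse into a single extra range without assuming any fixed offset between the two cases.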
BTW, here are the two resources I use for getting information on Unicode code points.
https://unicode.org/cldr/utility/character.jsp is the authoritative source. It also has click-through navigation, so you can ask for things like "show me all code points which, when uppercased, convert to the code point I'm looking at right now." To use it, either paste the character itself into the text box at the top of the page, or write it as a hex-formatted number (padded to at least 4 digits, no 0x prefix). Example:
https://www.fileformat.info/info/unicode/char/0000/index.htm is also a useful resource. It will show you the UTF-8 or UTF-16 encodings of the scalar value, and it will also show you the best attempt at rendering the glyph, even if you don't have an appropriate font pack installed locally. To use it, replace the "0000" in the URL above with the hex-formatted number for your code point (padded to at least 4 digits, no 0x prefix). Example: /00D7/index.htm or /12345/index.htm. |
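As a small convenience, the "hex-formatted, padded to at least 4 digits, no 0x prefix" rule for the fileformat.info URL can be expressed in a few lines. This is only a sketch in Python; the function name `fileformat_info_url` is hypothetical, and the URL pattern is the one quoted in the comment above.

```python
def fileformat_info_url(code_point):
    """Build the lookup URL for a code point, following the padding rule above."""
    # "04X" pads to at least 4 hex digits and adds no 0x prefix;
    # longer values such as 0x12345 are left at their natural width.
    hex_digits = format(code_point, "04X")
    return f"https://www.fileformat.info/info/unicode/char/{hex_digits}/index.htm"


print(fileformat_info_url(0xD7))     # .../char/00D7/index.htm
print(fileformat_info_url(0x12345))  # .../char/12345/index.htm
```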
Yup |
I have validated that #67184 fixes this |
The above two patterns should be identical, with one character set containing \xD7 and \xD8 and the other containing the range from \xD7 through \xD8 (which is just \xD7 and \xD8, since there's nothing in between them).
However, the first correctly prints false, whereas the second incorrectly prints true.
The implementation handles casing by creating a character class that's the lowercased version of the original. That means that for individual characters, it just adds the lowercase character:
runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Lines 554 to 555 in fd82afe
and for ranges it needs to add the lowercase character for each character in the range:
runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Line 559 in fd82afe
In the first case above, it follows the first path, adding in the ToLower(\xD7) (which is just \xD7) and the ToLower(\xD8) (which is \xF8).
In the second case, however, it follows the second path, and ends up incorrectly adding a range from \xF7 through \xF8.
As a result, the second case ends up incorrectly matching \xF7.
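The difference between the two paths can be illustrated with a short sketch. It is written in Python, not C#, and it is not the code in RegexCharClass.cs: `lower_each` stands in for lowercasing each character individually, and `lower_by_offset` is a hypothetical stand-in that assumes the whole range shifts by a fixed 32, which reproduces the \xF7 through \xF8 result described above. Python's `str.lower()` is culture-invariant, so it only approximates .NET's ToLower.

```python
def lower_each(first, last):
    """Lowercase every code point in [first, last] on its own."""
    result = set()
    for cp in range(first, last + 1):
        lowered = chr(cp).lower()
        # Keep only simple one-to-one mappings for this illustration.
        result.add(ord(lowered) if len(lowered) == 1 else cp)
    return result


def lower_by_offset(first, last, offset=32):
    """Assumed naive approach: shift the whole range by one fixed offset."""
    return set(range(first + offset, last + offset + 1))


per_code_point = lower_each(0xD7, 0xD8)
shifted = lower_by_offset(0xD7, 0xD8)

print(sorted(hex(cp) for cp in per_code_point))  # ['0xd7', '0xf8']
print(sorted(hex(cp) for cp in shifted))         # ['0xf7', '0xf8']
```

The per-code-point result leaves \xD7 (the multiplication sign) alone and maps \xD8 to \xF8, so \xF7 (the division sign) is never added. The fixed-offset result pulls in \xF7, which is the incorrect match reported here.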
cc: @eerhardt, @pgovind