Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex is incorrectly handling casing of some ranges #36149

Closed
stephentoub opened this issue May 8, 2020 · 7 comments · Fixed by #42282 or #67184
Closed

Regex is incorrectly handling casing of some ranges #36149

stephentoub opened this issue May 8, 2020 · 7 comments · Fixed by #42282 or #67184

Comments

@stephentoub
Copy link
Member

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Console.WriteLine(Regex.IsMatch("\xF7", @"^(?i:[\xD7\xD8])$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant));
        Console.WriteLine(Regex.IsMatch("\xF7", @"^(?i:[\xD7-\xD8])$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant));
    }
}

The above two patterns should be identical, with on character set containing \xD7 and \xD8 and the other containing the range from \xD7 through \xD8 (which is just \xD7 and \xD8, since there's nothing in between them).

However, the first correctly prints false whereas the second incorrectly prints true.

The implementation handles casing by creating a character class that's the lowercased version of the original. That means that for individual characters, it just adds the lowercase character:

char lower = culture.TextInfo.ToLower(range.First);
rangeList[i] = new SingleRange(lower, lower);

and for ranges it needs to add the lowercase character for each character in the range:

In the first case above, it follows the first path, adding in the ToLower(\xD7) (which is just \xD7) and the ToLower(\xD8) (which is \xF8).

In the second case, however, it follows the second path, and ends up incorrectly adding a range from \xF7 through \xF8.

As a result, the second case ends up incorrectly matching \xF7.

cc: @eerhardt, @pgovind

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 8, 2020
@ghost
Copy link

ghost commented May 8, 2020

Tagging subscribers to this area: @eerhardt
Notify danmosemsft if you want to be subscribed.

@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Jun 28, 2020
@stephentoub stephentoub added this to the Future milestone Jun 28, 2020
@pgovind
Copy link

pgovind commented Sep 14, 2020

Just adding some notes here: The range /xD7 - /xD8 is 215 - 216 in decimal. char.ToLower(/xD7) is /xD7 i.e. 215. char.ToLower(/xD8) is however 248. Internally, we use a s_lcTable to map capital letters to small letters. However, the mapping logic in AddLowercaseRange assumes that the capital letters range.First and range.Last can be converted to range.First+32 and range.Last+32 respectively. Therefore, we end up adding the range 247 - 248 in the AddLowercaseRange method! (which is why we match /xF7 (which has the decimal value 247).

I have some questions (@GrabYourPitchforks or @tarekgh perhaps?) before I can put up a fix:

  1. When I look here, I see that /xD7 (decimal value of 215) does NOT appear in the list. Let's call this a "hole". Does this mean /xD7 is special somehow? Are there other such holes in the unicode spec? If so, all those cases will probably be bugs in our current implementation.
  2. If /xD7 has a special Unicode category, that would make it easy to put up a fix. If not, is there an easy way to determine how many "holes" exist given a start and end point?

A fix such as checking if (char.ToLower(something) == char.ToUpper(something)) might work, but I'd like to get more information before I test it.

@GrabYourPitchforks
Copy link
Member

U+00D7 is a symbol (see https://www.fileformat.info/info/unicode/char/00d7/index.htm), so it doesn't have an associated case.

You'll occasionally see symbols and other non-alpha characters sitting in the middle of character blocks. You'll also see some ranges where the uppercase and lowercase variants aren't separated by 32. For example, Ģ (U+0122) is the uppercase form of ģ (U+0123). That's a separation of only 1 code point, not 32 code points.

Best thing to do might be to case-map each code point independently, then see if you can generate any ranges from those maps. For example, consider that you're given the regex "allow 100 - 109, case-insensitive." And the lowercase mappings are as follows:

; examples only
100 -> 132
101 -> 133
102 -> 134
103 -> 103 ; symbol, not an alpha character
104 -> 135
105 -> 136
106 -> 137
107 -> 138
108 -> 139
109 -> 140

Then ideally we'd add the ranges [100 - 109], [132 - 134], and [135 - 140] (all inclusive) to the match list. With some additional logic we could compact overlapping or adjacent ranges.

Also worth pointing out: string.Contains(a, b, StringComparison.OrdinalIgnoreCase) normalizes to uppercase before performing the check. Has Regex.IsMatch(..., IgnoreCase) always normalized to lowercase? This could result in string.Contains and Regex.IsMatch returning different results for the same inputs. It might be worth investigating getting these two implementations in sync so as not to surprise callers.

@GrabYourPitchforks
Copy link
Member

BTW, here are the two resources I use for getting information on Unicode code points.

https://unicode.org/cldr/utility/character.jsp is the authoritative source. It also has click-through navigation, so you can ask for things like "show me all code points which, when uppercased, convert to the code point I'm looking at right now." To use it, either paste the character itself into the text box at the top of the page, or write it as a hex-formatted number (padded to at least 4 digits, no 0x prefix). Example: 00D7 or 12345.

https://www.fileformat.info/info/unicode/char/0000/index.htm is also a useful resource. It will show you the UTF-8 or UTF-16 encodings of the scalar value, and it will also show you the best attempt at rendering the glyph, even if you don't have an appropriate font pack installed locally. To use it, replace the "0000" in the URL above with the hex-formatted number for your code point (padded to at least 4 digits, no 0x prefix). Example: /00D7/index.htm or /12345/index.htm.

@stephentoub
Copy link
Member Author

Has Regex.IsMatch(..., IgnoreCase) always normalized to lowercase?

Yup

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
@pgovind
Copy link

pgovind commented Sep 21, 2021

#42282 is being reverted in #59425. Re-opening this issue.

@pgovind pgovind reopened this Sep 21, 2021
@joperezr
Copy link
Member

joperezr commented Apr 5, 2022

I have validated that #67184 fixes this

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.