Regex is incorrectly handling casing of some ranges #36149

stephentoub · 2020-05-08T21:10:10Z

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Console.WriteLine(Regex.IsMatch("\xF7", @"^(?i:[\xD7\xD8])$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant));
        Console.WriteLine(Regex.IsMatch("\xF7", @"^(?i:[\xD7-\xD8])$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant));
    }
}

The above two patterns should be identical, with on character set containing \xD7 and \xD8 and the other containing the range from \xD7 through \xD8 (which is just \xD7 and \xD8, since there's nothing in between them).

However, the first correctly prints false whereas the second incorrectly prints true.

The implementation handles casing by creating a character class that's the lowercased version of the original. That means that for individual characters, it just adds the lowercase character:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

Lines 554 to 555 in fd82afe

    
           char lower = culture.TextInfo.ToLower(range.First); 
        
           rangeList[i] = new SingleRange(lower, lower);

and for ranges it needs to add the lowercase character for each character in the range:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

Line 559 in fd82afe

AddLowercaseRange(range.First, range.Last);

In the first case above, it follows the first path, adding in the ToLower(\xD7) (which is just \xD7) and the ToLower(\xD8) (which is \xF8).

In the second case, however, it follows the second path, and ends up incorrectly adding a range from \xF7 through \xF8.

As a result, the second case ends up incorrectly matching \xF7.

cc: @eerhardt, @pgovind

The text was updated successfully, but these errors were encountered:

ghost · 2020-05-08T21:10:13Z

Tagging subscribers to this area: @eerhardt
Notify danmosemsft if you want to be subscribed.

pgovind · 2020-09-14T18:42:44Z

Just adding some notes here: The range /xD7 - /xD8 is 215 - 216 in decimal. char.ToLower(/xD7) is /xD7 i.e. 215. char.ToLower(/xD8) is however 248. Internally, we use a s_lcTable to map capital letters to small letters. However, the mapping logic in AddLowercaseRange assumes that the capital letters range.First and range.Last can be converted to range.First+32 and range.Last+32 respectively. Therefore, we end up adding the range 247 - 248 in the AddLowercaseRange method! (which is why we match /xF7 (which has the decimal value 247).

I have some questions (@GrabYourPitchforks or @tarekgh perhaps?) before I can put up a fix:

When I look here, I see that /xD7 (decimal value of 215) does NOT appear in the list. Let's call this a "hole". Does this mean /xD7 is special somehow? Are there other such holes in the unicode spec? If so, all those cases will probably be bugs in our current implementation.
If /xD7 has a special Unicode category, that would make it easy to put up a fix. If not, is there an easy way to determine how many "holes" exist given a start and end point?

A fix such as checking if (char.ToLower(something) == char.ToUpper(something)) might work, but I'd like to get more information before I test it.

GrabYourPitchforks · 2020-09-15T00:46:20Z

U+00D7 is a symbol (see https://www.fileformat.info/info/unicode/char/00d7/index.htm), so it doesn't have an associated case.

You'll occasionally see symbols and other non-alpha characters sitting in the middle of character blocks. You'll also see some ranges where the uppercase and lowercase variants aren't separated by 32. For example, Ģ (U+0122) is the uppercase form of ģ (U+0123). That's a separation of only 1 code point, not 32 code points.

Best thing to do might be to case-map each code point independently, then see if you can generate any ranges from those maps. For example, consider that you're given the regex "allow 100 - 109, case-insensitive." And the lowercase mappings are as follows:

; examples only
100 -> 132
101 -> 133
102 -> 134
103 -> 103 ; symbol, not an alpha character
104 -> 135
105 -> 136
106 -> 137
107 -> 138
108 -> 139
109 -> 140

Then ideally we'd add the ranges [100 - 109], [132 - 134], and [135 - 140] (all inclusive) to the match list. With some additional logic we could compact overlapping or adjacent ranges.

Also worth pointing out: string.Contains(a, b, StringComparison.OrdinalIgnoreCase) normalizes to uppercase before performing the check. Has Regex.IsMatch(..., IgnoreCase) always normalized to lowercase? This could result in string.Contains and Regex.IsMatch returning different results for the same inputs. It might be worth investigating getting these two implementations in sync so as not to surprise callers.

GrabYourPitchforks · 2020-09-15T00:53:19Z

BTW, here are the two resources I use for getting information on Unicode code points.

https://unicode.org/cldr/utility/character.jsp is the authoritative source. It also has click-through navigation, so you can ask for things like "show me all code points which, when uppercased, convert to the code point I'm looking at right now." To use it, either paste the character itself into the text box at the top of the page, or write it as a hex-formatted number (padded to at least 4 digits, no 0x prefix). Example: 00D7 or 12345.

https://www.fileformat.info/info/unicode/char/0000/index.htm is also a useful resource. It will show you the UTF-8 or UTF-16 encodings of the scalar value, and it will also show you the best attempt at rendering the glyph, even if you don't have an appropriate font pack installed locally. To use it, replace the "0000" in the URL above with the hex-formatted number for your code point (padded to at least 4 digits, no 0x prefix). Example: /00D7/index.htm or /12345/index.htm.

stephentoub · 2020-09-15T01:08:11Z

Has Regex.IsMatch(..., IgnoreCase) always normalized to lowercase?

Yup

pgovind · 2021-09-21T20:40:28Z

#42282 is being reverted in #59425. Re-opening this issue.

joperezr · 2022-04-05T23:08:21Z

I have validated that #67184 fixes this

stephentoub added the area-System.Text.RegularExpressions label May 8, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 8, 2020

stephentoub mentioned this issue May 8, 2020

Regex isn't factoring target culture into lowercasing of ranges #36147

Closed

stephentoub removed the untriaged New issue has not been triaged by the area owner label Jun 28, 2020

stephentoub added this to the Future milestone Jun 28, 2020

pgovind mentioned this issue Sep 15, 2020

Fix incorrect handling of character range and capitalization in regex #42282

Merged

pgovind closed this as completed in #42282 Nov 2, 2020

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020

pgovind reopened this Sep 21, 2021

stephentoub closed this as completed in #67184 Apr 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex is incorrectly handling casing of some ranges #36149

Regex is incorrectly handling casing of some ranges #36149

stephentoub commented May 8, 2020

ghost commented May 8, 2020

pgovind commented Sep 14, 2020 •

edited

Loading

GrabYourPitchforks commented Sep 15, 2020

GrabYourPitchforks commented Sep 15, 2020

stephentoub commented Sep 15, 2020

pgovind commented Sep 21, 2021

joperezr commented Apr 5, 2022

Regex is incorrectly handling casing of some ranges #36149

Regex is incorrectly handling casing of some ranges #36149

Comments

stephentoub commented May 8, 2020

ghost commented May 8, 2020

pgovind commented Sep 14, 2020 • edited Loading

GrabYourPitchforks commented Sep 15, 2020

GrabYourPitchforks commented Sep 15, 2020

stephentoub commented Sep 15, 2020

pgovind commented Sep 21, 2021

joperezr commented Apr 5, 2022

pgovind commented Sep 14, 2020 •

edited

Loading