-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose IUtf8SpanParsable and implement it on the primitive numeric types #86875
Conversation
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsThis makes progress towards #81500 Still not covered are:
|
|
WASM Interpreter reports
|
CC. @stephentoub |
src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Utf8.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Utf8.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.Utf8.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.Utf8.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Globalization.Utf8.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Trim.Utf8.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Numerics/INumberBase.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Utf8.cs
Show resolved
Hide resolved
99763f5
to
c7676cd
Compare
@@ -1023,7 +1023,7 @@ static bool TryParseRareTypes(RuntimeType rt, ReadOnlySpan<char> value, bool ign | |||
|
|||
if (throwOnFailure) | |||
{ | |||
Number.ThrowOverflowException(Type.GetTypeCode(typeof(TUnderlying))); | |||
ThrowHelper.ThrowOverflowException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 959 above was modified to read Number.ThrowOverflowException<TUnderlying>();
, but this line was modified to remove the type code / name entirely. Was this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The underlying here was only hitting the "rare types" (float
, double
, nint
, nuint
, or char
). None of these were handled by Number.ThrowOverflowException(TypeCode)
and would have triggered asserts in debugging (since they aren't decimal) and using an incorrect string otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directly throwing OverflowException
without a string is both simpler and more accurate than what we had before.
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
internal static bool Vector128OrdinalIgnoreCaseAscii(Vector128<byte> vec1, Vector128<byte> vec2) | ||
{ | ||
// ASSUMPTION: Caller has validated that input values are ASCII. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible more optimized (completely untested!) implementation, based solely on codegen size.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool Vector128OrdinalIgnoreCaseAscii(Vector128<byte> vec1, Vector128<byte> vec2)
{
// ASSUMPTION: Caller has validated that input values are ASCII.
Vector128<sbyte> vector0x20 = Vector128.Create((sbyte)0x20);
// Convert all characters to lowercase.
// Some non-letter chars will also be changed; we'll deal with them later.
Vector128<sbyte> asLower = vec1.AsSByte() | vector0x20;
// Create a vector where all letter characters [a-z] are normalized to 0x20
// and all non-letter characters are normalized to 0x00.
Vector128<sbyte> letterChars = Vector128.GreaterThan(asLower + Vector128.Create((sbyte)(0x7F - 'z')), Vector128.Create((sbyte)(0x7F - 26))) & vector0x20;
// Compute the ones-complement diff between vec1 and vec2.
Vector128<sbyte> onesCompDiff = vec1.AsSByte() ^ vec2.AsSByte();
// There must be no bits set in 'onesCompDiff' that aren't also set in 'letterChars'.
return Vector128.AndNot(letterChars, onesCompDiff) == default;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not going to worry about optimizing further at this point. Its more important that we get the feature in for .NET 8.
As is, this generally matches the algorithm used by the UTF-16 path and is correct.
An interested party can always submit this as an optimization after the main support goes in.
utf16Text = utf16TextArray.AsSpan(0, textMaxCharCount); | ||
} | ||
|
||
int utf16TextLength = Encoding.UTF8.GetChars(utf8Text, utf16Text); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential DoS vector here. This logic cannot safely coexist with approved API #87171. It's reasonable that somebody might write code like this:
while (!utf8Span.IsEmpty)
{
TNumber parsed = TNumber.Parse(utf8Span, NumberStyles.AllowTrailingInvalidCharacters, out int bytesConsumed);
UseValue(parsed);
utf8Span = utf8Span.Slice(bytesConsumed).TrimStart((byte)','); // or whatever the expected delimiter char is
}
The code above appears to have O(n)
runtime, where n is the length (in bytes) of the input span. However, since there's a transcoding step taking place under the covers, and since the transcoding step consumes the entire remainder of the input span, the actual runtime is O(n^2)
. Since aspnet allows a 4MB request by default, this means a loop like this can result in <some constant factor> * 16 trillion units of work being performed.
(Same comment elsewhere in this file where similar code appears.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to use Utf8.ToUtf16(utf8Text, utf16Text, out _, out int utf16TextLength, replaceInvalidSequences: false)
, as with other paths. It throws FormatException
on Parse
and returns false
for TryParse
, indicating the input was not in a correct format.
utf16Text = utf16TextArray.AsSpan(0, textMaxCharCount); | ||
} | ||
|
||
int utf16TextLength = Encoding.UTF8.GetChars(utf8Text, utf16Text); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential correctness issue here. What behavior should this routine have when the input is not well-formed UTF-8? The current behavior here is: lossily replace invalid UTF-8 sequences with well-formed UTF-16 replacement characters, then call the UTF-16 parse routine. Does this change the desired correctness of the parse routine? What about if the replacement character has special meaning to the specified provider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to use Utf8.ToUtf16(utf8Text, utf16Text, out _, out int utf16TextLength, replaceInvalidSequences: false)
, as with other paths. It throws FormatException
on Parse
and returns false
for TryParse
, indicating the input was not in a correct format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any dev that needs to override the behavior can and should by overriding in their derived type.
|
||
if (sourceStatus != OperationStatus.Done) | ||
{ | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets a bit complicated. The code as-is definitively returns "not a prefix match!" - even though the inputs are invalid to the point that the question is nonsensical. It's a bit like asking "is 'dog' less than 'cat'?" It's not true and it's not false. It's just... huh?
Throwing a wrench into this even more is that ICU actually does have specialized handling for invalid UTF-16 sequences. (Basically, it treats them as opaque chars that don't match anything except for themselves.) So this means that, e.g., the sequence "\uD800\uD801".StartsWith("\uD800")
would return true, even though both sequences are ill-formed. But in the logic you have here, "\xFF\xFE"u8.CultureAwareStartsWith("\xFF"u8)
would return false.
Recommendation: Invalid UTF-8 sequences should throw (not return false) or should utilize existing ICU handling for invalid UTF-16 sequences.
The way I worked around this in the UTF-8 prototype was to use something akin to WTF-16 for the comparison. Basically, every time an invalid UTF-8 byte is encountered, append an invalid UTF-16 code point to the string we're building. Any invalid UTF-8 byte 0x??
would map to the char U+DF??
. So for instance, the invalid byte 0x80
would map to the invalid char 0xDF80
, the invalid byte 0xC0
would map to the invalid char U+DFC0
, etc. The end result is that you're building up something that is mostly UTF-16, with the invalid sequences specifically chosen not to conflict with one another, which allows ICU to handle the input string appropriately.
This does not have the same algorithmic complexity constraints that I call out in IUtf8SpanParsable
, since ICU's implementation for culture-aware prefix matching is O(needle.Length + haystack.Length)
with no early-exit optimization. It's lamentable, but it is what it is. So that means the transcoding logic is upping some constant factor but isn't otherwise changing the algorithmic complexity itself.
(NLS, I believe, does have an early exit consideration. But that should be enough of an edge case that I'm willing to hold my nose for any algorithmic complexity changes we introduce.)
byteOffset += 4; | ||
length -= 4; | ||
} | ||
|
||
Debug.Assert(length == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite confused as to how testing was good before and is failing now, but this snippet looks suspicious. The length was 1-3, and byteOffset (though not used after this point) also would be increased by 1-3 (except for the 2 sometimes already added above)
This makes progress towards #81500
Still not covered are: