-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
System.Text.Encodings.Web refactoring and code modernization #49373
Conversation
- Refactor unsafe code from TextEncoder workhorse routines into standalone helpers - Fix bounds check logic in workhorse routines - Remove vestigial code from the library and unit test project - Add significant unit test coverage for the workhorse routines and unsafe helpers
Tagging subscribers to this area: @tarekgh, @eiriktsarpalis, @layomia Issue DetailsBackground: System.Text.Encodings.Web contains significant amounts of unsafe code. Part of this is due to the fact that the abstraction itself is pointer-based. And part of it is due to efforts to increase performance in hot paths. However, because these code patterns end up in hot paths and tight loops, it's difficult to foresee all the different edge cases that might crop up. This can manifest itself as a reliability or a security problem. Given that this code is intended to run over untrusted input, this is not ideal. High-level overview of this PRThis PR refactors the System.Text.Encodings.Web project, modernizes much of the code to use
A brief tour of the filesAllowedBmpCodePointsBitmap.cs - a bitmap of "allowed vs. disallowed" flags for all BMP characters. The implementation of this class is unsafe and requires close review. However, all entry points are guaranteed safe, and there's a standalone unit test exercising edge cases for the unsafe implementation. AsciiByteMap.cs - a simple map of ASCII characters to single bytes, used for quick lookup by index. The implementation of this class is unsafe and requires close review. However, all entry points are guaranteed safe, and there's a standalone unit test exercising edge cases for the unsafe implementation. Default[Html|Url|JavaScript]Encoder.cs - the in-box implementations of [Html|Url|JavaScript]Encoder.cs - provide static factories around the OptimizedInboxTextEncoder.cs - contains the shared "find which characters need escaping and write out the escaped form" logic used by all of the in-box encoders. There's no longer a separate code path for JSON vs. everything else. There are some unsafe method overrides, but for the most part they just forward arguments and don't do anything particularly interesting. The implementation of OptimizedInboxTextEncoder.Ascii.cs - contains optimized lookup tables for ASCII escaping. The implementation of these methods is unsafe and requires close review. However, all entry points are guaranteed safe, and there are standalone unit tests for these APIs. OptimizedInboxTextEncoder.[Ssse3|Simd].cs - contains SSE3-optimized "find the first char / byte to escape" logic. The implementation of these methods is unsafe and requires close review. SpanUtility.cs - contains helper methods for working with and writing data to spans. The implementation of these methods is unsafe and requires close review. However, all entry points are guaranteed safe, and there are standalone unit tests for these APIs. TextEncoder.cs - contains naïve "find which characters need escaping and write out the escaped form" logic that can work for generalized encoding that doesn't fulfill the contracts provided by our inbox encoders. There are also shared helper optimization methods for handling string escaping, etc. The implementation of these methods is safe, modulo some unsafe method overrides that forward to safe alternatives. Polyfill\*.cs - contains internal polyfill implementations for APIs which are missing from downlevel.
PerformancePerformance numbers and discussion will be left as a comment within the issue. Other notes for reviewersThe package no longer builds for netstandard2.1, netcoreapp3.0, and net461. Instead, everything is unified as follows:
The existing SSE2 and ADVSIMD optimizations have been removed as part of this PR. The reason for this is that there's no longer a need for a "does this vector contain only ASCII bytes?" helper method. Instead, the SIMD ASCII-processing code paths have been written in terms of a pshufb-equivalent. For x86, this requires SSSE3.1. The ARM64 equivalent code path was never checked in to this library. That work will need to take place in order to restore the performance on ARM64. (/cc @carlossanlop @eiriktsarpalis) Fixes #39829. Ref: CVE-2021-26701 (MSRC 62749)
|
Performance resultsRaw performance numbers
Benchmark code
namespace ConsoleAppBenchmark
{
[SkipLocalsInit]
public class TextEncoderRunner
{
[Params(
"The quick brown fox jumps over the lazy dog.", // no escaping needed ever
"<div id=\"myDiv\">Escape & me!</div>", // contains some HTML / URL / JSON-sensitive chars
"Лорем ипсум долор сит амет, цоммуне малуиссет цонцлудатуряуе ад хис.")] // Cyrillic lipsum; no escaping needed (when Cyrillic allowed)
public string Arg { get; set; }
private byte[] _argUtf8;
private char[] _scratchBuffer = new char[1024];
private byte[] _scratchUtf8Buffer = new byte[1024];
[Params("HTML", "URL", "JSON-Default", "JSON-Relaxed")]
public string Encoder { get; set; }
private TextEncoder _encoder;
[GlobalSetup]
public void Setup()
{
_argUtf8 = Encoding.UTF8.GetBytes(Arg);
_encoder = Encoder switch
{
"HTML" => HtmlEncoder.Default,
"URL" => UrlEncoder.Default,
"JSON-Default" => JavaScriptEncoder.Default,
"JSON-Relaxed" => JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
_ => throw new Exception("Unknown encoder."),
};
}
[Benchmark]
public unsafe int FindFirstCharToEncodeUtf16()
{
string arg = Arg;
_ = arg.Length; // deref; prove not null
fixed (char* pArg = arg)
{
return _encoder.FindFirstCharacterToEncode(pArg, arg.Length);
}
}
[Benchmark]
public int FindFirstCharToEncodeUtf8()
{
byte[] argUtf8 = _argUtf8;
_ = argUtf8.Length; // deref; prove not null
return _encoder.FindFirstCharacterToEncodeUtf8(argUtf8);
}
[Benchmark]
public string EncodeToStringUtf16()
{
return _encoder.Encode(Arg);
}
[Benchmark]
public OperationStatus EncodeToBufferUtf16()
{
string arg = Arg;
_ = arg.Length; // deref; prove not null
char[] dest = _scratchBuffer;
_ = dest.Length; // deref; prove not null
return _encoder.Encode(arg, dest, out _, out _);
}
[Benchmark]
public OperationStatus EncodeToBufferUtf8()
{
byte[] argUtf8 = _argUtf8;
_ = argUtf8.Length; // deref; prove not null
byte[] dest = _scratchUtf8Buffer;
_ = dest.Length; // deref; prove not null
return _encoder.EncodeUtf8(argUtf8, dest, out _, out _);
}
}
} Performance discussionsPerformance is generally better across the board, often significantly so. The performance improvement comes from three main places:
We also take advantage of recent PRs like #49180 to reduce the number of duplicate checks occurring inside our hot paths, opting to hoist these checks outside of the loop where possible. The notable exception to the performance improvement is the |
{ | ||
if (value.Value == '<') | ||
{ | ||
if (!SpanUtility.TryWriteBytes(destination, (byte)'&', (byte)'l', (byte)'t', (byte)';')) { goto OutOfSpace; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something crucial here, but wouldn't it be better to pass an uint directly and do BinaryPrimitives.ReverseEndianness if required? ReverseEndianess is a bswap, compared to multiple shift and or in TryWriteBytes or even pass the appropriate reversed endianess version directly if required?
If I read the sharplab asm correcctly, apparently having something like
uint v = (byte)'&' << 24 | (byte)'l' << 16 | (byte)'t' << 8 | (byte)';';
if (BitConverter.IsLittleEndian)
{
v = BinaryPrimitives.ReverseEndianness(v);
}
that would allow you to keep the bytes visible and not require some "strange" magic value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal was to keep the call site readable: "I'm writing these bytes in this order." If we want to change the implementation to call ReverseEndianness
as an implementation detail I'm fine with that. You're right in that the ReverseEndianness
API was intended to be optimized by the JIT as const input -> const output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(only half-way throught the code so far...enough for me today)
uint value; | ||
if (BitConverter.IsLittleEndian) | ||
{ | ||
value = ((uint)d << 24) | ((uint)c << 16) | ((uint)b << 8) | a; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe writing the individual bytes is faster as
- this produces quite a lot of machine code
- cpus have a store buffer, so flushing to L1 will be done in bigger "chunks" anyway
A quick micro-benchmark (on kaby-lake) proves that:
| Method | Mean | Error | StdDev | Ratio | RatioSD |
|------- |---------:|----------:|----------:|------:|--------:|
| A | 1.335 ns | 0.0656 ns | 0.1022 ns | 1.00 | 0.00 |
| B | 1.178 ns | 0.0620 ns | 0.0608 ns | 0.85 | 0.08 |
A...your code
B...individiual writes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the asm output on your box? On my box this code produces a single mov dword ptr [foo], CONST
instruction, which beats the performance of four mov byte ptr [foo + i], CONST
instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mov dword ptr [foo], CONST
Really a const
if a
, b
, ... are arguments?
Or just in the specific case where the method is inlined and the arguments can be evaluated as constant values?
In this case of course that's better.
For asm (note: I'm not on latest main-branch):
; Bench.A()
; ...
cmp r8d,4
jl short M00_L02
shl ecx,18
shl r10d,10
or ecx,r10d
shl r9d,8
or ecx,r9d
or eax,ecx
mov [rdx],eax
mov eax,1
jmp short M00_L03
M00_L02:
xor eax,eax
M00_L03:
ret
; Bench.B()
; ...
cmp r8d,4
jl short M00_L02
mov [rdx],al
mov [rdx+1],r9b
mov [rdx+2],r10b
mov [rdx+3],cl
mov eax,1
jmp short M00_L03
M00_L02:
xor eax,eax
M00_L03:
ret
The micro-benchmark is very flaky, but B
is always faster.
Although it's a micro benchmark, that doesn't take into account that the store buffer may be full, only one store can be dispatched per cycle, etc. as it can be on real world workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfoidl The specific use case for this API is that all value parameters are constants. That's also called out in the devdoc on the API. This causes the JIT to const-fold everything.
I'm investigating why this is failing in CI. Unit tests pass cleanly on my box. |
Hello JSON crew! You're pinged on this review because it changes the underlying System.Text.Encodings.Web implementation, and I had to adjust one of the System.Text.Json unit tests to account for the change. The System.Text.Json-specific change is cf8e998. It's a unit test only change. Basically, when the encoder sees invalid UTF-* data, it replaces that data with U+FFFD ('�') in the response. The unit test change makes the test resilient against the response containing either a literal '�' character or the escaped "\uFFFD" form, which are equivalent in JSON. |
The "all configurations" broken CI leg should be fixed by #49396. |
- Update test csproj to include missing polyfills - Fix net461 test compilation failures
/azp run runtime |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the SSE code and this looks good to me.
Left some nits.
} while ((i += 16) < lastLegalIterationFor16CharRead); | ||
} | ||
|
||
if ((lengthInBytes & 8) != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is clever and produces nice test jmp
-combo that can be fused.
if (span.Length >= 6) | ||
{ | ||
ulong value64; | ||
uint value32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nit: naming, we have value
, hi
, lo
and these. Can this be unified? I like the value64
and value32
approach most.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed them to abcd and ef, depending on what values they're intended to hold. I think this naming is a little clearer. Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know what you think!
Now it's clear and a good naming that I like.
(Sorry for not replying earlier)
src/libraries/System.Text.Encodings.Web/src/System/Text/Encodings/Web/SpanUtility.cs
Outdated
Show resolved
Hide resolved
...es/System.Text.Encodings.Web/src/System/Text/Encodings/Web/OptimizedInboxTextEncoder.Simd.cs
Outdated
Show resolved
Hide resolved
...s/System.Text.Encodings.Web/src/System/Text/Encodings/Web/OptimizedInboxTextEncoder.Ssse3.cs
Outdated
Show resolved
Hide resolved
|
||
if ((lengthInBytes & 3) != 0) | ||
{ | ||
Debug.Assert(lengthInBytes - i <= 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the other branches have a Debug.Assert too?
Like Debug.Assert(lengthInBytes - i >= 8 && lengthInBytes - i < 16);
, etc.
...s/System.Text.Encodings.Web/src/System/Text/Encodings/Web/OptimizedInboxTextEncoder.Ssse3.cs
Outdated
Show resolved
Hide resolved
I'd be curious to see benchmark results on arm64. |
src/libraries/System.Text.Encodings.Web/ref/System.Text.Encodings.Web.csproj
Show resolved
Hide resolved
Might be a good excuse to learn how to compile and run perf tests on my SPX device. :) Per the comments at the top of the issue, I expect this will regress performance on arm64 for the "nothing needs to be escaped" code paths. However, the arm64 code paths here really needed to be reworked anyway in order to support pshufb-like semantics. Once that work is done, I expect arm64 performance here to be better than it was for 5.0. |
@eerhardt Interesting. I'm curious about the System.Private.CoreLib change in particular, as the only file touched there was https://github.com/dotnet/runtime/pull/49373/files#diff-3a22ee85ff262ccdbad92a3c073a523a039016bc6ee6cab752621493d6d46693, and I'm not sure how that trivial a change could have caused a 2KB regression. How does one begin investigating this? |
With trimming, changes that are the most impactful are often the higher level changes that cause more, or less, code to be kept in dependent assemblies. So refactoring System.Text.Encodings.Web can change it to use more APIs / code from CoreLib. Which means those APIs can no longer be trimmed after the refactoring.
Here are the steps I use to investigate size changes:
This gets you the "before" app in Now to get that app to use your change, what I typically do is "replace the NuGet package files with locally built files". I'm sure there are other approaches, but I found this to be the easiest.
Note: Sometimes the latest SDK and the latest main branch have changes between them. So it is good practice to do steps 1-5 above using the "before your changes" commit and the "after your changes" commit. This is what I did to get the numbers above. Tools for analysis I've found helpful are:
|
@eerhardt With the latest commit (d13f660), I removed the With this commit System.Private.CoreLib.dll.br is 1300 bytes above baseline. I'm still looking into possible improvements in System.Text.Encodings.Web.dll itself. |
Can you open an issue for this in https://github.com/mono/linker ? |
I can work around this for now by making AsVector a property instead of a field, but I'll need to disassemble again to make sure I'm not undoing the optimizations we got from #49180. internal readonly ref readonly Vector128<byte> AsVector
=> ref Unsafe.As<byte, Vector128<byte>>(ref Unsafe.AsRef(in AsBytes[0])); Edit: Looks like it's interfering with the optimizations and causing register stack spilling. :( |
src/libraries/System.Text.Encodings.Web/src/System/Text/Encodings/Web/TextEncoder.cs
Show resolved
Hide resolved
src/libraries/System.Text.Encodings.Web/tests/JavaScriptStringEncoderTests.Relaxed.cs
Show resolved
Hide resolved
@eiriktsarpalis I was experimenting with arm64 SIMD enablement over in my personal fork based on some feedback from @tannergooding. I still need to test the code, but the logic in that file is fairly close to the logic in the SSSE3-specific code paths. Trying to figure out how to get it over to my SPX device for perf testing. Edit: Something's going on with BenchmarkDotNet on that box, but I am able to perform some basic smoke testing. Things appear to be working correctly. I'll hold the ARM64 commit for now and send it as a separate PR once this is done. That way I can enlist help for running benchmarks and we can dedicate that separate PR just for ARM64-related discussion. Edit x2: Basic console app and stopwatch never fails. :) Baseline (release/5.0): 46 ns for |
Most recent 2 commits are unit test changes only to respond to PR feedback, no source or packaging changes. |
CI "build all configurations" failure is known issue which should be resolved by #49781. Edit: Since that PR might bake for a few days, I've cherry-picked two updates to the package baseline in the latest iteration of this PR to help unblock CI. The resulting merge conflict should be minimal. |
Latest commit (76f04f7) is merge from origin/main and conflict resolution, no code changes from previous PR review. |
Failing wasm test appears to be #48079. |
@GrabYourPitchforks we got some nice improvements from this PR: DrewScoggins/performance-2#4632 |
@adamsitnik This is probably also reflected in DrewScoggins/performance-2#4666. It's good to keep an eye on the System.Text.Json tests specifically to ensure that we didn't regress anything there. |
@GrabYourPitchforks we're seeing a big brower-wasm regression in System.Text.Json over a range that includes this #50260 |
@lewing Thanks for the pointer. I'll respond over in that thread. |
Was this ever done or did we ever measure the performance difference between x64 and arm64 and if there is a gap? Also, is there a tracking issue to optimize it for ARM64 (if it is slow)? |
It was addressed by #49847. |
Thanks. It is surprising that none of the MicroBenchmarks improvements were noticed or we might have missed triaging the improvements. CC: @DrewScoggins |
Background: System.Text.Encodings.Web contains significant amounts of unsafe code. Part of this is due to the fact that the abstraction itself is pointer-based. And part of it is due to efforts to increase performance in hot paths. However, because these code patterns end up in hot paths and tight loops, it's difficult to foresee all the different edge cases that might crop up. This can manifest itself as a reliability or a security problem. Given that this code is intended to run over untrusted input, this is not ideal.
High-level overview of this PR
This PR refactors the System.Text.Encodings.Web project, modernizes much of the code to use
Span<T>
and other safer APIs as appropriate, and fixes a handful of outstanding bugs.A brief tour of the files
AllowedBmpCodePointsBitmap.cs - a bitmap of "allowed vs. disallowed" flags for all BMP characters. The implementation of this class is unsafe and requires close review. However, all entry points are guaranteed safe, and there's a standalone unit test exercising edge cases for the unsafe implementation.
AsciiByteMap.cs - a simple map of ASCII characters to single bytes, used for quick lookup by index. The implementation of this class is unsafe and requires close review. However, all entry points are guaranteed safe, and there's a standalone unit test exercising edge cases for the unsafe implementation.
Default[Html|Url|JavaScript]Encoder.cs - the in-box implementations of
HtmlEncoder
,UrlEncoder
, andJavaScriptEncoder
. There is no longer a separate implementation for the default inbox JSON encoder vs. the unsafe relaxed JSON encoder: they both filter down to shared logic in DefaultJavaScriptEncoder.cs. These files also contain the core "how do I perform HTML / URL / JS escaping?" logic. These files now contain only safe code, modulo overriding some existing unsafe APIs and forwarding the arguments elsewhere.[Html|Url|JavaScript]Encoder.cs - provide static factories around the
Default\*Encoder
types. There's no longer any real logic in these types.OptimizedInboxTextEncoder.cs - contains the shared "find which characters need escaping and write out the escaped form" logic used by all of the in-box encoders. There's no longer a separate code path for JSON vs. everything else. There are some unsafe method overrides, but for the most part they just forward arguments and don't do anything particularly interesting. The implementation of
GetIndexOfFirstCharToEncode
is unsafe and requires close review.OptimizedInboxTextEncoder.Ascii.cs - contains optimized lookup tables for ASCII escaping. The implementation of these methods is unsafe and requires close review. However, all entry points are guaranteed safe, and there are standalone unit tests for these APIs.
OptimizedInboxTextEncoder.Ssse3.cs - contains SSE3-optimized "find the first char / byte to escape" logic. The implementation of these methods is unsafe and requires close review.
SpanUtility.cs - contains helper methods for working with and writing data to spans. The implementation of these methods is unsafe and requires close review. However, all entry points are guaranteed safe, and there are standalone unit tests for these APIs.
TextEncoder.cs - contains naïve "find which characters need escaping and write out the escaped form" logic that can work for generalized encoding that doesn't fulfill the contracts provided by our inbox encoders. There are also shared helper optimization methods for handling string escaping, etc. The implementation of these methods is safe, modulo some unsafe method overrides that forward to safe alternatives.
Polyfill\*.cs - contains internal polyfill implementations for APIs which are missing from downlevel.
Performance
Performance numbers and discussion will be left as a comment within the issue.
Other notes for reviewers
The package no longer builds for netstandard2.1 or netcoreapp3.0. Instead, everything is unified as follows:
The existing SSE2 and ADVSIMD optimizations have been removed as part of this PR. The reason for this is that there's no longer a need for a "does this vector contain only ASCII bytes?" helper method. Instead, the SIMD ASCII-processing code paths have been written in terms of a pshufb-equivalent. For x86, this requires SSSE3.1. The ARM64 equivalent code path was never checked in to this library. That work will need to take place in order to restore the performance on ARM64. (/cc @carlossanlop @eiriktsarpalis)
Fixes #39829.
Fixes #45994.
Fixes #48519.
Ref: CVE-2021-26701 (MSRC 62749)