-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite GetIndexOfFirstNonAsciiByte #104503
base: main
Are you sure you want to change the base?
Conversation
- Deduplicate SIMD paths - there is no more need for that - Remove at least one branch from short length path - Use up to 128x4/256x2 SIMD unrolling
This comment was marked as outdated.
This comment was marked as outdated.
Benchmark results on Arm64
|
Benchmark results on Amd
|
…512b loop, mitigate Dynamic PGO impact
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
1 similar comment
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Benchmark results on Arm64
|
Benchmark results on Amd
|
This comment was marked as outdated.
This comment was marked as outdated.
Benchmark results on Intel
|
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
@EgorBot -amd -arm64 using System.Text.Unicode;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
// Let's try again
BenchmarkRunner.Run<AsciiBench>(args: args);
public class AsciiBench
{
[ParamsSource(nameof(Lengths))]
public int Length;
public IEnumerable<int> Lengths => [..Enumerable.Range(1, 8), 12, 16, 25, 50, 70, 200, 512, 16384];
byte[] ascii = [];
[GlobalSetup]
public void Setup()
{
ascii = new byte[Length];
ascii.AsSpan().Fill((byte)'f');
}
[Benchmark]
public bool ValidateAscii()
{
return Utf8.IsValid(ascii);
}
} |
Benchmark results on Intel
|
FYI the original implementation of this method waaaaay back when did indeed use byrefs instead of raw pointers, but IIRC @jkotas had concerns with that approach and much preferred the simplicity of using true pointers. I'd recommend asking him for his thoughts on this PR. |
Yes, I still think that unmanaged pointers are less bug prone than byref arithmetics. We have been using byref arithmetics all over the place just because of we can - I have stopped commenting on it. |
My idea was to move the ASCII path over to byrefs first to allow moving the UTF-8 path to them later (once Egor's PR for adopting Lemire's UTF-8 validation is done, possibly adding ARM64 path if it doesn't happen in scope of it). Once done, this would have allowed to improve call codegen around However, if you think this PR should be changed back to pointer arithmetics - please let me know. It was an optional goal of the change, and I'm happy as long as the performance of this hot path improves. Thanks! |
CC @jeffhandley As an area owner, PTAL when you have time. Thanks! |
@tannergooding will you continue reviewing this change? |
Description
The purpose of this change is to simplify
GetIndexOfFirstNonAsciiByte
by servicing all targets with a single path, which is now possible, and achieving optimal throughput on longer lengths.Per "Validating gigabytes of Unicode strings per second… in C#?", it appears that we leave a lot of performance on the table for ARM64 target.
Also closes #89924
Analysis
After addressing feedback, this PR now improves maximum non-ASCII byte search throughput by 3x for AVX2 (Zen 3) and AdvSimd (Neoverse N1), and 2x for AVX512 (Ice Lake). Short lengths improve by up to 25%, and the worst case regression is on AMD only for <5% for lengths ~65-200 (under 1ns impact).
Benchmark
Results
Ice Lake, Zen 3 and N1: #104503 (comment)
M1 Pro