fix: optimize the ARM function for systems with weak SIMD performance #50

Merged · 5 commits · Jun 29, 2024
Changes from 4 commits
README.md: 64 changes (32 additions, 32 deletions)

@@ -145,38 +145,38 @@ faster than the standard library.
| Latin-Lipsum | 87 | 38 | 2.3 x |
| Russian-Lipsum | 7.4 | 2.7 | 2.7 x |

-On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over four times
+On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over five times
faster than the standard library.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
-| Twitter.json | 12 | 8.7 | 1.4 x |
-| Arabic-Lipsum | 3.4 | 2.0 | 1.7 x |
-| Chinese-Lipsum | 3.4 | 2.6 | 1.3 x |
-| Emoji-Lipsum | 3.4 | 0.8 | 4.3 x |
-| Hebrew-Lipsum | 3.4 | 2.0 | 1.7 x |
-| Hindi-Lipsum | 3.4 | 1.6 | 2.1 x |
-| Japanese-Lipsum | 3.4 | 2.4 | 1.4 x |
-| Korean-Lipsum | 3.4 | 1.3 | 2.6 x |
+| Twitter.json | 14 | 8.7 | 1.6 x |
+| Arabic-Lipsum | 4.2 | 2.0 | 2.1 x |
+| Chinese-Lipsum | 4.2 | 2.6 | 1.6 x |
+| Emoji-Lipsum | 4.2 | 0.8 | 5.3 x |
+| Hebrew-Lipsum | 4.2 | 2.0 | 2.1 x |
+| Hindi-Lipsum | 4.2 | 1.6 | 2.6 x |
+| Japanese-Lipsum | 4.2 | 2.4 | 1.8 x |
+| Korean-Lipsum | 4.2 | 1.3 | 3.2 x |
| Latin-Lipsum | 42 | 17 | 2.5 x |
-| Russian-Lipsum | 3.3 | 0.95 | 3.5 x |
+| Russian-Lipsum | 4.2 | 0.95 | 4.4 x |


On a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance
boost as the Neoverse V1.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
-| Twitter.json | 15 | 10 | 1.5 x |
-| Arabic-Lipsum | 4.0 | 2.3 | 1.7 x |
-| Chinese-Lipsum | 4.0 | 2.9 | 1.4 x |
-| Emoji-Lipsum | 4.0 | 0.9 | 4.4 x |
-| Hebrew-Lipsum | 4.0 | 2.3 | 1.7 x |
-| Hindi-Lipsum | 4.0 | 1.9 | 2.1 x |
-| Japanese-Lipsum | 4.0 | 2.7 | 1.5 x |
-| Korean-Lipsum | 4.0 | 1.5 | 2.7 x |
+| Twitter.json | 17 | 10 | 1.7 x |
+| Arabic-Lipsum | 5.0 | 2.3 | 2.2 x |
+| Chinese-Lipsum | 5.0 | 2.9 | 1.7 x |
+| Emoji-Lipsum | 5.0 | 0.9 | 5.5 x |
+| Hebrew-Lipsum | 5.0 | 2.3 | 2.2 x |
+| Hindi-Lipsum | 5.0 | 1.9 | 2.6 x |
+| Japanese-Lipsum | 5.0 | 2.7 | 1.9 x |
+| Korean-Lipsum | 5.0 | 1.5 | 3.3 x |
| Latin-Lipsum | 50 | 20 | 2.5 x |
-| Russian-Lipsum | 4.0 | 1.2 | 3.3 x |
+| Russian-Lipsum | 5.0 | 1.2 | 4.2 x |


On a Neoverse N1 (Graviton 2), our validation function is 1.3 to over four times
@@ -195,23 +195,23 @@ faster than the standard library.
| Latin-Lipsum | 42 | 17 | 2.5 x |
| Russian-Lipsum | 3.3 | 0.95 | 3.5 x |

-On a Neoverse N1 (Graviton 2), our validation function is up to three times
+On a Neoverse N1 (Graviton 2), our validation function is up to over three times
faster than the standard library.


| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
-| Twitter.json | 7.0 | 5.7 | 1.2 x |
-| Arabic-Lipsum | 2.2 | 0.9 | 2.4 x |
-| Chinese-Lipsum | 2.1 | 1.8 | 1.1 x |
-| Emoji-Lipsum | 1.8 | 0.7 | 2.6 x |
-| Hebrew-Lipsum | 2.0 | 0.9 | 2.2 x |
-| Hindi-Lipsum | 2.0 | 1.0 | 2.0 x |
-| Japanese-Lipsum | 2.1 | 1.7 | 1.2 x |
-| Korean-Lipsum | 2.2 | 1.0 | 2.2 x |
-| Latin-Lipsum | 24 | 13 | 1.8 x |
-| Russian-Lipsum | 2.1 | 0.7 | 3.0 x |

-One difficulty with ARM processors is that they have varied SIMD/NEON performance. For example, Neoverse N1 processors, not to be confused with the Neoverse V1 design used by AWS Graviton 3, have weak SIMD performance. Of course, one can pick and choose which approach is best and it is not necessary to apply SimdUnicode in all cases. We expect good performance on recent ARM-based Qualcomm processors.
+| Twitter.json | 7.8 | 5.7 | 1.4 x |
+| Arabic-Lipsum | 2.5 | 0.9 | 2.8 x |
+| Chinese-Lipsum | 2.5 | 1.8 | 1.4 x |
+| Emoji-Lipsum | 2.5 | 0.7 | 3.6 x |
+| Hebrew-Lipsum | 2.5 | 0.9 | 2.7 x |
+| Hindi-Lipsum | 2.3 | 1.0 | 2.3 x |
+| Japanese-Lipsum | 2.4 | 1.7 | 1.4 x |
+| Korean-Lipsum | 2.5 | 1.0 | 2.5 x |
+| Latin-Lipsum | 23 | 13 | 1.8 x |
+| Russian-Lipsum | 2.3 | 0.7 | 3.3 x |
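The note above on varied ARM SIMD performance suggests picking the approach per system. A minimal dispatch sketch of that idea follows; the wrapper name is hypothetical, this is not SimdUnicode's public API, and it assumes it sits inside the same class as the two routines that appear in this PR's diff, with `System.Runtime.Intrinsics.Arm` imported.

```csharp
// Hypothetical dispatch helper (illustration only, not the library's API):
// use the NEON path where AdvSimd is available, otherwise fall back to scalar code.
public unsafe static byte* GetPointerToFirstInvalidByteDispatch(
    byte* pInputBuffer, int inputLength,
    out int utf16CodeUnitCountAdjustment, out int scalarCountAdjustment)
{
    if (AdvSimd.Arm64.IsSupported)
    {
        // SIMD path tuned by this PR.
        return GetPointerToFirstInvalidByteArm64(
            pInputBuffer, inputLength,
            out utf16CodeUnitCountAdjustment, out scalarCountAdjustment);
    }
    // Portable fallback used when NEON is unavailable (or judged not worthwhile).
    return GetPointerToFirstInvalidByteScalar(
        pInputBuffer, inputLength,
        out utf16CodeUnitCountAdjustment, out scalarCountAdjustment);
}
```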


## Building the library

src/UTF8.cs: 67 changes (54 additions, 13 deletions)

@@ -1277,7 +1277,18 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
}
return GetPointerToFirstInvalidByteScalar(pInputBuffer + processedLength, inputLength - processedLength, out utf16CodeUnitCountAdjustment, out scalarCountAdjustment);
}

+public static void ToString(Vector128<byte> v)
+{
+Span<byte> b = stackalloc byte[16];
+v.CopyTo(b);
+Console.WriteLine(Convert.ToHexString(b));
[Inline review comment] @EgorBo (Collaborator), Jun 29, 2024:

I think we can file an API proposal for dotnet/runtime to introduce an API, something like:

var hex = v.ToString("X"); // common format symbol for hex

[Reply] PR author (Member):

Thanks!!! I did not mean to leave this code there. Removed.

I can file the proposal if you think it is useful for me to do it, or I can support it if you file it.
+}
+public static void ToString(Vector128<sbyte> v)
+{
+Span<byte> b = stackalloc byte[16];
+v.AsByte().CopyTo(b);
+Console.WriteLine(Convert.ToHexString(b));
+}
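A quick usage sketch for the debug helpers above (not part of the diff): the value printed is arbitrary, and the `ToString("X")` call in the last line is only the API proposed in the review comment — it does not exist in .NET today.

```csharp
using System.Runtime.Intrinsics;

// Assumes a reference to SimdUnicode and the helper shown above.
Vector128<byte> block = Vector128.Create((byte)0xE2);   // arbitrary test value
SimdUnicode.UTF8.ToString(block);                       // prints "E2E2…E2" via Convert.ToHexString

// With the proposed dotnet/runtime API (hypothetical, not available today):
// string hex = block.ToString("X");
```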
public unsafe static byte* GetPointerToFirstInvalidByteArm64(byte* pInputBuffer, int inputLength, out int utf16CodeUnitCountAdjustment, out int scalarCountAdjustment)
{
int processedLength = 0;
@@ -1360,18 +1371,31 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
// The block goes from processedLength to processedLength/16*16.
int contbytes = 0; // number of continuation bytes in the block
int n4 = 0; // number of 4-byte sequences that start in this block
+/////
+// Design:
+// Instead of updating n4 and contbytes continuously, we accumulate
+// the values in n4v and contv, while using overflowCounter to make
+// sure we do not overflow. This allows you to reach good performance
+// on systems where summing across vectors is slow.
+////
+Vector128<sbyte> n4v = Vector128<sbyte>.Zero;
+Vector128<sbyte> contv = Vector128<sbyte>.Zero;
+int overflowCounter = 0;
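To make the design note concrete, here is a self-contained sketch of the same deferred-summation pattern applied to a simpler task: counting UTF-8 continuation bytes (`0b10xxxxxx`) in a buffer. The class and method names and the flush threshold of 120 blocks are illustrative choices, not the PR's code; the key point is that each matching lane adds 0xFF (−1 as an `sbyte`), so the accumulator must be drained before any lane can take more hits than an `sbyte` can hold.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class ContinuationCounter
{
    // Counts UTF-8 continuation bytes in the full 16-byte blocks of 'data'.
    public static unsafe int Count(ReadOnlySpan<byte> data)
    {
        if (!AdvSimd.Arm64.IsSupported)
            throw new PlatformNotSupportedException("This sketch targets ARM64 NEON only.");

        int total = 0;
        Vector128<sbyte> acc = Vector128<sbyte>.Zero;                 // per-lane running count, negated
        Vector128<sbyte> largestCont = Vector128.Create((sbyte)-65);  // 0xBF viewed as a signed byte
        int blocksSinceFlush = 0;

        fixed (byte* p = data)
        {
            for (int i = 0; i + 16 <= data.Length; i += 16)
            {
                Vector128<sbyte> block = AdvSimd.LoadVector128(p + i).AsSByte();
                // Continuation bytes are 0x80..0xBF, i.e. signed values <= -65.
                // Matching lanes are 0xFF (-1), so 'acc' goes down by one per match.
                acc += AdvSimd.CompareLessThanOrEqual(block, largestCont);
                if (++blocksSinceFlush == 120)   // drain well before a lane could reach -128
                {
                    total += -AdvSimd.Arm64.AddAcrossWidening(acc).ToScalar(); // SADDLV, then negate
                    acc = Vector128<sbyte>.Zero;
                    blocksSinceFlush = 0;
                }
            }
            total += -AdvSimd.Arm64.AddAcrossWidening(acc).ToScalar();
        }
        // Bytes past the last full block would be handled by scalar code in real use.
        return total;
    }
}
```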
for (; processedLength + 16 <= inputLength; processedLength += 16)
{

Vector128<byte> currentBlock = AdvSimd.LoadVector128(pInputBuffer + processedLength);
if ((currentBlock & v80) == Vector128<byte>.Zero)
// We could also use (AdvSimd.Arm64.MaxAcross(currentBlock).ToScalar() <= 127) but it is slower on some
// hardware.
{
// We have an ASCII block, no need to process it, but
// we need to check if the previous block was incomplete.
if (prevIncomplete != Vector128<byte>.Zero)
{
+contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+if (n4v != Vector128<sbyte>.Zero)
+{
+n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+}
int off = processedLength >= 3 ? processedLength - 3 : processedLength;
byte* invalidBytePointer = SimdUnicode.UTF8.SimpleRewindAndValidateWithErrors(16 - 3, pInputBuffer + processedLength - 3, inputLength - processedLength + 3);
// So the code is correct up to invalidBytePointer
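For the ASCII fast path above, the comment mentions two interchangeable block-level tests. A small sketch of both follows, with illustrative names; both require `AdvSimd.Arm64` support at run time, and which one wins depends on the microarchitecture — which is exactly why the comment hedges.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class AsciiBlockCheck
{
    static readonly Vector128<byte> HighBit = Vector128.Create((byte)0x80);

    // Bitwise form: no byte in the block has its high bit set.
    public static bool IsAsciiByMask(Vector128<byte> block)
        => (block & HighBit) == Vector128<byte>.Zero;

    // Reduction form: the largest byte in the block is at most 127 (UMAXV).
    public static bool IsAsciiByMaxAcross(Vector128<byte> block)
        => AdvSimd.Arm64.MaxAcross(block).ToScalar() <= 127;
}
```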
Expand Down Expand Up @@ -1432,11 +1456,13 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
Vector128<byte> must23 = AdvSimd.Or(isThirdByte, isFourthByte);
Vector128<byte> must23As80 = AdvSimd.And(must23, v80);
Vector128<byte> error = AdvSimd.Xor(must23As80, sc);
-// AdvSimd.Arm64.MaxAcross(error) works, but it might be slower
-// than AdvSimd.Arm64.MaxAcross(Vector128.AsUInt32(error)) on some
-// hardware:
if (error != Vector128<byte>.Zero)
{
+contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+if (n4v != Vector128<sbyte>.Zero)
+{
+n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+}
byte* invalidBytePointer;
if (processedLength == 0)
{
@@ -1459,17 +1485,32 @@ private unsafe static (int utfadjust, int scalaradjust) calculateErrorPathadjust
return invalidBytePointer;
}
prevIncomplete = AdvSimd.SubtractSaturate(currentBlock, maxValue);
-contbytes += -AdvSimd.Arm64.AddAcross(AdvSimd.CompareLessThanOrEqual(Vector128.AsSByte(currentBlock), largestcont)).ToScalar();
-Vector128<byte> largerthan0f = AdvSimd.CompareGreaterThan(currentBlock, fourthByteMinusOne);
-if (largerthan0f != Vector128<byte>.Zero)
+contv += AdvSimd.CompareLessThanOrEqual(Vector128.AsSByte(currentBlock), largestcont);
+n4v += AdvSimd.CompareGreaterThan(currentBlock, fourthByteMinusOne).AsSByte();
+overflowCounter++;
+// We have a risk of overflow if overflowCounter reaches 255,
+// in which case, we empty contv and n4v, and update contbytes and
+// n4.
+if (overflowCounter == 0xff)
{
-byte n4add = (byte)AdvSimd.Arm64.AddAcross(largerthan0f).ToScalar();
-int negn4add = (int)(byte)-n4add;
-n4 += negn4add;
+overflowCounter = 0;
+contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+contv = Vector128<sbyte>.Zero;
+if (n4v != Vector128<sbyte>.Zero)
+{
+n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+n4v = Vector128<sbyte>.Zero;
+}
}
}
}
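The `-AddAcrossWidening(...).ToScalar()` idiom used in the flush paths (and again in the epilogue below) turns a lane-wise mask accumulator into a non-negative count: every matching lane contributed 0xFF, which is −1 as an `sbyte`, and the widening sum (SADDLV) adds the sixteen lanes into a 16-bit value, so negating it gives the number of matches. A standalone illustration, not the PR's code:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

if (AdvSimd.Arm64.IsSupported)
{
    // A mask with exactly three lanes "true" (0xFF == -1 as sbyte).
    Vector128<sbyte> mask = Vector128<sbyte>.Zero
        .WithElement(0, (sbyte)-1)
        .WithElement(5, (sbyte)-1)
        .WithElement(9, (sbyte)-1);

    // SADDLV widens each sbyte lane to 16 bits and sums them: three -1 lanes give -3.
    short sum = AdvSimd.Arm64.AddAcrossWidening(mask).ToScalar();
    Console.WriteLine(-sum); // prints 3
}
```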
-bool hasIncompete = (prevIncomplete != Vector128<byte>.Zero);
+contbytes += -AdvSimd.Arm64.AddAcrossWidening(contv).ToScalar();
+if (n4v != Vector128<sbyte>.Zero)
+{
+n4 += -AdvSimd.Arm64.AddAcrossWidening(n4v).ToScalar();
+}

+bool hasIncompete = (prevIncomplete != Vector128<byte>.Zero);
if (processedLength < inputLength || hasIncompete)
{
byte* invalidBytePointer;