Spanify some `Utf8String` and `Utf8StringBuilder` use #102101

PaulusParssinen · 2024-05-10T21:45:13Z

Address TODO in the Utf8String and Utf8StringBuilder used by ILC and R2R. Saw some potential use for newer, more efficient APIs. These seem to be primarily used by name mangling, which can definitely be improved further.

Let's see if anything breaks in CI first. Tried to avoid any "public" surface changes because these shared files seem to be used sneakily through the labyrinth of .csproj files. ~~For example did not rename UnderlyingArray -> UnderlyingSpan~~ and did not try to remove the implicit Utf8String -> string casting even though that was a bit of a headache while doing this (changing that is too spooky due to overload resolution).

update: CI seems to be good. Any extra stuff I should be concerned about and try building locally when touching these parts of code?

neon-sunset · 2024-05-11T00:46:35Z

Now that you are working on this, if you ever have an appetite for public facing API changes (either here or subsequent PR), it is worth changing Utf8StringBuilder to pool arrays, either becoming ValueUtf8StringBuilder (or shorter name) or as is as a start.

PaulusParssinen · 2024-05-11T00:50:35Z

> Now that you are working on this, if you ever have an appetite for public facing API changes (either here or subsequent PR), it is worth changing Utf8StringBuilder to pool arrays, either becoming ValueUtf8StringBuilder (or shorter name) or as is as a start.

I have all sorts of ideas like that; utf8 interpolated string handlers and what not. My first knee-jerk reaction was to look to make either of these a ref struct and actually just have an underlying span. I was gonna hoist Encoding.UTF8 accesses when done multiple times in a method. I was gonna do GetMaxByteCount instead of GetByteCount. That Append(char) is actually dead code in a way too.

But I'm still paranoid about what is indirectly consuming this API. This won't be the last PR, so I'll start small and try to get a feel for what kind of optimizations code owners here feel comfortable with.
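A minimal sketch of the "GetMaxByteCount instead of GetByteCount" idea mentioned above, with Encoding.UTF8 read once per call. `AppendUtf8` and its buffer-growth policy are hypothetical and only illustrate the trade-off; this is not the code from Utf8StringBuilder.

```csharp
using System;
using System.Text;

// Hypothetical helper: appends `value` as UTF-8 to `buffer` (which holds
// `length` valid bytes) and returns the new length.
static int AppendUtf8(ref byte[] buffer, int length, ReadOnlySpan<char> value)
{
    // Read Encoding.UTF8 once and reuse it for both calls below.
    Encoding utf8 = Encoding.UTF8;

    // GetMaxByteCount is a worst-case bound computed from the char count alone,
    // whereas GetByteCount has to scan the input to return the exact size.
    int maxBytes = utf8.GetMaxByteCount(value.Length);

    if (buffer.Length - length < maxBytes)
    {
        // Grow geometrically, but always enough for the worst case.
        Array.Resize(ref buffer, Math.Max(buffer.Length * 2, length + maxBytes));
    }

    // GetBytes returns the number of bytes actually written.
    int written = utf8.GetBytes(value, buffer.AsSpan(length));
    return length + written;
}
```

The cost of this approach is over-reserving: the bound is roughly three bytes per char, so mostly-ASCII input such as mangled names may grow the buffer earlier than an exact count would.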

neon-sunset · 2024-05-11T01:00:52Z

> utf8 interpolated string handlers and what not

Feel free to steal the code from U8String if you do go for that. Its "default" interpolated string handler pretty much reaches the performance ceiling, save for special-casing integer conversion, where the path is a bit heavy but overall fast enough; it beats DefaultInterpolatedStringHandler by ~2x on short lengths. Given that this type would still be internal, you could be more aggressive with the inline buffer within the struct (I had to tone it down from 256B to 128B and then to just 64B due to stack pressure, copies and state machine size impact).

In the case of ILC, however, it might be worth just getting a flamegraph first. It's a good question whether this is worth the effort given the limited utilization (you can hand-unroll the char conversion in a way that lets the JIT fold it, including the copy, into the branch for the correct code point length, manually unroll 1-2-3 byte span copies too, etc.; there is a lot of bikeshedding potential).

jkotas · 2024-05-11T14:21:46Z

Use UTF8 literals for single characters too

I think it would look better to keep the Append(char c) method and assert inside the method that c is an ASCII char.
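A rough sketch of what that suggestion could look like. `AppendAsciiChar` is a hypothetical stand-in for the builder method and writes into a caller-supplied span; only `Ascii.IsValid` and `Debug.Assert` are real APIs here, and the assert text is made up.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical stand-in for Utf8StringBuilder.Append(char): writes one byte
// at `length` and returns the new length.
static int AppendAsciiChar(Span<byte> destination, int length, char c)
{
    // Checked only in debug builds; the call is compiled out when DEBUG is not defined.
    Debug.Assert(Ascii.IsValid(c), "Append(char) expects an ASCII character");

    // An ASCII char has the same numeric value as its single UTF-8 byte.
    destination[length] = (byte)c;
    return length + 1;
}
```

Keeping the char overload lets call sites keep passing plain character constants such as '_', while the assert documents the ASCII-only contract.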

jkotas · 2024-05-13T17:00:33Z

/azp run runtime-nativeaot-outerloop

azure-pipelines · 2024-05-13T17:00:52Z

Azure Pipelines successfully started running 1 pipeline(s).

jkotas

Thanks

* Address TODOs
* Make Utf8String readonly
* Make Utf8StringBuilder.Append do two Encoding.UTF8 calls instead of many
* The quick is-valid-ascii check in UTF8 encoding _hopefully_ makes this worth the simplification, even though 99% of the inputs are just ASCII. I'm ready to revert this.
* Use UTF8 literals for the appended constant strings in ILC
* Use UTF8 literals for the appended constant strings in R2R
* Use pattern match in Utf8String.Equals(object)
* Use SequenceEquals in Utf8String.Equals(Utf8String)
* Use CommonPrefixLength in Utf8String.Compare(Utf8String, Utf8String)
* Very much inspired (copied) from dotnet#75851
* Remove unused Utf8StringBuilder.LastCharBoundary
* Use SequenceCompareTo
* UnderlyingArray -> AsSpan()
* Only return filled-portion of the buffer in Utf8StringBuilder.AsSpan

Co-authored-by: Jan Kotas <[email protected]>

* Write null-terminator for the augmentationString.
* Utf8StringBuilder.Append(char) -> Utf8StringBuilder.Append(byte)
* Only ASCII constant chars were passed to this method
* Add Ascii.IsValid assert to Utf8StringBuilder.Append

---------

Co-authored-by: Jan Kotas <[email protected]>
Co-authored-by: Michal Strehovský <[email protected]>
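For readers unfamiliar with the span helpers named in the commit list above, here is a small illustrative sketch over raw UTF-8 bytes. The function names are hypothetical and this is not the Utf8String implementation; it only shows how SequenceEqual, SequenceCompareTo and CommonPrefixLength behave on `ReadOnlySpan<byte>`.

```csharp
using System;

// Equality: same length and same bytes.
static bool Utf8Equals(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    => a.SequenceEqual(b);

// Ordinal comparison: negative, zero, or positive.
static int Utf8Compare(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    => a.SequenceCompareTo(b);

// The same ordering written by hand: skip the shared prefix, then compare
// the first differing byte, or the lengths if one input is a prefix of the other.
static int Utf8CompareViaPrefix(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
{
    int i = a.CommonPrefixLength(b);
    if (i < a.Length && i < b.Length)
    {
        return a[i].CompareTo(b[i]);
    }
    return a.Length.CompareTo(b.Length);
}
```

With UTF-8 literals (C# 11), a call such as `Utf8Equals("Foo"u8, "Foo"u8)` compares constant byte data directly, without encoding a string at run time. Only the sign of the comparison results should be relied on, not their magnitude.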

PaulusParssinen added 9 commits May 11, 2024 00:39

Address TODOs

56490e8

Make Utf8String readonly

ea1e855

Make Utf8StringBuilder.Append do two Encoding.UTF8 calls instead of many

bc4e50e

* The quick is-valid-ascii check in UTF8 encoding _hopefully_ makes this worth the simplification, even though 99% of the inputs are just ASCII. I'm ready to revert this.

Use UTF8 literals for the appended constant strings in ILC

ca24552

Use UTF8 literals for the appended constant strings in R2R

8d3e0db

Use pattern match in Utf8String.Equals(object)

7431393

Use SequenceEquals in Utf8String.Equals(Utf8String)

bd9e6ef

Use CommonPrefixLength in Utf8String.Compare(Utf8String, Utf8String)

cb0eb90

* Very much inspired (copied) from dotnet#75851

Remove unused Utf8StringBuilder.LastCharBoundary

2ed3a64

PaulusParssinen requested a review from MichalStrehovsky as a code owner May 10, 2024 21:45

dotnet-issue-labeler bot added the area-crossgen2-coreclr label May 10, 2024

PaulusParssinen changed the title ~~Spanify some Utf8String and Utf8StringBuilder use~~ Spanify some Utf8String and Utf8StringBuilder use May 10, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 10, 2024

neon-sunset reviewed May 11, 2024

src/coreclr/tools/Common/Internal/Text/Utf8String.cs (outdated, resolved)

src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs (resolved)

jkotas reviewed May 11, 2024

src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs (outdated, resolved)

PaulusParssinen and others added 4 commits May 11, 2024 02:26

Remove Utf8StringBuilder.UnderlyingArray

ace1494

Co-authored-by: Jan Kotas <[email protected]>

Revert "Remove Utf8StringBuilder.UnderlyingArray"

0002c62

This reverts commit ace1494. It was used by DwarfEhFrame.cs

Use SequenceCompareTo

fe86998

UnderlyingArray -> AsSpan()

51b63e3

jkotas reviewed May 11, 2024

src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs (outdated, resolved)

PaulusParssinen and others added 4 commits May 11, 2024 04:00

Only return filled-portion of the buffer in Utf8StringBuilder.AsSpan

bafd05f

Co-authored-by: Jan Kotas <[email protected]>

Write null-terminator for the augmentationString.

77a600d

Utf8StringBuilder.Append(char) -> Utf8StringBuilder.Append(byte)

2e050a3

* Only ASCII constant chars were passed to this method

Remove unnecessary cast

5dd1cbf

PaulusParssinen commented May 11, 2024

src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs (outdated, resolved)

PaulusParssinen commented May 11, 2024

src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs (outdated, resolved)

PaulusParssinen commented May 11, 2024

src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs (resolved)

jkotas and others added 2 commits May 10, 2024 22:42

Update src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs

19106c7

Co-authored-by: Paulus Pärssinen <[email protected]>

Update src/coreclr/tools/Common/Internal/Text/Utf8StringBuilder.cs

00d4e54

Co-authored-by: Paulus Pärssinen <[email protected]>

build-analysis bot mentioned this pull request May 11, 2024

Test failure in System.Numerics.Tensors.Tests.SingleGenericTensorPrimitives.SpanDestinationFunctions_SpecialValues #101731

Closed

am11 reviewed May 11, 2024

src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/ObjectWriter/Dwarf/DwarfEhFrame.cs (outdated, resolved)

Use UTF8 literals for single characters too

8a97513

PaulusParssinen and others added 5 commits May 11, 2024 17:39

Revert "Use UTF8 literals for single characters too"

bb4477b

This reverts commit 8a97513.

Revert "Remove unnecessary cast"

7838413

This reverts commit 5dd1cbf.

Revert "Utf8StringBuilder.Append(char) -> Utf8StringBuilder.Append(byte)"

1f80778

This reverts commit 2e050a3.

Add Ascii.IsValid assert to Utf8StringBuilder.Append

b7c0fdb

Merge branch 'main' into spanify-internal-text-utf8string

ae917bc

This was referenced May 12, 2024

slow macOS - "##[error]The job running on agent Azure Pipelines 9 ran longer than the maximum time of 60 minutes." dotnet/dnceng#1883

Open

Dead lettering tests #101524

Closed

Merge branch 'main' into spanify-internal-text-utf8string

b007b64

jkotas approved these changes May 13, 2024

jkotas merged commit b48a639 into dotnet:main May 14, 2024
108 checks passed

PaulusParssinen deleted the spanify-internal-text-utf8string branch May 15, 2024 03:51

github-actions bot locked and limited conversation to collaborators Jun 14, 2024
