Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spanify some Utf8String and Utf8StringBuilder use #102101

Merged
merged 26 commits into from
May 14, 2024

Conversation

PaulusParssinen
Copy link
Contributor

@PaulusParssinen PaulusParssinen commented May 10, 2024

Address TODO in the Utf8String and Utf8StringBuilder used by ILC and R2R. Saw some potential use for newer more efficient APIs. These are seem to be primarily used by name mangling, which can definitely be improved further.

Let's see if anything breaks in CI first.. Tried to avoid any "public" surface changes because these shared files seem to be used sneakily through the labyrinth of .csprojs s. For example did not rename UnderlyingArray -> UnderlyingSpan and did not try remove the implicit Utf8String -> string casting even though that was a bit of an headache while doing this (changing that is too spooky due to overload resolution).

update: CI seems to be good. Any extra stuff I should be concerned about and try building locally when touching these parts of code?

@PaulusParssinen PaulusParssinen changed the title Spanify some Utf8String and Utf8StringBuilder use Spanify some Utf8String and Utf8StringBuilder use May 10, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 10, 2024
@neon-sunset
Copy link
Contributor

neon-sunset commented May 11, 2024

Now that you are working on this, if you ever have an appetite for public facing API changes (either here or subsequent PR), it is worth changing Utf8StringBuilder to pool arrays, either becoming ValueUtf8StringBuilder (or shorter name) or as is as a start.

@PaulusParssinen
Copy link
Contributor Author

PaulusParssinen commented May 11, 2024

Now that you are working on this, if you ever have an appetite for public facing API changes (either here or subsequent PR), it is worth changing Utf8StringBuilder to pool arrays, either becoming ValueUtf8StringBuilder (or shorter name) or as is as a start.

I have all sorts of ideas like that; utf8 interpolated string handlers and what not. My first knee-jerk reaction was to look to make either of these a ref struct and actually just have a underlying span. I was gonna hoist Encoding.UTF8 accesses when done multiple times in a method. I was gonna do MaxByteCount instead of GetByteCount. That Append(char) is actually dead code in a way too.

But I'm still paranoid of what is indirectly consuming this API. This won't be last PR so I start with small and try get a feel for what kind of optimizations code owners here feel comfortable with.

@neon-sunset
Copy link
Contributor

neon-sunset commented May 11, 2024

utf8 interpolated string handlers and what not

Feel free to steal the code from U8String then if you do go for that, its "default" interpolated string handler pretty much reaches performance ceiling save for special-casing conversion for integers as the path there is a bit heavy but overall fast enough, it beats DefaultInterpolatedStringHandler by ~2x on short lengths. Given this type would still be internal, you could be more aggressive in inline buffer within a struct (I had to tone it down from 256B to 128B and then to just 64B due to stack pressure, copies and state machine size impact).

In the case of ILC it might be worth it to just get a flamegraph first however - good question if it's worth the effort given limited utilization (you can unroll by hand char conversion in a way for JIT to fold it, including the copy, to the correct code point length branch, unroll 1-2-3 byte span copies manually too, etc. - there is a lot of bikeshedding potential).

@jkotas
Copy link
Member

jkotas commented May 11, 2024

Use UTF8 literals for single characters too

I think it would look better to keep Append(char c) method and assert inside the method that c is ASCII char.

@jkotas
Copy link
Member

jkotas commented May 13, 2024

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@jkotas jkotas merged commit b48a639 into dotnet:main May 14, 2024
108 checks passed
@PaulusParssinen PaulusParssinen deleted the spanify-internal-text-utf8string branch May 15, 2024 03:51
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
)

* Address TODOs

* Make Utf8String readonly

* Make Utf8StringBuilder.Append do two Encoding.UTF8 calls instead of many

* The quick is-valid-ascii check in UTF8 encoding _hopefully_ makes this worth the simplification, even though 99% of the inputs are just ASCII. I'm ready to revert this.

* Use UTF8 literals for the appended constant strings in ILC

* Use UTF8 literals for the appended constant strings in R2R

* Use pattern match in Utf8String.Equals(object)

* Use SequenceEquals in Utf8String.Equals(Utf8String)

* Use CommonPrefixLength in Utf8String.Compare(Utf8String, Utf8String)

* Very much inspired (copied) from dotnet#75851

* Remove unused Utf8StringBuilder.LastCharBoundary

* Use SequenceCompareTo

* UnderlyingArray -> AsSpan()

* Only return filled-portion of the buffer in Utf8StringBuilder.AsSpan

Co-authored-by: Jan Kotas <[email protected]>

* Write null-terminator for the augmentationString.

* Utf8StringBuilder.Append(char) -> Utf8StringBuilder.Append(byte)

* Only ASCII constant chars we're passed to this method

* Add Ascii.IsValid assert to Utf8StringBuilder.Append

---------

Co-authored-by: Jan Kotas <[email protected]>
Co-authored-by: Michal Strehovský <[email protected]>
@github-actions github-actions bot locked and limited conversation to collaborators Jun 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-crossgen2-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants