-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[browser][non-icu] HybridGlobalization
indexing
#84471
Conversation
Tagging subscribers to 'arch-wasm': @lewing Issue DetailsImplements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989. Old, icu-based private API: GlobalizationNative_IndexOf, GlobalizationNative_LastIndexOf New, non-icu private API: IndexOf. Affected public API (see: tests in CompareInfoTests.IndexOf and CompareInfoTests.LastIndexOf.cs):
All changes in behavior are listed in docs\design\features\hybrid-globalization.md cc @SamMonoRT
|
Something I should have noticed in previous reviews: In HybridGlobalization right now you pass strings as ptr+length pairs. This means that any time an interned string is passed across the boundary to JS, it has to be decoded to a new JS string that gets thrown away. If you were to pass runtime/src/mono/wasm/runtime/strings.ts Line 112 in 50c1214
For cases where you are indexing into the char* buffer one element at a time and never making a temporary JS string, it's fine to not use this. It seems like it would probably be worth applying this to IndexOf/LastIndexOf and comparisons, but case conversion doesn't seem likely to need interning. |
Thank you, I did not know about interning. The original motivation for using pointers was optimizing memory usage - to avoid creating new "return" strings on JS side when we can update the string allocated by the managed code. I would like to ask @pavelsavara for opinion here because he hinted me to the current approach. Changing the way |
I don't remember the details of that conversation anymore. Looking at the PR now, I wonder why we are using icall instead of the new generated interop. With the later, you would get the string interning without any effort and trouble with GC roots. |
That is true but using new Interop inside of |
I see. In that case Kate's suggesting is better. |
/// <returns></returns> | ||
// ToDo: clean up with IndexOfOrdinalHelper from .Icu | ||
private unsafe int IndexOfOrdinalIgnoreCaseHelperJS(out string exceptionMessage, ReadOnlySpan<char> source, ReadOnlySpan<char> target, CompareOptions options, int* matchLengthPtr, bool fromBeginning) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it better to move this function https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Icu.cs#L95 to CompareInfo.cs and call from ICU and WebAssembly? From first look implementations look identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It differs with one goto: InteropCall
. I tried to merge these two implementations: icu + wasm by passing the delegate to the InteropCall
but unfortunately delegates do not support unsafe parameters (pointers), so the cleanup failed. Generally all CompareInfo
should get refactored because it's a huge class and I was thinking about doing it in follow-up, when collations on HybridGlobalization would be already working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I have some concerns about refactoring these functions. Originally, ICU has 2 very similar functions that differ only with lines
if (char.IsAsciiLetterLower(valueChar))
valueChar = (char)(valueChar - 0x20);
if (char.IsAsciiLetterLower(targetChar))
targetChar = (char)(targetChar - 0x20);
but they have this comment at the top:
/// Duplicate of IndexOfOrdinalHelper that also handles ignore case. Can't converge both methods
/// as the JIT wouldn't be able to optimize the ignoreCase path away.
An attempt to merge Wasm + Icu functions is similar to the situation described. @ViktorHofer, you are the author of the comment, could you help us? Can you think about an effective way of refactoring e.g. IndexOfOrdinalIgnoreCaseHelper
and IndexOfOrdinalIgnoreCaseHelperJS
into one function, when they differ only with the contents of goto InteropCall
instruction?
} | ||
} | ||
|
||
private unsafe int IndexOfOrdinalHelperJS(out string exceptionMessage, ReadOnlySpan<char> source, ReadOnlySpan<char> target, CompareOptions options, int* matchLengthPtr, bool fromBeginning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for this. Looks like most of the logic can be reused.
Planning to first fix it for main (#84695), then here. Converting to draft. |
Minor note (not important): The You could probably replace it with |
You are right, startsWith is correct in this context. We are using ~0.1% on Split currently (in comparison, not indexing, this implementation of indexing is very slow and needs to be re-designed). I don't see any impact on the length of |
export function segment_string_locale_sensitive(string: string, locale: string | undefined) : Intl.SegmentData[] | ||
{ | ||
const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" }); | ||
return Array.from(segmenter.segment(string)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return the enumerator instead. It would be allocating less objects on average, when we find earlier match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here: https://github.com/ilonatommy/runtime/tree/hg-indexing-faster. This PR actually should be already closed, not drafted. Sorry if it confused you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indexer is stll slow, thoug. For 64kB it's ~2100ms, which is unacceptable. That's why the branch is still WIP and not posted as a PR.
Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.
Old, icu-based private API: GlobalizationNative_IndexOf, GlobalizationNative_LastIndexOf
New, non-icu private API: IndexOf.
Affected public API (see: tests in CompareInfoTests.IndexOf and CompareInfoTests.LastIndexOf.cs):
All changes in behavior are listed in docs\design\features\hybrid-globalization.md
cc @SamMonoRT