-
-
Notifications
You must be signed in to change notification settings - Fork 756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-21388 Use a read-only alias string in ucol_openElements and ucol_setText #1471
ICU-21388 Use a read-only alias string in ucol_openElements and ucol_setText #1471
Conversation
…setText The ICU APIs ucol_openElements and ucol_setText both are documented as using a read-only alias of the input string from the caller. (The API docs say that "The caller must not modify or delete the text..." while they are being used.) Both functions already use the UnicodeString's read-only aliasing constructor for the input string arguments. However, the implementation methods that they call actually end up making full copies of the input string, which hurts performance. This change adds internal read-only alias methods to the CollationElementIterator and RuleBasedCollator classes, so that we can avoid making copies of the input string for the ucol_openElements and ucol_setText APIs.
The And the IIUC, I believe that So I think this optimization would apply regardless of if the Collator is used by itself, or with multiple threads. In reply to: 730013353 |
// Note: This class predates the change in ICU 2.4, which made the UnicodeString's | ||
// assignment operator and copy constructor always allocate and copy even for | ||
// readonly aliases. | ||
// We could possibly *always* use a readonly alias here with fastCopyFrom, but the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. I am sure that the code was intended to take a string by reference/alias, but yes, this is what the code has been doing for a long time.
Could you please submit a ticket for new public API that only keeps a reference? We could call it setTextReference(). We should also propose a new RBC factory method that creates an iterator with not text at all.
@@ -272,12 +272,38 @@ void CollationElementIterator::setOffset(int32_t newOffset, | |||
*/ | |||
void CollationElementIterator::setText(const UnicodeString& source, | |||
UErrorCode& status) | |||
{ | |||
setText(source, false, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should document on setText(string) that it makes a copy of the string. Assuming we can't safely change the behavior, you could do so in this PR.
Alternatively, you could submit a ticket for that and send an API proposal.
(I think it's safer to just document how it's been for a long time.)
@@ -1631,6 +1631,19 @@ RuleBasedCollator::createCollationElementIterator(const UnicodeString& source) c | |||
return cei; | |||
} | |||
|
|||
// Same as the above method, but uses a real-only alias of the input source string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind checking the inside-ICU call sites of the existing function? I suspect that they should all call this new one.
* @param status the error code status. | ||
* @internal | ||
*/ | ||
void setText(const UnicodeString &str, UBool readOnly, UErrorCode &status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename this function to internalSetText(...)
.
Actually, I think it would be nicer to have separate functions, setText()
which continues to make a copy, and internalSetTextReference(string, status)
which does the fastCopyFrom() keeping the alias. More readable than a boolean parameter.
* @param order the collation object. | ||
* @param status the error code status. | ||
*/ | ||
CollationElementIterator(const UnicodeString& sourceText, UBool readOnly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just add the readOnly parameter to the existing constructor? Too many call sites to modify? Might be a good opportunity to review them to see which needs what...
Another possibility is to not change the constructor nor add a new one at all. Instead, call sites could construct with an empty string and then call the new setter.
* the based Collator. | ||
* @internal | ||
*/ | ||
virtual CollationElementIterator* createCollationElementIteratorReadOnlyAlias( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please prefix the function name with "internal".
* the based Collator. | ||
* @internal | ||
*/ | ||
virtual CollationElementIterator* createCollationElementIteratorReadOnlyAlias( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it have to be virtual? If it does, you can't HIDE it because you would create an incompatible vtable.
I think a good alternative is to not add anything here. Change call sites to pass in an empty string and then call the new text setter. We can look at optimizing a CEI with no string / an empty string.
FYI good catch for finding the unintended copy (even if we have to live with it in existing C++ API) -- thanks! |
Please modify this PR to use https://unicode-org.atlassian.net/browse/ICU-21533 |
@jefgen ping :-)
Note that you have another PR for ICU-21533. If that PR gets into ICU 70 but this PR slips past it, then we need to split the ticket again. |
Thank you for all the feedback on this PR!
Thanks Markus. Unfortunately haven't had time to work on this... I'll need to split the ticket and create a new one for ICU 71. EDIT: Created a new ticket here: https://unicode-org.atlassian.net/browse/ICU-21760 |
Hi @jefgen
Could you please retarget this PR for ICU-21760 and finish it for 71? |
Unfortunately, I don't think I'll have time to get this in for ICU 71. Instead of keeping this PR open and punting it from release to release, I'm going to close this PR for now. There are no other commits on the ticket ICU-21760, so I'll move it "future", until there is more time to work on this. |
The ICU APIs
ucol_openElements
anducol_setText
both are documented as requiring the caller to not modify or change the input string while the respective object is being used to iterate over the text.(The API docs say "The caller must not modify or delete the text..." while they are being used. See https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/unicode/ucoleitr.h#L106 for example.)
In other words, they are documented as using a read-only alias of the input string from the caller.
Both functions already use the
UnicodeString
's read-only aliasing constructor for the input string arguments. However, some of the implementation methods that they call actually end up unfortunately making full copies of the input string, which hurts performance.This change adds internal read-only alias methods to the
CollationElementIterator
andRuleBasedCollator
classes, so that we can avoid making copies of the input string for theucol_openElements
anducol_setText
APIs.This helps to improve the performance of not just the
ucol_openElements
anducol_setText
APIs, but also theusearch_*
API, as it relies on the two previous APIs.Note: This change is ported from some performance changes that I made a few months ago in the branch here.
This PR is for this commit: microsoft/icu@e6301ff
I thought that splitting up the commits into separate PRs would be easier to review the changes.
Link to other PRs with perf changes: #1472 #1473
I made these changes while investigating some performance issues for .NET Core. The JIRA ticket has some more info on the observed performance improvements seen when running some of the .NET Core benchmarks (with all of the changes).
[cc: @tarekgh as FYI.]
Checklist