Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-21388 Use a read-only alias string in ucol_openElements and ucol_setText #1471

Conversation

jefgen
Copy link
Member

@jefgen jefgen commented Nov 18, 2020

The ICU APIs ucol_openElements and ucol_setText both are documented as requiring the caller to not modify or change the input string while the respective object is being used to iterate over the text.
(The API docs say "The caller must not modify or delete the text..." while they are being used. See https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/unicode/ucoleitr.h#L106 for example.)

In other words, they are documented as using a read-only alias of the input string from the caller.

Both functions already use the UnicodeString's read-only aliasing constructor for the input string arguments. However, some of the implementation methods that they call actually end up unfortunately making full copies of the input string, which hurts performance.

This change adds internal read-only alias methods to the CollationElementIterator and RuleBasedCollator classes, so that we can avoid making copies of the input string for the ucol_openElements and ucol_setText APIs.

This helps to improve the performance of not just the ucol_openElements and ucol_setText APIs, but also the usearch_* API, as it relies on the two previous APIs.


Note: This change is ported from some performance changes that I made a few months ago in the branch here.
This PR is for this commit: microsoft/icu@e6301ff

I thought that splitting up the commits into separate PRs would be easier to review the changes.
Link to other PRs with perf changes: #1472 #1473

I made these changes while investigating some performance issues for .NET Core. The JIRA ticket has some more info on the observed performance improvements seen when running some of the .NET Core benchmarks (with all of the changes).
[cc: @tarekgh as FYI.]

Checklist

…setText

The ICU APIs ucol_openElements and ucol_setText both are documented as
using a read-only alias of the input string from the caller.
(The API docs say that "The caller must not modify or delete the text..."
while they are being used.)

Both functions already use the UnicodeString's read-only aliasing constructor
for the input string arguments. However, the implementation methods that
they call actually end up making full copies of the input string, which
hurts performance.

This change adds internal read-only alias methods to the CollationElementIterator
and RuleBasedCollator classes, so that we can avoid making copies of the
input string for the ucol_openElements and ucol_setText APIs.
@tarekgh
Copy link

tarekgh commented Nov 18, 2020

@jefgen does this optimization can affect the scenario when using the Collator in multiple threads in same time?


In reply to: 729840245

@jefgen jefgen mentioned this pull request Nov 18, 2020
5 tasks
@jefgen
Copy link
Member Author

jefgen commented Nov 18, 2020

does this optimization can affect the scenario when using the Collator in multiple threads in same time?

The CollationElementIterator internally uses a const Collator:
https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/unicode/coleitr.h#L366

And the usearch_openFromCollator API also uses a const Collator as well:
https://github.com/unicode-org/icu/blob/master/icu4c/source/i18n/unicode/usearch.h

IIUC, I believe that const Collators can indeed be shared/used across multiple threads -- so long as only const methods are used on the Collators. i.e.: You can't share a Collator with other threads and and also mutate it at the same time.
(Based on reading the User Guide here: https://unicode-org.github.io/icu/userguide/design.html#thread-safe-const-apis).

So I think this optimization would apply regardless of if the Collator is used by itself, or with multiple threads.
(@tarekgh: I think that hopefully answers your question (?), but please let me know if I might have missed it).


In reply to: 730013353

// Note: This class predates the change in ICU 2.4, which made the UnicodeString's
// assignment operator and copy constructor always allocate and copy even for
// readonly aliases.
// We could possibly *always* use a readonly alias here with fastCopyFrom, but the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I am sure that the code was intended to take a string by reference/alias, but yes, this is what the code has been doing for a long time.

Could you please submit a ticket for new public API that only keeps a reference? We could call it setTextReference(). We should also propose a new RBC factory method that creates an iterator with not text at all.

@@ -272,12 +272,38 @@ void CollationElementIterator::setOffset(int32_t newOffset,
*/
void CollationElementIterator::setText(const UnicodeString& source,
UErrorCode& status)
{
setText(source, false, status);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should document on setText(string) that it makes a copy of the string. Assuming we can't safely change the behavior, you could do so in this PR.

Alternatively, you could submit a ticket for that and send an API proposal.

(I think it's safer to just document how it's been for a long time.)

@@ -1631,6 +1631,19 @@ RuleBasedCollator::createCollationElementIterator(const UnicodeString& source) c
return cei;
}

// Same as the above method, but uses a real-only alias of the input source string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind checking the inside-ICU call sites of the existing function? I suspect that they should all call this new one.

* @param status the error code status.
* @internal
*/
void setText(const UnicodeString &str, UBool readOnly, UErrorCode &status);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename this function to internalSetText(...).

Actually, I think it would be nicer to have separate functions, setText() which continues to make a copy, and internalSetTextReference(string, status) which does the fastCopyFrom() keeping the alias. More readable than a boolean parameter.

* @param order the collation object.
* @param status the error code status.
*/
CollationElementIterator(const UnicodeString& sourceText, UBool readOnly,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just add the readOnly parameter to the existing constructor? Too many call sites to modify? Might be a good opportunity to review them to see which needs what...

Another possibility is to not change the constructor nor add a new one at all. Instead, call sites could construct with an empty string and then call the new setter.

* the based Collator.
* @internal
*/
virtual CollationElementIterator* createCollationElementIteratorReadOnlyAlias(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prefix the function name with "internal".

* the based Collator.
* @internal
*/
virtual CollationElementIterator* createCollationElementIteratorReadOnlyAlias(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it have to be virtual? If it does, you can't HIDE it because you would create an incompatible vtable.

I think a good alternative is to not add anything here. Change call sites to pass in an empty string and then call the new text setter. We can look at optimizing a CEI with no string / an empty string.

@markusicu
Copy link
Member

FYI good catch for finding the unintended copy (even if we have to live with it in existing C++ API) -- thanks!

@markusicu
Copy link
Member

Please modify this PR to use https://unicode-org.atlassian.net/browse/ICU-21533
Getting this into ICU 70 would be lovely. Maybe even early?

Base automatically changed from master to main March 24, 2021 17:56
@markusicu
Copy link
Member

@jefgen ping :-)

Please modify this PR to use https://unicode-org.atlassian.net/browse/ICU-21533
Getting this into ICU 70 would be lovely. Maybe even early?

Note that you have another PR for ICU-21533. If that PR gets into ICU 70 but this PR slips past it, then we need to split the ticket again.

@jefgen
Copy link
Member Author

jefgen commented Sep 22, 2021

Thank you for all the feedback on this PR!

Note that you have another PR for ICU-21533. If that PR gets into ICU 70 but this PR slips past it, then we need to split the ticket again.

Thanks Markus. Unfortunately haven't had time to work on this... I'll need to split the ticket and create a new one for ICU 71.

EDIT: Created a new ticket here: https://unicode-org.atlassian.net/browse/ICU-21760

@markusicu
Copy link
Member

Hi @jefgen

Thank you for all the feedback on this PR!

Note that you have another PR for ICU-21533. If that PR gets into ICU 70 but this PR slips past it, then we need to split the ticket again.

Thanks Markus. Unfortunately haven't had time to work on this... I'll need to split the ticket and create a new one for ICU 71.

EDIT: Created a new ticket here: https://unicode-org.atlassian.net/browse/ICU-21760

Could you please retarget this PR for ICU-21760 and finish it for 71?

@jefgen
Copy link
Member Author

jefgen commented Feb 14, 2022

Could you please retarget this PR for ICU-21760 and finish it for 71?

Unfortunately, I don't think I'll have time to get this in for ICU 71.

Instead of keeping this PR open and punting it from release to release, I'm going to close this PR for now. There are no other commits on the ticket ICU-21760, so I'll move it "future", until there is more time to work on this.

@jefgen jefgen closed this Feb 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants