Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-21533 Lazily create the internal break iterator used in StringSearch & improve error handling #1473

Conversation

jefgen
Copy link
Member

@jefgen jefgen commented Nov 18, 2020

The usearch_* API has an internal character break iterator that it uses for handling combining characters or sequences.

However, the internal break iterator isn't always strictly needed or used though -- for example when matching with pure ASCII text. Creating the break iterator is a somewhat non-trivial operation though...

I did some profiling using ICU 64 on Windows with a test program that called the usearch APIs in a loop (calling usearch_open, then usearch_first), and on average, anywhere from 7-10% of the time was spent on just creating the break iterator (which wasn't actually ever used).

By lazily creating the break iterator when needed we can improve the performance of the usearch_* API for cases where it isn't used.

Additionally, I noticed that there were several places in the code that performed memory allocation(s) without any error checking for OOM. This PR also adds/improves the error checking as well.


Note: This change is ported from some performance changes that I made a few months ago in the branch here.
This PR is for this commit: microsoft/icu@5e4e9b5

I thought that splitting up the commits into separate PRs would be easier to review the changes.
Link to other PRs with perf changes: #1471 #1472

I made these changes while investigating some performance issues for .NET Core. The JIRA ticket has some more info on the observed performance improvements seen when running some of the .NET Core benchmarks (with all of the changes).
[cc: @tarekgh as FYI.]

Checklist

icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
@jefgen jefgen force-pushed the jefgen/ICU-21388_Perf_Changes_Lazily_create_BreakIter_in_usearch branch from 14ab1be to d4c0066 Compare November 23, 2020 02:00
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/usearch.cpp is different
  • icu4c/source/i18n/usrchimp.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@aheninger aheninger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fair number of the changes are to code that is excluded by #if BOYER_MOORE

Maybe we should think about removing it entirely; it's presence makes the active code much harder to see and follow, and, as you noted, it's been dead for ages. That should probably be done in a separate PR, though.

icu4c/source/i18n/usearch.cpp Outdated Show resolved Hide resolved
@markusicu
Copy link
Member

A fair number of the changes are to code that is excluded by #if BOYER_MOORE

Maybe we should think about removing it entirely; it's presence makes the active code much harder to see and follow, and, as you noted, it's been dead for ages. That should probably be done in a separate PR, though.

I am ok with doing that in a separate PR (before or after the lazy creation).

@markusicu
Copy link
Member

Also, Jeff, it would be nice if you could do the lazy creation in Java as well.

Copy link
Contributor

@aheninger aheninger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang Github; I didn't intend to approve in my previous comment.

@markusicu
Copy link
Member

Please modify this PR to use https://unicode-org.atlassian.net/browse/ICU-21533
Getting this into ICU 70 would be lovely. Maybe even early?

Base automatically changed from master to main March 24, 2021 17:56
@jefgen
Copy link
Member Author

jefgen commented Aug 18, 2021

Sorry for the long delay in getting back to this PR...

A fair number of the changes are to code that is excluded by #if BOYER_MOORE
Maybe we should think about removing it entirely; it's presence makes the active code much harder to see and follow, and, as you noted, it's been dead for ages. That should probably be done in a separate PR, though.

I am ok with doing that in a separate PR (before or after the lazy creation).

Thank you both. That sounds like a good idea. I'll file a separate ticket for doing that (https://unicode-org.atlassian.net/browse/ICU-21710), and will make that change first. Once that is done, I'll rebase the changes in this PR which should hopefully make things much cleaner.

@jefgen jefgen force-pushed the jefgen/ICU-21388_Perf_Changes_Lazily_create_BreakIter_in_usearch branch from d4c0066 to 507592c Compare August 30, 2021 23:32
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/usearch.cpp is different
  • icu4c/source/i18n/usrchimp.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jefgen
Copy link
Member Author

jefgen commented Aug 30, 2021

I forced-pushed an update to this PR after creating separate PRs for cleaning up the BOYER_MOORE code (#1800, #1830).
(I rebased on top of the latest changes from main, so there are no longer any changes for the dead BM code in this PR).
The updated PR also incorporates changes based on the feedback from Andy and Markus on the review.
Thank you both for the feedback!

@jefgen jefgen changed the title ICU-21388 Lazily create the internal break iterator used in StringSearch & improve error handling ICU-21533 Lazily create the internal break iterator used in StringSearch & improve error handling Sep 17, 2021
@jefgen
Copy link
Member Author

jefgen commented Sep 17, 2021

Ah, I just realized that the commit has the wrong ICU ticket number on it. I'll force push an update to fix it.

@jefgen jefgen force-pushed the jefgen/ICU-21388_Perf_Changes_Lazily_create_BreakIter_in_usearch branch from 507592c to 2d5ebd1 Compare September 17, 2021 00:24
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/usearch.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jefgen
Copy link
Member Author

jefgen commented Sep 17, 2021

I fixed the commit ticket number and rebased on top of the latest changes from main. PTAL. :)
Thank you!

@markusicu
Copy link
Member

@jefgen ping...

Also, Jeff, it would be nice if you could do the lazy creation in Java as well.

@jefgen
Copy link
Member Author

jefgen commented Sep 20, 2021

@jefgen ping...

Also, Jeff, it would be nice if you could do the lazy creation in Java as well.

Sigh. I don't think I'll be able to get to the lazy creation in Java by the ICU 70 code freeze... I'll file a follow-up ticket for ICU 71 for this.

EDIT: Filed https://unicode-org.atlassian.net/browse/ICU-21758

@markusicu
Copy link
Member

Hi @aheninger, you "requested changes" last November because of changes in obsolete Boyer Moore code which Jeff has since removed. Is it ok to dismiss the "request changes"?

@markusicu
Copy link
Member

@jefgen should we try to quickly discuss the question about error behavior? Do you want to just go with what you & Andy decided? Do you want to punt this once more to 71, and add the Java port?

@markusicu markusicu dismissed aheninger’s stale review September 22, 2021 17:43

old concerns addressed

@jefgen
Copy link
Member Author

jefgen commented Sep 22, 2021

@jefgen should we try to quickly discuss the question about error behavior? Do you want to just go with what you & Andy decided? Do you want to punt this once more to 71, and add the Java port?

FWIW, I'd sort of lean towards keeping the behavior as-is for the usearch_* APIs in this PR (setting "match not found" on error).
We could consider changing the behavior, but I'd prefer to do that as a separate ticket for ICU 71.

EDIT: To add, there was some discussion in the ICU-TC about this PR, but there was no strong opinion either way.

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pse squash

@markusicu
Copy link
Member

FWIW, I'd sort of lean towards keeping the behavior as-is for the usearch_* APIs in this PR (setting "match not found" on error).
We could consider changing the behavior, but I'd prefer to do that as a separate ticket for ICU 71.

Works for me. Please submit a separate ticket for this, and add Peter, Andy, and myself to the "watchers" list.

…rch, and improve error handling.

Change NULL to nullptr.
@jefgen
Copy link
Member Author

jefgen commented Sep 22, 2021

FWIW, I'd sort of lean towards keeping the behavior as-is for the usearch_* APIs in this PR (setting "match not found" on error).
We could consider changing the behavior, but I'd prefer to do that as a separate ticket for ICU 71.

Works for me. Please submit a separate ticket for this, and add Peter, Andy, and myself to the "watchers" list.

Sure thing. I filed a new ticket here: https://unicode-org.atlassian.net/browse/ICU-21761

@jefgen jefgen force-pushed the jefgen/ICU-21388_Perf_Changes_Lazily_create_BreakIter_in_usearch branch from a24fb93 to 33f72d6 Compare September 22, 2021 20:48
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm tnx

@jefgen jefgen merged commit 0a47841 into unicode-org:main Sep 22, 2021
@jefgen jefgen deleted the jefgen/ICU-21388_Perf_Changes_Lazily_create_BreakIter_in_usearch branch September 22, 2021 22:25
@FrankYFTang
Copy link
Contributor

When I try ICU70-RC in chrome many blink unit test break because
usearch_next return -127 . I wonder would that be related to this PR.

The breaking code is in line 180 (the DCHECK_EQ code) of https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/editing/iterators/text_searcher_icu.cc;l=180;bpv=1


bool TextSearcherICU::NextMatchResultInternal(MatchResultICU& result) {
  UErrorCode status = U_ZERO_ERROR;
  const int match_start = usearch_next(searcher_, &status);
  DCHECK_EQ(status, U_ZERO_ERROR);

@FrankYFTang
Copy link
Contributor

I got failure in

    // Need to create the internal break iterator.
    strsrch->search->internalBreakIter = ubrk_open(UBRK_CHARACTER,
        ucol_getLocaleByType(strsrch->collator, ULOC_VALID_LOCALE, &status),
        strsrch->search->text, strsrch->search->textLength, &status);

@FrankYFTang
Copy link
Contributor

It try to create a BreakIteratore character instance with locale "en_US@collation=search" and fail with a status = -127 in ICU70-rc.

@jefgen
Copy link
Member Author

jefgen commented Oct 15, 2021

usearch_next return -127

Hm, if I understand correctly, -127 is a warning, not a failure.

U_USING_DEFAULT_WARNING = -127,

https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/utypes.h#L425

It looks like perhaps the chromium code is checking for exact equal with U_ZERO_ERROR.
(The line DCHECK_EQ(status, U_ZERO_ERROR);)

bool TextSearcherICU::NextMatchResultInternal(MatchResultICU& result) {
  UErrorCode status = U_ZERO_ERROR;
  const int match_start = usearch_next(searcher_, &status);
  DCHECK_EQ(status, U_ZERO_ERROR);

I think this might be too strict.

Looking at the call to usearch_open, it considers U_USING_DEFAULT_WARNING to be success:

  UStringSearch* searcher =
      usearch_open(&kNewlineCharacter, 1, &kNewlineCharacter, 1,
                   search_collator_name.Utf8().c_str(), nullptr, &status);
  DCHECK(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING ||
         status == U_USING_DEFAULT_WARNING)
      << status;
  return searcher;

https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/editing/iterators/text_searcher_icu.cc;l=53;bpv=1

Since the break iterator gets created lazily now, the result of U_USING_DEFAULT_WARNING status is moved from the call to usearch_open to usearch_next.

@FrankYFTang
Copy link
Contributor

looks like icu69-1 also create that break iterator with the same locale with -127

@FrankYFTang
Copy link
Contributor

I see. Probably I should change it to DCHECK(U_SUCCESS(status));

@jefgen
Copy link
Member Author

jefgen commented Oct 15, 2021

Probably something like this:

bool TextSearcherICU::NextMatchResultInternal(MatchResultICU& result) {
  UErrorCode status = U_ZERO_ERROR;
  const int match_start = usearch_next(searcher_, &status);
-  DCHECK_EQ(status, U_ZERO_ERROR);
+  DCHECK(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING ||
+         status == U_USING_DEFAULT_WARNING)
+     << status;

or maybe just check that U_SUCCESS(status) is true?

@FrankYFTang
Copy link
Contributor

Thanks. Basically we shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next() from 69.1 to 70-rc

The cl to correct the test is in https://chromium-review.googlesource.com/c/chromium/src/+/3225401

pull bot pushed a commit to luojiguicai/chromium that referenced this pull request Oct 15, 2021
This correction is needed for ICU70 which lazy build the
internal break iterator instead of during constructor therefore
shift the return of U_USING_DEFAULT_WARNING in status
from usearch_open() to usearch_next(). Instead of explicitly
check different possible status code the DCHECK should just
use ICU macro U_SUCCESS() instead.

This change is needed to land to unblock the landing of ICU70
which have unicode-org/icu#1473
that shift the returning of U_USING_DEFAULT_WARNING in status.

Bug: 1260116
Change-Id: Ic1ed3dccc5d95e3fd1804321dc98fbe7255cff4d
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225401
Commit-Queue: Kent Tamura <[email protected]>
Commit-Queue: Frank Tang <[email protected]>
Reviewed-by: Kent Tamura <[email protected]>
Cr-Commit-Position: refs/heads/main@{#931923}
@jefgen
Copy link
Member Author

jefgen commented Oct 15, 2021

Thanks. Basically we shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next() from 69.1 to 70-rc

The cl to correct the test is in https://chromium-review.googlesource.com/c/chromium/src/+/3225401

Thanks @FrankYFTang!

pull bot pushed a commit to Mu-L/chromium that referenced this pull request Oct 16, 2021
This correction is needed for ICU70 which lazy build the
internal break iterator instead of during constructor therefore
shift the return of U_USING_DEFAULT_WARNING in status
from usearch_open() to usearch_next(). Instead of explicitly
check different possible status code the DCHECK should just
use ICU macro U_SUCCESS() instead.

This change is needed to land to unblock the landing of ICU70
which have unicode-org/icu#1473
that shift the returning of U_USING_DEFAULT_WARNING in status.

Bug: 1260116
Change-Id: Id78dae09d64d19d0c7840287033da75d93f7845a
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225780
Reviewed-by: Lei Zhang <[email protected]>
Commit-Queue: Frank Tang <[email protected]>
Cr-Commit-Position: refs/heads/main@{#932111}
pull bot pushed a commit to ZSX-JOJO/chromium that referenced this pull request Oct 19, 2021
This correction is needed for ICU70 which lazy build the
internal break iterator instead of during constructor therefore
shift the return of U_USING_DEFAULT_WARNING in status
from usearch_open() to usearch_next(). Instead of explicitly
check different possible status code the DCHECK should just
use ICU macro U_SUCCESS() instead.

This change is needed to land to unblock the landing of ICU70
which have unicode-org/icu#1473
that shift the returning of U_USING_DEFAULT_WARNING in status.

Bug: 1260116
Change-Id: Ie5a2f402399b51120214a9ce8c8e4f5c249d0160
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225781
Reviewed-by: Jungshik Shin <[email protected]>
Commit-Queue: Frank Tang <[email protected]>
Cr-Commit-Position: refs/heads/main@{#932782}
zeng450026937 pushed a commit to zeng450026937/base that referenced this pull request Oct 22, 2021
This correction is needed for ICU70 which lazy build the
internal break iterator instead of during constructor therefore
shift the return of U_USING_DEFAULT_WARNING in status
from usearch_open() to usearch_next(). Instead of explicitly
check different possible status code the DCHECK should just
use ICU macro U_SUCCESS() instead.

This change is needed to land to unblock the landing of ICU70
which have unicode-org/icu#1473
that shift the returning of U_USING_DEFAULT_WARNING in status.

Bug: 1260116
Change-Id: Ie5a2f402399b51120214a9ce8c8e4f5c249d0160
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225781
Reviewed-by: Jungshik Shin <[email protected]>
Commit-Queue: Frank Tang <[email protected]>
Cr-Commit-Position: refs/heads/main@{#932782}
NOKEYCHECK=True
GitOrigin-RevId: 1556ec8248fe58e802191662a319c683fe52e3f1
qoor pushed a commit to qoor/basium that referenced this pull request Apr 7, 2022
This correction is needed for ICU70 which lazy build the
internal break iterator instead of during constructor therefore
shift the return of U_USING_DEFAULT_WARNING in status
from usearch_open() to usearch_next(). Instead of explicitly
check different possible status code the DCHECK should just
use ICU macro U_SUCCESS() instead.

This change is needed to land to unblock the landing of ICU70
which have unicode-org/icu#1473
that shift the returning of U_USING_DEFAULT_WARNING in status.

Bug: 1260116
Change-Id: Ie5a2f402399b51120214a9ce8c8e4f5c249d0160
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225781
Reviewed-by: Jungshik Shin <[email protected]>
Commit-Queue: Frank Tang <[email protected]>
Cr-Commit-Position: refs/heads/main@{#932782}
mjfroman pushed a commit to mjfroman/moz-libwebrtc-third-party that referenced this pull request Oct 14, 2022
This correction is needed for ICU70 which lazy build the
internal break iterator instead of during constructor therefore
shift the return of U_USING_DEFAULT_WARNING in status
from usearch_open() to usearch_next(). Instead of explicitly
check different possible status code the DCHECK should just
use ICU macro U_SUCCESS() instead.

This change is needed to land to unblock the landing of ICU70
which have unicode-org/icu#1473
that shift the returning of U_USING_DEFAULT_WARNING in status.

Bug: 1260116
Change-Id: Ic1ed3dccc5d95e3fd1804321dc98fbe7255cff4d
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225401
Commit-Queue: Kent Tamura <[email protected]>
Commit-Queue: Frank Tang <[email protected]>
Reviewed-by: Kent Tamura <[email protected]>
Cr-Commit-Position: refs/heads/main@{#931923}
NOKEYCHECK=True
GitOrigin-RevId: 1bc5bb4ba6c7922ad7dd57bd57cb5644c6efccd8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants