-
-
Notifications
You must be signed in to change notification settings - Fork 756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-21533 Lazily create the internal break iterator used in StringSearch & improve error handling #1473
ICU-21533 Lazily create the internal break iterator used in StringSearch & improve error handling #1473
Conversation
14ab1be
to
d4c0066
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A fair number of the changes are to code that is excluded by #if BOYER_MOORE
Maybe we should think about removing it entirely; it's presence makes the active code much harder to see and follow, and, as you noted, it's been dead for ages. That should probably be done in a separate PR, though.
I am ok with doing that in a separate PR (before or after the lazy creation). |
Also, Jeff, it would be nice if you could do the lazy creation in Java as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dang Github; I didn't intend to approve in my previous comment.
Please modify this PR to use https://unicode-org.atlassian.net/browse/ICU-21533 |
Sorry for the long delay in getting back to this PR...
Thank you both. That sounds like a good idea. I'll file a separate ticket for doing that (https://unicode-org.atlassian.net/browse/ICU-21710), and will make that change first. Once that is done, I'll rebase the changes in this PR which should hopefully make things much cleaner. |
d4c0066
to
507592c
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
I forced-pushed an update to this PR after creating separate PRs for cleaning up the BOYER_MOORE code (#1800, #1830). |
Ah, I just realized that the commit has the wrong ICU ticket number on it. I'll force push an update to fix it. |
507592c
to
2d5ebd1
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
I fixed the commit ticket number and rebased on top of the latest changes from |
@jefgen ping...
|
Sigh. I don't think I'll be able to get to the lazy creation in Java by the ICU 70 code freeze... I'll file a follow-up ticket for ICU 71 for this. EDIT: Filed https://unicode-org.atlassian.net/browse/ICU-21758 |
Hi @aheninger, you "requested changes" last November because of changes in obsolete Boyer Moore code which Jeff has since removed. Is it ok to dismiss the "request changes"? |
@jefgen should we try to quickly discuss the question about error behavior? Do you want to just go with what you & Andy decided? Do you want to punt this once more to 71, and add the Java port? |
FWIW, I'd sort of lean towards keeping the behavior as-is for the EDIT: To add, there was some discussion in the ICU-TC about this PR, but there was no strong opinion either way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm pse squash
Works for me. Please submit a separate ticket for this, and add Peter, Andy, and myself to the "watchers" list. |
…rch, and improve error handling. Change NULL to nullptr.
Sure thing. I filed a new ticket here: https://unicode-org.atlassian.net/browse/ICU-21761 |
a24fb93
to
33f72d6
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm tnx
When I try ICU70-RC in chrome many blink unit test break because The breaking code is in line 180 (the DCHECK_EQ code) of https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/editing/iterators/text_searcher_icu.cc;l=180;bpv=1
|
I got failure in
|
It try to create a BreakIteratore character instance with locale "en_US@collation=search" and fail with a status = -127 in ICU70-rc. |
Hm, if I understand correctly, -127 is a warning, not a failure.
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/utypes.h#L425 It looks like perhaps the chromium code is checking for exact equal with
I think this might be too strict. Looking at the call to UStringSearch* searcher =
usearch_open(&kNewlineCharacter, 1, &kNewlineCharacter, 1,
search_collator_name.Utf8().c_str(), nullptr, &status);
DCHECK(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING ||
status == U_USING_DEFAULT_WARNING)
<< status;
return searcher; Since the break iterator gets created lazily now, the result of |
looks like icu69-1 also create that break iterator with the same locale with -127 |
I see. Probably I should change it to DCHECK(U_SUCCESS(status)); |
Probably something like this: bool TextSearcherICU::NextMatchResultInternal(MatchResultICU& result) {
UErrorCode status = U_ZERO_ERROR;
const int match_start = usearch_next(searcher_, &status);
- DCHECK_EQ(status, U_ZERO_ERROR);
+ DCHECK(status == U_ZERO_ERROR || status == U_USING_FALLBACK_WARNING ||
+ status == U_USING_DEFAULT_WARNING)
+ << status; or maybe just check that U_SUCCESS(status) is true? |
Thanks. Basically we shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next() from 69.1 to 70-rc The cl to correct the test is in https://chromium-review.googlesource.com/c/chromium/src/+/3225401 |
This correction is needed for ICU70 which lazy build the internal break iterator instead of during constructor therefore shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next(). Instead of explicitly check different possible status code the DCHECK should just use ICU macro U_SUCCESS() instead. This change is needed to land to unblock the landing of ICU70 which have unicode-org/icu#1473 that shift the returning of U_USING_DEFAULT_WARNING in status. Bug: 1260116 Change-Id: Ic1ed3dccc5d95e3fd1804321dc98fbe7255cff4d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225401 Commit-Queue: Kent Tamura <[email protected]> Commit-Queue: Frank Tang <[email protected]> Reviewed-by: Kent Tamura <[email protected]> Cr-Commit-Position: refs/heads/main@{#931923}
Thanks @FrankYFTang! |
This correction is needed for ICU70 which lazy build the internal break iterator instead of during constructor therefore shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next(). Instead of explicitly check different possible status code the DCHECK should just use ICU macro U_SUCCESS() instead. This change is needed to land to unblock the landing of ICU70 which have unicode-org/icu#1473 that shift the returning of U_USING_DEFAULT_WARNING in status. Bug: 1260116 Change-Id: Id78dae09d64d19d0c7840287033da75d93f7845a Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225780 Reviewed-by: Lei Zhang <[email protected]> Commit-Queue: Frank Tang <[email protected]> Cr-Commit-Position: refs/heads/main@{#932111}
This correction is needed for ICU70 which lazy build the internal break iterator instead of during constructor therefore shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next(). Instead of explicitly check different possible status code the DCHECK should just use ICU macro U_SUCCESS() instead. This change is needed to land to unblock the landing of ICU70 which have unicode-org/icu#1473 that shift the returning of U_USING_DEFAULT_WARNING in status. Bug: 1260116 Change-Id: Ie5a2f402399b51120214a9ce8c8e4f5c249d0160 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225781 Reviewed-by: Jungshik Shin <[email protected]> Commit-Queue: Frank Tang <[email protected]> Cr-Commit-Position: refs/heads/main@{#932782}
This correction is needed for ICU70 which lazy build the internal break iterator instead of during constructor therefore shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next(). Instead of explicitly check different possible status code the DCHECK should just use ICU macro U_SUCCESS() instead. This change is needed to land to unblock the landing of ICU70 which have unicode-org/icu#1473 that shift the returning of U_USING_DEFAULT_WARNING in status. Bug: 1260116 Change-Id: Ie5a2f402399b51120214a9ce8c8e4f5c249d0160 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225781 Reviewed-by: Jungshik Shin <[email protected]> Commit-Queue: Frank Tang <[email protected]> Cr-Commit-Position: refs/heads/main@{#932782} NOKEYCHECK=True GitOrigin-RevId: 1556ec8248fe58e802191662a319c683fe52e3f1
This correction is needed for ICU70 which lazy build the internal break iterator instead of during constructor therefore shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next(). Instead of explicitly check different possible status code the DCHECK should just use ICU macro U_SUCCESS() instead. This change is needed to land to unblock the landing of ICU70 which have unicode-org/icu#1473 that shift the returning of U_USING_DEFAULT_WARNING in status. Bug: 1260116 Change-Id: Ie5a2f402399b51120214a9ce8c8e4f5c249d0160 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225781 Reviewed-by: Jungshik Shin <[email protected]> Commit-Queue: Frank Tang <[email protected]> Cr-Commit-Position: refs/heads/main@{#932782}
This correction is needed for ICU70 which lazy build the internal break iterator instead of during constructor therefore shift the return of U_USING_DEFAULT_WARNING in status from usearch_open() to usearch_next(). Instead of explicitly check different possible status code the DCHECK should just use ICU macro U_SUCCESS() instead. This change is needed to land to unblock the landing of ICU70 which have unicode-org/icu#1473 that shift the returning of U_USING_DEFAULT_WARNING in status. Bug: 1260116 Change-Id: Ic1ed3dccc5d95e3fd1804321dc98fbe7255cff4d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3225401 Commit-Queue: Kent Tamura <[email protected]> Commit-Queue: Frank Tang <[email protected]> Reviewed-by: Kent Tamura <[email protected]> Cr-Commit-Position: refs/heads/main@{#931923} NOKEYCHECK=True GitOrigin-RevId: 1bc5bb4ba6c7922ad7dd57bd57cb5644c6efccd8
The
usearch_*
API has an internal character break iterator that it uses for handling combining characters or sequences.However, the internal break iterator isn't always strictly needed or used though -- for example when matching with pure ASCII text. Creating the break iterator is a somewhat non-trivial operation though...
I did some profiling using ICU 64 on Windows with a test program that called the usearch APIs in a loop (calling
usearch_open
, thenusearch_first
), and on average, anywhere from 7-10% of the time was spent on just creating the break iterator (which wasn't actually ever used).By lazily creating the break iterator when needed we can improve the performance of the
usearch_*
API for cases where it isn't used.Additionally, I noticed that there were several places in the code that performed memory allocation(s) without any error checking for OOM. This PR also adds/improves the error checking as well.
Note: This change is ported from some performance changes that I made a few months ago in the branch here.
This PR is for this commit: microsoft/icu@5e4e9b5
I thought that splitting up the commits into separate PRs would be easier to review the changes.
Link to other PRs with perf changes: #1471 #1472
I made these changes while investigating some performance issues for .NET Core. The JIRA ticket has some more info on the observed performance improvements seen when running some of the .NET Core benchmarks (with all of the changes).
[cc: @tarekgh as FYI.]
Checklist