Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-21710 Remove BOYER_MOORE dead code from usearch.cpp #1800

Merged

Conversation

jefgen
Copy link
Member

@jefgen jefgen commented Aug 18, 2021

This is related to the changes in PR #1473.

In ICU 4.0 (see ticket ICU-5420), the ICU4C StringSearch implementation was changed to use a linear search algorithm, and no longer used the Boyer-Moore search algorithm.
In ICU 51.1 (see ticket ICU-9573), the Boyer-Moore string search implementation and test cases were retracted.
In ICU 53.1 (see ticket ICU-6288), the ICU4C StringSearch changes were ported to ICU4J, meaning that ICU4J now also used a linear search algorithm.

The Boyer-Moore search algorithm code in usearch.cpp hasn't actually compiled since the Collation V2 rewrite in ~2013/2014.
(The code behind the BOYER_MOORE #ifdef refers to non-existent functions and methods that have been changed, making it no longer compile if set to 1.)

This change completely removes the Boyer-Moore search algorithm code from the file usearch.cpp.

Note: The diff shown on GitHub is slightly confused. I didn't actually edit or change the code in the file -- I only made deletions/removals.

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21710
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@jefgen jefgen force-pushed the user/jefgen/remove-Boyer-Moore-from-usearch branch from 3d1f058 to 6f61c31 Compare August 18, 2021 20:41
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • docs/userguide/icu4j/faq.md is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu markusicu self-assigned this Aug 18, 2021
@markusicu markusicu requested a review from aheninger August 18, 2021 22:47
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice getting rid of the non-functional half of the .cpp file!

@@ -195,7 +195,9 @@ determine whether case and accents are ignored during a search.

#### What algorithm are you using to perform the search?

StringSearch uses a version of the Boyer-Moore search algorithm that has been
As of ICU 53, StringSearch uses a simple linear search algorithm which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it say ICU 53 here, but ICU 4.0 in usearch.h?
I suspect that ICU 4.0 is closer to when B-M was turned off-by-default for hard-to-fix problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this to have a different date for ICU4J, as it looks like the J changes occurred after the C changes.

Based on looking at the tickets and code history, it appears as though B-M was turned off in ICU4C first, and later ported to ICU4J a few releases later.

The ICU4C changes to turn off B-M were done using ticket ICU-5420, which was marked for 4.0. This also aligns with existing comments in the User Guide:

In ICU4C 4.0, the string search service was updated with the simple linear search algorithm, which locates a match by shifting a cursor in the target text one by one, and these issues were fixed. In ICU4C 4.0.1, the Boyer-Moore search code was reintroduced as a separated API set as a technology preview. In a later release, this code was deleted.

The removal of the additional B-M implementation was done in ICU 51.1 using ticket ICU-9573.

And then later in ICU 53.1 (using ticket ICU-6288), the ICU4C StringSearch changes were ported to ICU4J.

This also was noted on the download page for ICU 53 http://site.icu-project.org/download/53

ICU4J Specific Changes

  • StringSearch code re-implemented
    • Ported ICU4C implementation to ICU4J, replacing the previous implementation

So I thought it might be good to update the documentation to call out that the J changes occurred later on (in version 53).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking!
This is confusing if we just say "ICU 53 StringSearch", even in the Java FAQ.

Please change this here to say "As of ICU4J 53 / ICU4C 4.0, ..."
and then usearch.h to say "As of ICU4C 4.0 / ICU4J 53, ..."
and something similar in the User Guide.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
I'll force push an update with the changes in a moment.

@jefgen jefgen force-pushed the user/jefgen/remove-Boyer-Moore-from-usearch branch from 6f61c31 to a9af611 Compare August 19, 2021 00:04
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • docs/userguide/collation/examples.md is now changed in the branch
  • docs/userguide/collation/string-search.md is now changed in the branch
  • docs/userguide/icu4j/faq.md is different
  • icu4c/source/i18n/unicode/usearch.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jefgen jefgen force-pushed the user/jefgen/remove-Boyer-Moore-from-usearch branch from a9af611 to e860e1b Compare August 24, 2021 03:30
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • docs/userguide/collation/string-search.md is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm tnx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants