Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415

jjesusfilho · 2021-03-26T11:33:06Z

According to the Unicode Technical Standard #35, when case Level is set to On and strength set to primary, ICU should ignore accents but take case into account. This is not the behavior I find in stringi package. To achieve that, I have to set locale = "", which is not what I expected. Examples:

This is ok:

stringi::stri_detect_coll("Mário", "mario", strength = 1L)
[1] TRUE

This should return FALSE:

stringi::stri_detect_coll("Mário", "mario", strength = 1L,case_level = TRUE)
[1] TRUE

In order to achieve that I need to set locale to "" on stringi:

stringi::stri_detect_coll("Mário", "mario", strength = 1L, locale = "")
[1] FALSE

Here I respect case but don't add accent and get the expected result:

stringi::stri_detect_coll("Mário", "Mario", strength = 1L, locale = "")
[1] TRUE

The text was updated successfully, but these errors were encountered:

gagolews · 2021-03-28T22:41:11Z

Could you please paste the output of a call to stringi::stri_info() here? Thank you.

gagolews · 2021-03-28T22:45:47Z

I wonder if strength=2 is what you might need:

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1]  TRUE  TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1]  TRUE  TRUE FALSE FALSE

jjesusfilho · 2021-04-02T11:16:56Z

Could you please paste the output of a call to stringi::stri_info() here? Thank you.

$Unicode.version
[1] "10.0"

$ICU.version
[1] "61.1"

$Locale
$Locale$Language
[1] "pt"

$Locale$Country
[1] "BR"

$Locale$Variant
[1] ""

$Locale$Name
[1] "pt_BR"

$Charset.internal
[1] "UTF-8" "UTF-16"

$Charset.native
$Charset.native$Name.friendly
[1] "UTF-8"

$Charset.native$Name.ICU
[1] "UTF-8"

$Charset.native$Name.UTR22
[1] NA

$Charset.native$Name.IBM
[1] "ibm-1208"

$Charset.native$Name.WINDOWS
[1] "windows-65001"

$Charset.native$Name.JAVA
[1] "UTF-8"

$Charset.native$Name.IANA
[1] "UTF-8"

$Charset.native$Name.MIME
[1] "UTF-8"

$Charset.native$ASCII.subset
[1] TRUE

$Charset.native$Unicode.1to1
[1] NA

$Charset.native$CharSize.8bit
[1] FALSE

$Charset.native$CharSize.min
[1] 1

$Charset.native$CharSize.max
[1] 3

$ICU.system
[1] FALSE

$ICU.UTF8
[1] FALSE

jjesusfilho · 2021-04-02T11:30:26Z

I wonder if strength=2 is what you might need:

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE, locale="pt_BR")
## [1]  TRUE  TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L,case_level = TRUE)
## [1]  TRUE  TRUE FALSE FALSE

Actually, what I want is to ignore accents, but not case. According to the ICU documentation mentioned above, I should achieve that by setting strength to level 1 and case level to On. I also work with PostgreSQL and it works accordingly, but with stringi, I can only get the same result by setting locale to '"", regardless of case_level's setting.

stringi::stri_detect_coll(c("Mário","mário"), "mario", strength=1L,  locale = '')
[1] FALSE  TRUE

gagolews · 2021-04-07T04:19:42Z

First of all, thanks, there was a bug; locale="" should mean locale=NULL, i.e., your own locale, pt_BR.

gagolews · 2021-04-07T04:21:03Z

instead, locale='' meant locale="POSIX" - that is why it worked as expected (and perhaps this is what postgresql uses, hence the correct results). I would recommend setting locale="POSIX" explicitly then.

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 1L, locale="POSIX")
## [1] FALSE  TRUE FALSE  TRUE

jjesusfilho · 2021-04-14T22:52:50Z

Yes, but I still believe that it can get a bit confusing because, in order to ignore accents but not case, in stringi, you have to set the locale, but not the case_level, when this is not the way PostgreSQL works and probably other languages, i.e, you must have the case level turned on along with the primary strength. This is also what the ICU documentation says:

"The case level is independent from the strength of comparison. It is possible to have a collator set to 
 primary strength with the case level turned on. This provides for comparison that takes into account the 
 case differences, while at the same time ignoring accents and tertiary differences other than case"

gagolews · 2021-04-20T00:10:23Z

Hmmm.... interestingly, a collator-based string comparison honours the above rule...

> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=2L)
[1] FALSE  TRUE FALSE FALSE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
[1] FALSE  TRUE FALSE  TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=1L)
[1] TRUE TRUE TRUE TRUE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", strength=2L)
[1]  TRUE  TRUE FALSE FALSE

gagolews · 2021-04-20T01:33:07Z

I was trying hard to figure out why usearch returns a different result below, but with no success. A bug in ICU?

stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] TRUE TRUE TRUE TRUE
stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)
## [1] FALSE  TRUE FALSE  TRUE

gagolews · 2021-05-04T06:06:31Z

(note to self): ICU 69.1 gives the results as above. @todo: create a minimal reproducible example outside of stringi

gagolews · 2021-05-14T05:06:13Z

[note to self]

Yes, this is reproducible outside of stringi:

/* g++ -std=c++11 icu_test_bug_ucol_caselevel.cpp -licui18n -licuuc -licudata  && ./a.out */

#include <unicode/ustring.h>
#include <unicode/rbbi.h>
#include <unicode/coll.h>
#include <unicode/ucol.h>
#include <unicode/stsearch.h>
#include <cstdio>
using namespace icu;


#define WITH_CHECK_STATUS(f) \
    status = U_ZERO_ERROR; \
    f; \
    if (U_FAILURE(status)) {printf("error %s!\n", u_errorName(status));return 1;}


int main()
{
    const char* haystacks[] = {
        "mario", "Mario", "MARIO", "MÁRIO", "Mário", "mário", "dario"
    };

    const char* needle = "mario";


    UErrorCode status;
    UCollator* col;
    UStringSearch* matcher;
    int v;

    printf("U_ICU_VERSION=%s\n", U_ICU_VERSION);

    WITH_CHECK_STATUS(col = ucol_open("pt_BR", &status))

    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_STRENGTH, UCOL_PRIMARY, &status))
    WITH_CHECK_STATUS(ucol_setAttribute(col, UCOL_CASE_LEVEL, UCOL_ON, &status))

    UnicodeString _needle = UnicodeString::fromUTF8(needle);
    UnicodeString haystack = UnicodeString::fromUTF8("whatever");

    WITH_CHECK_STATUS(matcher = usearch_openFromCollator(
        _needle.getBuffer(),
        _needle.length(),
        haystack.getBuffer(),
        haystack.length(),
        col, NULL, &status
    ))

    for (int i=0; i<sizeof(haystacks)/sizeof(haystacks[0]); ++i) {
        printf("%s vs. %s: ", needle, haystacks[i]);
        haystack = UnicodeString::fromUTF8(haystacks[i]);

        WITH_CHECK_STATUS(
            usearch_setText(matcher, haystack.getBuffer(), haystack.length(), &status)
        )

        usearch_reset(matcher);

        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=usearch_first(matcher, &status)))
        printf(" usearch=%d", v);

        WITH_CHECK_STATUS(v = (int)UCOL_EQUAL==ucol_strcollUTF8(col,
            haystacks[i], -1, needle, -1, &status))
        printf(" ucol=%d", v);

        RuleBasedCollator* rbc = RuleBasedCollator::rbcFromUCollator(col);
        WITH_CHECK_STATUS(StringSearch matcher2(needle, haystack, rbc, NULL, status))
        WITH_CHECK_STATUS(v = ((int)USEARCH_DONE!=matcher2.first(status)))
        printf(" StringSearch=%d", v);

        printf("\n");
    }

    ucol_close(col);
    usearch_close(matcher);

    return 0;
}

yields:

U_ICU_VERSION=67.1
mario vs. mario:  usearch=1 ucol=1 StringSearch=1
mario vs. Mario:  usearch=1 ucol=0 StringSearch=1
mario vs. MARIO:  usearch=1 ucol=0 StringSearch=1
mario vs. MÁRIO:  usearch=1 ucol=0 StringSearch=1
mario vs. Mário:  usearch=1 ucol=0 StringSearch=1
mario vs. mário:  usearch=1 ucol=1 StringSearch=1
mario vs. dario:  usearch=0 ucol=0 StringSearch=0

gagolews · 2021-05-14T05:13:53Z

All right, it turns out that this issue has already been reported. It is ICU-related.

https://unicode-org.atlassian.net/browse/ICU-21338

jjesusfilho · 2021-05-14T11:19:09Z

Thank you very much for taking your time to work this out. I have followed your thread posts and tried to find the issue on ICU's bug page, but couldn't figure it out.

gagolews added a commit that referenced this issue Apr 7, 2021

bugfix #415

b413535

gagolews added the bug label May 5, 2021

gagolews added this to the stringi-1.7 milestone May 5, 2021

gagolews added the ICU-related label May 14, 2021

gagolews changed the title ~~Collation: case_level doesn't seem to work properly to ignore accents but take case into account~~ Collation: case_level doesn't seem to work properly to ignore accents but take case into account (UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) May 14, 2021

gagolews added a commit that referenced this issue May 14, 2021

#415 bug in ICU USearch reproducible example

0bc2ffc

gagolews modified the milestones: stringi-1.7, stringi-1.8 Jun 20, 2021

gagolews removed this from the stringi-1.8 milestone Nov 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415

jjesusfilho commented Mar 26, 2021 •

edited

Loading

gagolews commented Mar 28, 2021

gagolews commented Mar 28, 2021 •

edited

Loading

jjesusfilho commented Apr 2, 2021

jjesusfilho commented Apr 2, 2021 •

edited

Loading

gagolews commented Apr 7, 2021

gagolews commented Apr 7, 2021

jjesusfilho commented Apr 14, 2021 •

edited

Loading

gagolews commented Apr 20, 2021

gagolews commented Apr 20, 2021

gagolews commented May 4, 2021 •

edited

Loading

gagolews commented May 14, 2021

gagolews commented May 14, 2021

jjesusfilho commented May 14, 2021

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON) #415

Comments

jjesusfilho commented Mar 26, 2021 • edited Loading

gagolews commented Mar 28, 2021

gagolews commented Mar 28, 2021 • edited Loading

jjesusfilho commented Apr 2, 2021

jjesusfilho commented Apr 2, 2021 • edited Loading

gagolews commented Apr 7, 2021

gagolews commented Apr 7, 2021

jjesusfilho commented Apr 14, 2021 • edited Loading

gagolews commented Apr 20, 2021

gagolews commented Apr 20, 2021

gagolews commented May 4, 2021 • edited Loading

gagolews commented May 14, 2021

gagolews commented May 14, 2021

jjesusfilho commented May 14, 2021

jjesusfilho commented Mar 26, 2021 •

edited

Loading

gagolews commented Mar 28, 2021 •

edited

Loading

jjesusfilho commented Apr 2, 2021 •

edited

Loading

jjesusfilho commented Apr 14, 2021 •

edited

Loading

gagolews commented May 4, 2021 •

edited

Loading