Searching for Arabic words issue - #4784 #5099
base: master
Conversation
I don't understand this question: why do you need to normalize before you start searching? It appears that the normalization is strictly 'expanding' (never shortens the length of a string), so that's good. The code looks good, except that you're turning off case- and Unicode-normalization when there's Arabic in the string. The linter appears to be complaining about trailing whitespace.
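As a side note (not part of the PR), the "strictly expanding" property the reviewer mentions is easy to check for the standard Unicode decompositions: `normalize("NFD")` can only add code units, never remove them. A minimal sketch:

```javascript
// Quick check that canonical decomposition (NFD) never shortens a string:
// composed characters split into base + combining marks.
const samples = ["\u00e9" /* é */, "\u0626" /* ئ */, "abc"];
for (const s of samples) {
  const nfd = s.normalize("NFD");
  console.log(JSON.stringify(s), s.length, "->", nfd.length);
  // nfd.length >= s.length always holds here
}
```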
What I have in mind is normalizing both the search input and the whole doc, to put them into the same character shaping before searching starts, so that the search can detect all matches. As an example, what I imagine solving that is ... but when I applied it on noFold and doFold only, this didn't solve the problem: the normalization is done, but not all matches are detected.
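The idea being described can be sketched like this. Note that `normalizeArabic` and `findAll` here are hypothetical placeholders for illustration, not the PR's actual code: the same normalization runs over both the query and the searched text, and because it maps characters 1:1 it keeps string indices aligned.

```javascript
// Hypothetical sketch: apply the SAME normalization to query and text.
// A 1:1 character mapping preserves length, so match indices stay valid.
function normalizeArabic(str) {
  return str.replace(/[أإآ]/g, "ا").replace(/ى/g, "ي"); // placeholder mapping
}

function findAll(docLine, query) {
  const line = normalizeArabic(docLine);
  const q = normalizeArabic(query);
  const matches = [];
  for (let i = line.indexOf(q); i !== -1; i = line.indexOf(q, i + 1))
    matches.push(i);
  return matches;
}

console.log(findAll("إسلام اسلام", "اسلام")); // both spellings match
```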
You mean for the purpose of highlighting them? That is done in ... Can we start by landing this change just for the search cursor, and then look into the highlighting issue separately?
Sure... no problem with that.
Sorry, I didn't get what you mean here. Do you mean that you prefer characters instead of their Unicode escapes, like that?
In this line (and the one after it)...
...
Oooh... I got what you mean, so I think there is no need to check isArabic before applying the normalization.
Will that be okay?
Since ...
addon/search/searchcursor.js (Outdated)

```js
case 'ٸ':
  return 'ي'
case 'ئ':
  return 'ي ء'
```
Is this space in the returned string intentional?
Also, `.normalize("NFD")` already separates this character into `\u064a` and `\u0654`, which seems similar to what you're doing, and might already cover this?
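For reference, the decomposition the reviewer describes is the standard Unicode canonical decomposition of this character, and it round-trips back under NFC:

```javascript
// U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE canonically decomposes under
// NFD into U+064A (YEH) followed by U+0654 (HAMZA ABOVE).
const composed = "\u0626";
const decomposed = composed.normalize("NFD");
console.log([...decomposed].map(c => c.codePointAt(0).toString(16))); // 64a, 654

// And NFC composes the pair back into the single character:
console.log("\u064a\u0654".normalize("NFC") === composed); // true
```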
Thanks for the comment... yes, you are right. It seems that when I converted each Unicode escape into its character, I added a space by mistake... sorry for that.
About `normalize("NFD")`: I actually tested it, but it didn't work. But for sure, if it works when you test it, I can remove this case to reduce checks.
Is there a standard or document that your normalizations are based on? Whether something works or doesn't work depends on what you expect it to do, and if we're going to implement a normalization, I'd prefer for it to be based on some standard, or at least a widespread convention.
Does the problem you are trying to solve also occur with diacritics encoded as separate characters (for example ...)? But, I guess, that is not the entire solution. Is there some Unicode concept that covers the rest?
No, my problem is limited to some characters that are considered equal but have different Unicode code points.
I asked around about whether these characters should be considered equal, and opinions differ (see this and this). I'd like to add some kind of generalized stripping of extending characters (so that 'e' matches 'é' and 'ٵ' matches 'ا'), which would address some of the equivalences in this pull request. For things like 'ي' and 'ى', which are similar but not technically equivalent, I prefer not to have built-in normalization (since the set of these characters, across the various scripts, is huge, and there doesn't appear to be a standard that defines how to do this generally). I am open to having an extension mechanism through which user code can add additional normalizations, which you could use to add the ones not covered by the extending-character-removal. Does that sound reasonable? (It might be a while until I get around to implementing all this, unfortunately.)
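One common way to implement the "stripping of extending characters" mentioned above is to decompose with NFD and then drop combining marks. This is only a sketch of that general technique, not necessarily what was eventually implemented:

```javascript
// Decompose, then remove combining marks (Unicode general category Mark),
// so 'é' reduces to 'e' and 'ئ' reduces to bare yeh 'ي'.
function stripExtending(str) {
  return str.normalize("NFD").replace(/\p{Mark}/gu, "");
}

console.log(stripExtending("r\u00e9sum\u00e9")); // "resume"
console.log(stripExtending("\u0626") === "\u064a"); // true
```

Note this only covers characters with a canonical decomposition; pairs like 'ي' and 'ى' are distinct base letters and would still need an explicit mapping, which is exactly why an extension hook is proposed.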
Hello @marijnh, this PR fixes #4784.
At this commit I added two functions: one to normalize Arabic strings, and the other, isArabic, to check for Arabic.
I am expecting to apply the normalization to the search input and the doc itself. Could you help me figure out where exactly I can apply it, to normalize the search input and the whole doc before searching starts?