Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Searching for Arabic words issue - #4784 #5099

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Ashamandi
Copy link

Hello @marijnh, this PR to fix ##4784
at this commit I added two functions one for normalize Arabic strings and the other to check ifArabic

I am expecting to apply the normalization on the search input and the doc itself .. Could you help me figure out where exactly can I apply it to normalize the search input and the whole doc itself before start searching?

@marijnh
Copy link
Member

marijnh commented Nov 28, 2017

Could you help me figure out where exactly can I apply it to normalize the search input and the whole doc itself before start searching?

I don't understand this question—why do you need to normalize before you start searching?

It appears that the normalization is strickly 'expanding' (never shortens the length of a string) after all, so that's good. The code looks good except that you're turning off case- and unicode-normalization when there's Arabic in the string. The linter appears to be complaining about trailing whitespace.

@Ashamandi
Copy link
Author

I don't understand this question—why do you need to normalize before you start searching?

what I think about is normalizing both search input and the whole doc to put them into same char shaping before start searching so the searching can detect all matches

as an example :

searching

What I imagine solving that is
1- Normalize search input أيه to "ايه"
2- Normalize the whole doc so all its 4 words will be converted to "ايه" as well
3- Run searching function
the output will be as shown before.

but when I applied it on noFold and doFold only this didn`t solve the problem, the normalization done but not all matches detected

@marijnh
Copy link
Member

marijnh commented Dec 5, 2017

so the searching can detect all matches

You mean for the purpose of highlighting them? That is done in searchOverlay in addon/search/search.js, which currently doesn't really do any normalization beyond case, for which it is relying on the regexp i flag. To support normalization there, we'd have to use a different approach.

Can we start by landing this change just for the search cursor, and then look into the highlighting issue separately?

@Ashamandi
Copy link
Author

Ashamandi commented Dec 5, 2017

Can we start by landing this change just for the search cursor, and then look into the highlighting issue separately?

Sure ... no problem for that.

The code looks good except that you're turning off case- and unicode-normalization when there's Arabic in the string.

Sorry didn`t get what you mean here, Do you mean that you prefer chars instead of its unicodes like that

//Normalization for Arabic
  var normalizeArabicChars = function (s) {
    function filter(c) {
      switch (c) {
      // ALEF Chars
      case 'إ' :
      case 'أ' :
      case 'آ' :
      case 'ٵ' :
      case 'ٳ' :
      case 'ٲ' :
      case 'ٱ' :
        return 'ا'
           .
           .
      default :
        return c
    }

@marijnh
Copy link
Member

marijnh commented Dec 6, 2017

Sorry didn`t get what you mean here

In this line (and the one after it)...

doFold = function(str) { return isArabic(str)? normalizeArabicChars(str) : str.normalize("NFD").toLowerCase() }

... toLowerCase won't be called for a string like "XYايه", but it should (in addition to normalizeArarabicChars and .normalize)

@Ashamandi
Copy link
Author

ooooh ... I got what you mean, so I think there is no need to check isArabic before applying the normalization

if (String.prototype.normalize) {
    doFold = function(str) { return normalizeArabicChars(str.normalize("NFD").toLowerCase()) }
    noFold = function(str) { return normalizeArabicChars(str.normalize("NFD")) }

Will it be okey?

@marijnh
Copy link
Member

marijnh commented Dec 6, 2017

Since normalizeArabicChars looks somewhat expensive, I'd prefer if you kept the check. Maybe something like str = str.normalize("NFD").toLowerCase(); if (isArabic(str)) str = normalizeArabicChars(str); return str?

case 'ٸ' :
return 'ي'
case 'ئ':
return 'ي ء'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this space in the returned string intentional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, `.normalize("NFD") already separates this character into \u064a and \u0654, which seems similar to what you're doing, and might already cover this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for comment .... yes, you are right, it seems that when I converted each Unicode into its char I added a space by mistake .... sorry for that.
About normalize("NFD"), Actually I tested it but it didn`t work , but for sure if you test it, I can remove this case to reduce checks.

@marijnh
Copy link
Member

marijnh commented Dec 6, 2017

Is there a standard or document that your normalizations are based on? Whether something works or doesn't work depends on what you expect it to do—and if we're going to implement a normalization, I'd prefer for it to be based on some standard or at least widespread convention.

@marijnh
Copy link
Member

marijnh commented Dec 6, 2017

Does the problem you are trying to solve also occur with diacritics encoded as separate characters (for example \u0654 "hamza above")? If so, we probably need something more general than these replacements—filtering out a whole character categories, such as accents and diacritics. I notice a lot, but not all, of the characters in your code are already decomposed to a base character plus an extending character when you call .normalize("NDF") on them, so that could be part of the solution.

But, I guess, not the entire solution. Is there some Unicode concept that covers the rest?

@Ashamandi
Copy link
Author

Does the problem you are trying to solve also occur with diacritics encoded as separate characters (for example \u0654 "hamza above")?

No, my problem is limited to some chars that are considered equaled but has different Unicode chars
and these chars are AL-ALEF, AL-TAA AL-MARBOTA and AL-YAA.
and if we talk about Al-ALEF as an example: ا, أ , آ, إ ,.. we found that all the previous ALEF shapes should be detected when searching for any one of them.
in the same way when we take about TAAA MARBOTA and YAA different shapes.

@marijnh
Copy link
Member

marijnh commented Dec 13, 2017

I asked around about whether these characters should be considered equal, and opinions differ (see this and this).

I'd like to add some kind of generalized stripping of extending characters (so that 'e' matches 'é' and 'ٵ' matches 'ا'), which would address some of the equivalences in this pull request. For things like 'ي' and 'ى', which are similar but not technically equivalent, I prefer not to have built-in normalization (since the set of these characters, across the various scripts, is huge, and there doesn't appear to be a standard that defines how to do this generally).

I am open to having an extension mechanism through which user code can add additional normalizations, which you could use to add the ones not covered by the extending-character-removal. Does that sound reasonable?

(It might be a while until I get around to implementing all this, unfortunately.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants