Searching for Arabic words issue - #4784 #5099
base: master
Conversation
I don't understand this question: why do you need to normalize before you start searching? It appears that the normalization is strictly 'expanding' (never shortens the length of a string), so that's good. The code looks good, except that you're turning off case- and Unicode-normalization when there's Arabic in the string. The linter appears to be complaining about trailing whitespace.
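As a side note (not part of the PR), the "strictly expanding" property the reviewer mentions is easy to check for the standard Unicode decompositions: `normalize("NFD")` can only add code units, never remove them. A minimal sketch:

```javascript
// Quick check that canonical decomposition (NFD) never shortens a string:
// composed characters split into base + combining marks.
const samples = ["\u00e9" /* é */, "\u0626" /* ئ */, "abc"];
for (const s of samples) {
  const nfd = s.normalize("NFD");
  console.log(JSON.stringify(s), s.length, "->", nfd.length);
  // nfd.length >= s.length always holds here
}
```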
What I have in mind is normalizing both the search input and the whole doc, to put them into the same character shaping before searching starts, so that the search can detect all matches. As an example, what I imagine solving that is ... but when I applied it on noFold and doFold only, this didn't solve the problem: the normalization is done, but not all matches are detected.
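The idea being described can be sketched like this. Note that `normalizeArabic` and `findAll` here are hypothetical placeholders for illustration, not the PR's actual code: the same normalization runs over both the query and the searched text, and because it maps characters 1:1 it keeps string indices aligned.

```javascript
// Hypothetical sketch: apply the SAME normalization to query and text.
// A 1:1 character mapping preserves length, so match indices stay valid.
function normalizeArabic(str) {
  return str.replace(/[أإآ]/g, "ا").replace(/ى/g, "ي"); // placeholder mapping
}

function findAll(docLine, query) {
  const line = normalizeArabic(docLine);
  const q = normalizeArabic(query);
  const matches = [];
  for (let i = line.indexOf(q); i !== -1; i = line.indexOf(q, i + 1))
    matches.push(i);
  return matches;
}

console.log(findAll("إسلام اسلام", "اسلام")); // both spellings match
```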
You mean for the purpose of highlighting them? That is done in ... Can we start by landing this change just for the search cursor, and then look into the highlighting issue separately?
Sure... no problem with that.
Sorry, I didn't get what you mean here. Do you mean that you prefer characters instead of their Unicode escapes, like that?
In this line (and the one after it)...
...
Oooh... I got what you mean, so I think there is no need to check isArabic before applying the normalization.
Will that be okay?
Since ...
addon/search/searchcursor.js (Outdated)

```js
case 'ٸ':
  return 'ي'
case 'ئ':
  return 'ي ء'
```
Is this space in the returned string intentional?
Also, `.normalize("NFD")` already separates this character into `\u064a` and `\u0654`, which seems similar to what you're doing, and might already cover this?
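For reference, the decomposition the reviewer describes is the standard Unicode canonical decomposition of this character, and it round-trips back under NFC:

```javascript
// U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE canonically decomposes under
// NFD into U+064A (YEH) followed by U+0654 (HAMZA ABOVE).
const composed = "\u0626";
const decomposed = composed.normalize("NFD");
console.log([...decomposed].map(c => c.codePointAt(0).toString(16))); // 64a, 654

// And NFC composes the pair back into the single character:
console.log("\u064a\u0654".normalize("NFC") === composed); // true
```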
Thanks for the comment... yes, you are right. It seems that when I converted each Unicode escape into its character, I added a space by mistake... sorry for that.
About `normalize("NFD")`: I actually tested it, but it didn't work. But for sure, if it works when you test it, I can remove this case to reduce checks.
Is there a standard or document that your normalizations are based on? Whether something works or doesn't work depends on what you expect it to do, and if we're going to implement a normalization, I'd prefer for it to be based on some standard, or at least a widespread convention.
Does the problem you are trying to solve also occur with diacritics encoded as separate characters (for example ...)? But, I guess, that is not the entire solution. Is there some Unicode concept that covers the rest?
No, my problem is limited to some characters that are considered equal but have different Unicode code points.
I asked around about whether these characters should be considered equal, and opinions differ (see this and this). I'd like to add some kind of generalized stripping of extending characters (so that 'e' matches 'é' and 'ٵ' matches 'ا'), which would address some of the equivalences in this pull request. For things like 'ي' and 'ى', which are similar but not technically equivalent, I prefer not to have built-in normalization (since the set of these characters, across the various scripts, is huge, and there doesn't appear to be a standard that defines how to do this generally). I am open to having an extension mechanism through which user code can add additional normalizations, which you could use to add the ones not covered by the extending-character-removal. Does that sound reasonable? (It might be a while until I get around to implementing all this, unfortunately.)
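One common way to implement the "stripping of extending characters" mentioned above is to decompose with NFD and then drop combining marks. This is only a sketch of that general technique, not necessarily what was eventually implemented:

```javascript
// Decompose, then remove combining marks (Unicode general category Mark),
// so 'é' reduces to 'e' and 'ئ' reduces to bare yeh 'ي'.
function stripExtending(str) {
  return str.normalize("NFD").replace(/\p{Mark}/gu, "");
}

console.log(stripExtending("r\u00e9sum\u00e9")); // "resume"
console.log(stripExtending("\u0626") === "\u064a"); // true
```

Note this only covers characters with a canonical decomposition; pairs like 'ي' and 'ى' are distinct base letters and would still need an explicit mapping, which is exactly why an extension hook is proposed.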
Hello @marijnh, this PR fixes #4784.
At this commit I added two functions: one to normalize Arabic strings, and the other, isArabic, to check for Arabic.
I am expecting to apply the normalization to the search input and the doc itself. Could you help me figure out where exactly I can apply it, to normalize the search input and the whole doc before searching starts?