.find() method to retrieve all the group of diacritics from a specific char #20

Open
DiegoZoracKy opened this issue Jun 7, 2016 · 7 comments

DiegoZoracKy commented Jun 7, 2016

Hi @andrewrk,

What do you think about this? Right now I'm facing a case where I need the group of all possible diacritics for a specific char. I remembered your great list of diacritics, and that your package is named 'diacritics' rather than something like 'remove-diacritics', so I thought it would be better to extend it with one more method instead of creating another package.

I already created the new method:

function findDiacritics(chr) {

  var diacriticsFound = replacementList.find( o => o.base == chr || o.chars.indexOf(chr) >= 0 );
  return (diacriticsFound)? diacriticsFound.base + diacriticsFound.chars : null;

}

If you think it is OK, I can send you a pull request.

@thejoshwolfe
Collaborator

what about just exporting replacementList? Then you can do this search in your code, or any other search you might want to do.
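For illustration, here is a minimal sketch of what that caller-side search could look like if the list were exported. The `replacementList` below is a short, hypothetical sample in the same `{ base, chars }` shape used in this thread, not the package's real data, and `findEntry` is a made-up helper name.

```javascript
// Abridged, hypothetical sample; the real list has many more entries.
const replacementList = [
  { base: 'a', chars: '\u00E0\u00E1\u00E2\u00E3\u00E4' }, // àáâãä
  { base: 'c', chars: '\u00E7\u0107\u010D' },             // çćč
  { base: 'ae', chars: '\u00E6\u01FD\u01E3' },            // æǽǣ
];

// Return the entry whose base equals chr or whose chars contain chr,
// or null when chr is not covered by the list.
function findEntry(chr) {
  return (
    replacementList.find(
      (entry) => entry.base === chr || entry.chars.includes(chr)
    ) || null
  );
}

console.log(findEntry('\u00E7').base); // c
console.log(findEntry('z'));           // null
```

Returning the whole entry rather than a concatenated string keeps `base` and `chars` separate, so the caller can still tell a multi-char base such as 'ae' apart from its diacritics.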

@DiegoZoracKy
Author

Exporting replacementList would be good too. But with just the list, other developers working on a similar case and I would each have to write this same code.

It's the same goal as the remove method: instead of just exposing the list, you created a method to help. So I thought it could be good to have this helper in the package. But it's OK if you don't agree. Do you think you will update it to export replacementList?

@thejoshwolfe
Collaborator

i'll have to defer to @andrewrk on this, but in my own opinion, i have to admit, i don't really understand what the function is supposed to be used for. In particular, you lose some information when you concatenate 'AE' with "\u00C6\u01FC\u01E2". What are you going to do with the "group of all possible diacritics" when you get it? If I were going to write documentation for this function, I'd be at a loss to describe what it really does without just describing the code.

Can you give more information on the use case for this function?

@DiegoZoracKy
Author

To make a diacritic-insensitive RegExp. Example: I have a text which contains the word 'ação'. Assume we are handling some kind of search engine, where the input could be written correctly as 'ação', but could also contain a typo like 'açao', 'acão', etc.

With the group of diacritics I can easily create a RegExp like: /a[ccćĉċčçḉƈȼↄ][aaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ]o/i

@thejoshwolfe
Collaborator

thejoshwolfe commented Jun 7, 2016

did you mean /[aaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][ccćĉċčçḉƈȼↄ][aaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][oⓞoòóôồốỗổõṍȭṏōṑṓŏȯȱöȫỏőǒȍȏơờớỡởợọộǫǭøǿꝋꝍɵɔᴑ]/i? It looks like the function is prepared to look up simple ascii characters as well (o.base == chr).

isn't there a problem with multi-char diacritics like 'Æ'? Wouldn't the regex for "Cæsar" fail to match against the string "Caesar"?

@thejoshwolfe
Collaborator

thejoshwolfe commented Jun 7, 2016

how about this function:

function charToRegexPattern(chr) {
  for (var i = 0; i < replacementList.length; i++) {
    var replacement = replacementList[i];
    if (replacement.chars.indexOf(chr) === -1) continue;
    if (replacement.base.length > 1) {
      // allow the complete multi-char sequence or a literal diacritic character
      return '(?:' + replacement.base + '|[' + replacement.chars + '])';
    } else {
      // allow the ascii char or a literal diacritic character
      return '[' + replacement.base + replacement.chars + ']';
    }
  }
  // either already ascii or not a diacritic char
  return chr;
}

It's arguably less "general purpose", since it returns strings formatted for regex, but i think it's the only way to make it actually work for multi-char sequences, like "ae".
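The difference between the two pattern shapes can be checked directly. This is a small sketch with hand-written patterns for the 'ae' entry only (assuming its diacritics are æ, ǽ and ǣ); the names `naive` and `grouped` are just for illustration.

```javascript
// Character class built by concatenating the base 'ae' with its diacritics.
// A class matches exactly one character, so it cannot match the pair "ae".
const naive = new RegExp('^C[ae\u00E6\u01FD\u01E3]sar$', 'i');

// Alternation that keeps the two-letter base intact as one alternative.
const grouped = new RegExp('^C(?:ae|[\u00E6\u01FD\u01E3])sar$', 'i');

console.log(naive.test('C\u00E6sar'));   // true
console.log(naive.test('Caesar'));       // false
console.log(grouped.test('C\u00E6sar')); // true
console.log(grouped.test('Caesar'));     // true
```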

@DiegoZoracKy
Author

Yes @thejoshwolfe, I meant exactly what you wrote in the first RegExp. I just kept it short to give you a simple example.

I would use the version I wrote in a case like this:

function toRegExp(str){
    return RegExp(str.split('').map(chr => `[${diacritics.find(chr) || chr}]`).join(''), 'gi');
}

let str = 'acaoae1ae';
let strDiacritic = 'açãoae1æ';

// RegExp will be: /[aⓐaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][ccⓒćĉċčçḉƈȼꜿↄ][aⓐaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][oⓞoòóôồốỗổõṍȭṏōṑṓŏȯȱöȫỏőǒȍȏơờớỡởợọộǫǭøǿꝋꝍɵɔᴑ][aⓐaẚàáâầấẫẩãāăằắẵẳȧǡäǟảåǻǎȁȃạậặḁąⱥɐɑ][eⓔeèéêềếễểẽēḕḗĕėëẻěȅȇẹệȩḝęḙḛɇǝ][1][aeæǽǣ]/gi
// And "str" will match the RegExp built from "strDiacritic"
str.match(toRegExp(strDiacritic))

Note that the expected input can be a diacritic or a base char, while your charToRegexPattern expects only a diacritic. A base char would never be "expanded", so it won't work in my example, where the input 'acao' should match 'ação': there would be no way to know the possible diacritics for a base char.

And yes, this version doesn't handle diacritics whose base has length > 1.
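One possible way to reconcile the two versions is sketched below: it expands base chars as well as diacritics, and uses the alternation form for multi-char bases. This is only an illustration under stated assumptions, not code from the package. The `replacementList` is a short hypothetical sample (lowercase entries only), and `escapeRegExp`, `entryToPattern` and this `toRegExp` are made-up helpers.

```javascript
// Abridged, hypothetical sample of the replacement list (lowercase only).
const replacementList = [
  { base: 'a', chars: '\u00E0\u00E1\u00E2\u00E3\u00E4' }, // àáâãä
  { base: 'c', chars: '\u00E7\u0107\u010D' },             // çćč
  { base: 'o', chars: '\u00F2\u00F3\u00F4\u00F5\u00F6' }, // òóôõö
  { base: 'ae', chars: '\u00E6\u01FD\u01E3' },            // æǽǣ
];

// Escape characters that are special inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Alternation for multi-char bases, plain character class otherwise.
function entryToPattern(entry) {
  return entry.base.length > 1
    ? '(?:' + entry.base + '|[' + entry.chars + '])'
    : '[' + entry.base + entry.chars + ']';
}

// Try longer bases first so "ae" in the input is treated as one unit.
const byBaseLength = [...replacementList].sort(
  (x, y) => y.base.length - x.base.length
);

function toRegExp(str) {
  let pattern = '';
  let i = 0;
  while (i < str.length) {
    // An entry applies if its base starts here or its chars contain this char.
    const entry = byBaseLength.find(
      (e) => str.startsWith(e.base, i) || e.chars.includes(str[i])
    );
    if (!entry) {
      pattern += escapeRegExp(str[i]);
      i += 1;
    } else {
      pattern += entryToPattern(entry);
      i += str.startsWith(entry.base, i) ? entry.base.length : 1;
    }
  }
  return new RegExp(pattern, 'i');
}

console.log(toRegExp('acaoae1ae').test('a\u00E7\u00E3oae1\u00E6')); // true
console.log(toRegExp('a\u00E7\u00E3oae1\u00E6').test('acaoae1ae')); // true
```

A trade-off of this sketch: because "ae" in the input is consumed as a single unit, an input like 'aerea' would not match 'aérea'. Whether that is acceptable depends on the search use case.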
