Impossible to exclude "İnşaat Malları" as generic #8261

Closed
serhii-muchychka opened this issue Jun 11, 2023 · 5 comments

@serhii-muchychka
Collaborator

serhii-muchychka commented Jun 11, 2023

I tried to do this:
f4e5e1d

Steps:

  1. Added "^İnşaat Malları$" to the generics in the appropriate file.
  2. Deleted the "İnşaat Malları" brand entry.
  3. Ran npm run build the first time: OK (the script lowercases the letters in the new generic string).
  4. Ran npm run build a second time: the script re-added the brand entry.

It seems the problem is related to the letter "İ".
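
The lowercasing in step 3 can be reproduced in plain JavaScript. This is only a minimal sketch, assuming a Node.js runtime with ICU support for the locale-aware call; the variable names are illustrative and not taken from the NSI build scripts:

```javascript
// U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE ("İ").
const dottedCapitalI = '\u0130';

// The locale-independent toLowerCase() maps it to TWO code points:
// "i" (U+0069) followed by COMBINING DOT ABOVE (U+0307).
const lowered = dottedCapitalI.toLowerCase();
const codePoints = [...lowered].map(ch => ch.codePointAt(0).toString(16));
console.log(lowered.length, codePoints);

// Upper-casing the result gives "I" + U+0307, not the original U+0130,
// so the round trip does not restore the source string.
const roundTripped = lowered.toUpperCase();
console.log(roundTripped === dottedCapitalI);

// With an explicit Turkish locale the mapping is the single letter "i".
const turkishLowered = dottedCapitalI.toLocaleLowerCase('tr');
console.log(turkishLowered === 'i');
```

So after the first build the generic string no longer starts with the same character as the name it was meant to exclude.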

@bhousel
Member

bhousel commented Jun 12, 2023

Looks like another edge case, similar to what we had in #5017.
It's two years later, so I'll play around and see whether there are newer, better ways to do this.

@1ec5
Member

1ec5 commented Jun 12, 2023

Long-term, we shouldn’t do any manual diacritic-folding to compare strings, even with the help of the libraries in #5017 (comment). Especially since NSI tends to compare whole strings, String.prototype.localeCompare and Intl.Collator are a lot more robust. However, the behavior depends on the language you pass in. I guess individual entries would need to be able to specify the language the name is in, since OSM doesn’t do that?
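
To illustrate the locale dependence, here is a small sketch of a locale-aware comparison with `sensitivity: 'base'`. It only demonstrates the built-in API and is not something NSI does today; it assumes a runtime that ships English and Swedish collation data (for example a default Node.js build), and the sample strings are made up:

```javascript
// With sensitivity 'base', case and accents are ignored wherever the
// locale treats them as variants of the same base letter.
const english = new Intl.Collator('en', { sensitivity: 'base' });
const swedish = new Intl.Collator('sv', { sensitivity: 'base' });

// In English, "ä" is just an accented "a", so the whole strings compare equal.
const sameInEnglish = english.compare('Häagen-Dazs', 'haagen-dazs') === 0;

// In Swedish, "ä" is a separate letter of the alphabet, so they differ.
const sameInSwedish = swedish.compare('Häagen-Dazs', 'haagen-dazs') === 0;

console.log({ sameInEnglish, sameInSwedish });
```

The same pair of strings is equal or different depending only on the locale argument, which is why each entry would need to carry its language.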

@bhousel
Member

bhousel commented Jun 12, 2023

Ok I added some fixes that will keep the generic "İnşaat Malları" from sneaking back into the index.

This was tricky because you'd think that the case-insensitive regex flag /i would catch
both the upper- and lowercase variants of this, but it doesn't.

Then, I tried to match both variants with an exclude regex like '^(İ|i̇)nşaat malları$',
but toLowerCasing that regex in our file_tree writing code was changing the 'İ'.

So for now, our build scripts simply avoid calling toLowerCase on a string with an 'İ' in it.
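
The three observations above can be checked with a short sketch. It is not the actual NSI build code: `lowerCaseUnlessDottedI` is a hypothetical helper written only to illustrate the workaround, and escape sequences are used so that the two look-alike spellings stay distinct in the source:

```javascript
const DOTTED_I = '\u0130';      // "İ" as a single code point
const I_PLUS_DOT = 'i\u0307';   // "i" + COMBINING DOT ABOVE
const REST = 'nşaat Malları';   // remainder of the name

const original = DOTTED_I + REST;
const lowercased = original.toLowerCase();   // begins with I_PLUS_DOT

// 1. The /i flag alone does not bridge the two forms.
const plain = new RegExp(`^${original}$`, 'i');
console.log(plain.test(original), plain.test(lowercased));

// 2. Listing both forms explicitly matches either spelling.
const pattern = `^(${DOTTED_I}|${I_PLUS_DOT})${REST.toLowerCase()}$`;
const both = new RegExp(pattern, 'i');
console.log(both.test(original), both.test(lowercased));

// 3. Lower-casing the pattern text rewrites the "İ" alternative,
//    so the original spelling stops matching.
const damaged = new RegExp(pattern.toLowerCase(), 'i');
console.log(damaged.test(original));

// Hypothetical helper: leave strings containing "İ" untouched.
function lowerCaseUnlessDottedI(str) {
  return str.includes(DOTTED_I) ? str : str.toLowerCase();
}
console.log(lowerCaseUnlessDottedI(pattern) === pattern);
```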

@bhousel
Member

bhousel commented Jun 12, 2023

Long-term, we shouldn’t do any manual diacritic-folding to compare strings, even with the help of the libraries in #5017 (comment)

@1ec5 Can you say more what you mean by this? I kind of think we do need to continue to diacritic fold the strings?
We mostly do this to catch typos in the OSM tags that we're matching.

I guess our basic use case is: if someone creates something in OSM with name=Haagen Dazs, Rapid can suggest the tag name=Häagen-Dazs instead. I can't think of a situation where the two locally used names would differ only by a diacritic mark.
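
As a rough illustration of that use case, a folding step could look like the sketch below. This is not the simplify.js code used by NSI: `foldForMatching` is a hypothetical helper that relies only on built-in Unicode normalization:

```javascript
// Hypothetical helper: decompose, drop combining marks, lower-case,
// then strip everything that is not a letter or a digit.
function foldForMatching(name) {
  return name
    .normalize('NFD')
    .replace(/\p{M}/gu, '')
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]/gu, '');
}

const typed = foldForMatching('Haagen Dazs');
const canonical = foldForMatching('Häagen-Dazs');
console.log(typed, canonical, typed === canonical);
```

Both spellings fold to the same key, so the mistyped tag can be matched to the canonical name.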

@serhii-muchychka
Collaborator Author

It turns out that this is a known problem. There is a Wikipedia article about it; I'm leaving the link here in case anyone is interested:
https://en.wikipedia.org/wiki/Dotted_and_dotless_I_in_computing

bhousel added a commit to osmlab/osm-community-index that referenced this issue Jun 13, 2023
This issue has some more info.
osmlab/name-suggestion-index#8261

Don't know whether this letter is used by osm-community-index communities,
but we might as well all use the same simplify.js code