Name translation improvements #86

msbarry · 2022-02-24T13:01:37Z

LanguageUtils is a straight port from openmaptiles logic, but there are a few issues with it. Please add a comment with any suggestions for improving the logic to assign element names, picking latin/nonlatin names, transliterating, etc. along with some example test cases to illustrate the desired behavior.

1ec5 · 2022-02-24T15:52:36Z

First of all, in fairness, I appreciate that this is a faithful port of the OpenMapTiles logic. However, the logic, which originally came from the mapnik-german-l10n package, makes some questionable decisions from an internationalization standpoint.

It’s unclear to me whether the name:latin field is intended to be a sanitized string of the sort that’s used for file names and URL slugs, or whether it’s intended to be presented to the user as a somehow more legible version of name for speakers of languages written in a Latin alphabet. If the latter, the regular expression should more closely match the set of characters supported by clients (such as Mapbox GL JS or MapLibre GL JS) and the fonts that are most likely to be used with them.

The LETTER regular expression explicitly matches the Basic Latin, Latin-1 Supplement, Latin Extended-A, and Latin Extended-B blocks of Unicode, but it excludes Latin Extended Additional (among other Latin blocks):

planetiler/planetiler-basemap/src/main/java/com/onthegomap/planetiler/basemap/util/LanguageUtils.java

Line 58 in 0d727c4

    private static final Pattern LETTER = Pattern.compile("[A-Za-zÀ-ÖØ-öø-ÿĀ-ɏ]+");

This presents a problem for Vietnamese, whose alphabet is distributed across those four blocks as well as Latin Extended Additional. Effectively, any Vietnamese word that has a non-level tone gets devowelized, which looks broken to the user:

“Hiệp Phước” becomes “Hip Phc”, and “Tôn Đức Thắng” becomes “Tôn Đc Thng”.

Granted, “Vietnamese” is separate from “Roman” in terms of the scripts that TrueType fonts declare support for, but most modern fonts do support Vietnamese, including the OpenMapTiles fonts. A more robust filter would be the set of characters whose glyphs are included in these fonts.¹ Alternatively, if such tight coupling to the OpenMapTiles fonts is not desired, ^\p{IsLatin}+$ would be a simple but effective replacement for the current containsOnlyLatinCharacters() method, and \p{IsLetter} could replace LETTER.
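To make the difference concrete, here is a minimal Java sketch comparing the quoted LETTER ranges with the script-based check on single words. The class and method names are hypothetical and not part of LanguageUtils. Note that `\p{IsLatin}` matches neither whitespace nor combining marks, so this version only makes sense for individual, precomposed words.

```java
import java.util.regex.Pattern;

class LatinScriptCheck {
  // The LETTER ranges quoted above, written with escapes instead of literals.
  static final Pattern LEGACY_LETTER =
      Pattern.compile("[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u024F]+");

  // Suggested replacement: one or more characters in the Latin script.
  static final Pattern LATIN_ONLY = Pattern.compile("^\\p{IsLatin}+$");

  // True if the whole word falls inside the four hard-coded blocks.
  static boolean legacyAccepts(String word) {
    return LEGACY_LETTER.matcher(word).matches();
  }

  // True if every character of the word belongs to the Latin script.
  static boolean latinScriptAccepts(String word) {
    return LATIN_ONLY.matcher(word).matches();
  }

  public static void main(String[] args) {
    // "Hiep" spelled with U+1EC7, which lives in Latin Extended Additional.
    String hiep = "Hi\u1EC7p";
    System.out.println(legacyAccepts(hiep));      // false
    System.out.println(latinScriptAccepts(hiep)); // true
  }
}
```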

If a machine-readable ASCII name is desired, instead of simply removing anything that isn’t a letter, it would be better to perform diacritic folding using the Latin-ASCII ICU transform or some other diacritic folding library. Thus, “Hiệp Phước” would become “Hiep Phuoc”.
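As a rough illustration of diacritic folding, the sketch below uses only the JDK: it decomposes the string with java.text.Normalizer and drops the combining marks. This is a stand-in for ICU's Latin-ASCII transform, not a substitute for it, since it handles only letters that have a canonical decomposition plus a hand-written mapping for Đ/đ. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class DiacriticFolding {
  // Approximate ASCII folding for Latin text; not equivalent to ICU Latin-ASCII.
  static String foldToAscii(String name) {
    if (name == null) {
      return null;
    }
    // Split precomposed letters into a base letter plus combining marks.
    String decomposed = Normalizer.normalize(name, Normalizer.Form.NFD);
    // Remove the combining marks (Unicode general category M).
    String stripped = decomposed.replaceAll("\\p{M}+", "");
    // D with stroke has no canonical decomposition, so map it by hand.
    return stripped.replace('\u0110', 'D').replace('\u0111', 'd');
  }

  public static void main(String[] args) {
    System.out.println(foldToAscii("Hi\u1EC7p Ph\u01B0\u1EDBc"));        // Hiep Phuoc
    System.out.println(foldToAscii("T\u00F4n \u0110\u1EE9c Th\u1EAFng")); // Ton Duc Thang
  }
}
```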

¹ Tangentially, it might be useful for name to hide characters in scripts that MapLibre GL JS has a particular problem laying out even with the RTL plugin installed; see mapbox/mapbox-gl-js#4009 (Full "complex text" support: indic scripts, ligatures, kerning, etc.).

wipfli · 2022-02-24T16:02:47Z

> Ardent defender of diacritics everywhere

You stand true to this @1ec5 :)

1ec5 · 2022-02-24T16:33:44Z

When the name contains no “Latin” characters, the code transliterates the text into Latin text using ICU’s Any-Latin transliterator:

planetiler/planetiler-core/src/main/java/com/onthegomap/planetiler/util/Translations.java

Lines 122 to 133 in 0d727c4

    private static final Transliterator TO_LATIN_TRANSLITERATOR = Transliterator.getInstance("Any-Latin");

    /**
     * Attempts to translate non-latin characters to latin characters that preserve the <em>sound</em> of the word (as
     * opposed to translation which attempts to preserve meaning) using ICU4j.
     * <p>
     * NOTE: This can be expensive and transliteration is synchronized deep down in ICU4j internals which limits benefit
     * of running in multiple threads, so exhaust all other options first.
     */
    public static String transliterate(String input) {
      return input == null ? null : TO_LATIN_TRANSLITERATOR.transform(input);
    }

This transliterator can be rough coming from some languages, because different schemes are often used depending on the source language, region, and use case. For example:

Cyrillic text is transliterated to Latin according to the ISO 9 standard, which is less biased toward a particular language but still differs from the more common transliteration schemes used in each language or country, as seen here in these Ukrainian park names.

Chinese text appears to be transliterated to Latin using Hanyu Pinyin. A map intended for lay readers would remove diacritics and spaces between syllables of compound words from these pinyin transliterations. Additionally, in Taiwan, the use of Hanyu Pinyin versus Tongyong Pinyin is a partisan and regional matter. (These names all end in words like “Ecological Protection Area” that would ideally be translated rather than transliterated.)

ICU has somewhat more reliable transliterators that require you to know the source language. As in #14 (comment), I think knowing the country the feature is in would be a good first step to improving the quality of these transliterations. At a minimum, detecting the source script would allow you to use script-specific transliterators and apply script-specific adjustments, like removing the diacritics from pinyin.
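The script-detection step could look roughly like the sketch below, which uses the JDK's Character.UnicodeScript to find the dominant script of a name and then picks an ICU transform ID for it. The class and method names are hypothetical, and the IDs returned ("Cyrillic-Latin", "Han-Latin; Latin-ASCII", "Any-Latin") are assumed to be available in whichever ICU4J version is in use. Choosing a language-specific scheme would still need the country or language of the feature, which this sketch does not attempt.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

class ScriptDetection {
  // Returns the most frequent script in the text, ignoring characters such as
  // spaces, digits, punctuation, and combining marks that carry no script.
  static UnicodeScript dominantScript(String text) {
    Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
    text.codePoints().forEach(cp -> {
      UnicodeScript script = UnicodeScript.of(cp);
      if (script != UnicodeScript.COMMON
          && script != UnicodeScript.INHERITED
          && script != UnicodeScript.UNKNOWN) {
        counts.merge(script, 1, Integer::sum);
      }
    });
    return counts.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey)
        .orElse(UnicodeScript.COMMON);
  }

  // Maps a script to an ICU transform ID (assumed IDs, see the note above).
  static String transformIdFor(UnicodeScript script) {
    switch (script) {
      case CYRILLIC:
        return "Cyrillic-Latin";
      case HAN:
        // Chain Latin-ASCII to strip the pinyin tone marks.
        return "Han-Latin; Latin-ASCII";
      default:
        return "Any-Latin";
    }
  }

  public static void main(String[] args) {
    String name = "\u5317\u4EAC"; // two Han characters
    System.out.println(transformIdFor(dominantScript(name)));
  }
}
```

The returned ID would then be passed to Transliterator.getInstance in place of the single Any-Latin instance.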

msbarry · 2022-02-25T02:43:56Z

Thanks for the incredible detail and pointers @1ec5! The usual use case I see for name:latin and name:nonlatin is to provide dual labels when the local name is non-Latin; for example, check out the style demos on https://stadiamaps.com/.

On translation/transliteration, I think the preferred solution is for OSM elements to have a latin translation (name:en, name:de, int_name, etc...) - in that case, we won't attempt to transliterate at all. It's just in the case where no latin variation of the name exists that we need to infer it somehow.

1ec5 · 2022-02-25T20:01:14Z

Thanks, that makes sense. The GL style specification even supports rich text labels, so the second line of these bilingual labels can be formatted differently. Unfortunately, explicit translations are much less likely (and not universally accepted in OSM) for more obscure but common features like street names, so there’s still a high potential for name:latin to show up even if name:en, int_name, etc. are preferred over it.

1ec5 · 2022-02-25T20:34:14Z

Courtesy of @jleedev in OSMUS Slack, Wikidata is making an amusing cameo appearance in some places:

The ways in question are exquisitely typeset with en dashes, which the Latin detection code apparently regards as non-Latin, so it falls back to whichever name:* it can find that contains only Latin characters. It just happens that these ways are also tagged with name:etymology:wikidata, which is guaranteed to be set to an ASCII-only value. Some name:* subkeys don’t identify a language but instead refine the name somehow. Other common examples include name:signed, name:prefix, and name:pronunciation (which is “non-Latin” IPA anyways).

While it may seem contrived to put proper typographical characters on street names, the lack of support for them combined with overeager use of name subkeys can affect other features as well. For example, this POI in Germany is named with German quotation marks, lacks name:en or name:de, and could conceivably be tagged with name:etymology:wikidata=Q217964.

Fortunately, the same regular expression syntax in #86 (comment) can avoid these mishaps with some additional character classes: ^[\p{IsLatin}\p{IsPunctuation}]+$. A lot can be done by combining character classes. If the goal is merely to filter out linguistic content that’s in a different writing system, filtering on the general category and script should do the trick:

^[\P{IsLetter}[\p{IsLetter}&&\p{IsLatin}]]+$

This matches anything that isn’t a “letter” in the Unicode character database (i.e., not a letter, ideograph, or modifier letter), as well as anything that is a letter in the Latin script.
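A small Java sketch of that check, using java.util.regex syntax as in the rest of this thread. The class and method names are hypothetical. Because the pattern lets every non-letter through, spaces, digits, en dashes, and typographic quotation marks no longer disqualify a name.

```java
import java.util.regex.Pattern;

class OnlyLatinLetters {
  // Any non-letter, or any letter that belongs to the Latin script.
  static final Pattern ONLY_LATIN =
      Pattern.compile("^[\\P{IsLetter}[\\p{IsLetter}&&\\p{IsLatin}]]+$");

  // True if the string is non-empty and every letter in it is a Latin letter.
  static boolean containsOnlyLatinLetters(String name) {
    return name != null && ONLY_LATIN.matcher(name).matches();
  }

  public static void main(String[] args) {
    // Latin letters around an en dash (U+2013).
    System.out.println(containsOnlyLatinLetters("A\u2013B Street")); // true
    // A Cyrillic word.
    System.out.println(containsOnlyLatinLetters("\u041F\u0430\u0440\u043A")); // false
  }
}
```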

msbarry · 2022-03-23T10:21:05Z

@1ec5 seems like there are a few issues going on here. The most urgent one sounds like the names that aren't meant to be names (wikidata QIDs) showing up as road labels. Do you think it would be a reasonable fix for that to just limit the name:<language> tags that get checked to the languages the profile is using? For example in openmaptiles:

https://github.com/openmaptiles/openmaptiles/blob/8693822d506076d1cbf0d777d40d3a0a12986ce6/openmaptiles.yaml#L30-L99

1ec5 · 2022-03-23T16:58:19Z

Yes, sorry, I thought you had intended this to be an omnibus ticket about localization issues. It would be cleaner to track them in separate issues in the future.

The most robust fix would be to limit the subkeys to those that would be valid BCP 47 codes, like xx and xx-YY and xx-YY-ZZZZ. But limiting it to the languages in the profile should work too.
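For illustration, a simplified key filter along those lines might look like the following. It is deliberately stricter than BCP 47 itself: the primary subtag is limited to two or three lowercase letters, because the full grammar also permits longer primary subtags, and subkeys such as "signed" or "prefix" would otherwise slip through. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class NameSubkeys {
  // name:xx, name:xx-YY, name:xx-YY-ZZZZ and similar; a simplification of BCP 47.
  static final Pattern LANGUAGE_NAME_KEY =
      Pattern.compile("^name:[a-z]{2,3}(-[A-Za-z0-9]{2,8})*$");

  // True if the key looks like a language-qualified name tag.
  static boolean isLanguageNameKey(String key) {
    return key != null && LANGUAGE_NAME_KEY.matcher(key).matches();
  }

  public static void main(String[] args) {
    System.out.println(isLanguageNameKey("name:sr-Latn"));            // true
    System.out.println(isLanguageNameKey("name:etymology:wikidata")); // false
    System.out.println(isLanguageNameKey("name:signed"));             // false
  }
}
```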

I think the revised non-Latin detection code would still be worth pursuing regardless. With just the smaller fix you’re suggesting, there will be cases where an alternative language’s name is arbitrarily chosen just because of a “non-Latin” character.

msbarry · 2022-03-24T00:19:22Z

No worries! I did intend this to be an omnibus ticket, just trying to see if I can extract isolated issues from it to work on. I'll give those a shot in #146

gebner · 2022-06-12T13:24:03Z

> Chinese text appears to be transliterated to Latin using Hanyu Pinyin. A map intended for lay readers would remove diacritics and spaces between syllables of compound words from these pinyin transliterations. Additionally, in Taiwan, the use of Hanyu Pinyin versus Tongyong Pinyin is a partisan and regional matter. (These names all end in words like “Ecological Protection Area” that would ideally be translated rather than transliterated.)

For Japanese names, the situation is even worse since the Pinyin romanization is not just controversial but incomprehensible. The following should have been transliterated as Mukogasaki Koen (or Mukogasaki Park, 向ヶ崎公園):

Romanizing Japanese is nontrivial though, and ICU doesn't support it AFAICT. We'd need to use a morphological analyzer, like kuromoji (which is far from perfect and would give "kōgasakikōen" in this example).

jleedev · 2022-06-12T13:54:38Z

There are also useless values in name_de and name_en for some reason.

1ec5 · 2022-07-02T22:06:06Z

> On translation/transliteration, I think the preferred solution is for OSM elements to have a latin translation (name:en, name:de, int_name, etc...) - in that case, we won't attempt to transliterate at all. It's just in the case where no latin variation of the name exists that we need to infer it somehow.

For many languages, there are keys such as name:ko-Latn and name:sr-Latn that allow mappers to choose the transliteration system most appropriate to a given language. Ideally the tiles would include those language-qualified property names, because name:latin is just as ambiguous as name.
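One way to carry those keys through, sketched in Java under the assumption that the element's tags are available as a plain string map: collect every name:xx-Latn tag and keep its language-qualified key instead of collapsing it into name:latin. The class and method names, and the sample values in the usage, are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class LatnNames {
  // Matches keys such as name:ko-Latn and name:sr-Latn.
  static final Pattern LATN_KEY = Pattern.compile("^name:[a-z]{2,3}-Latn$");

  // Returns the Latin-script name tags, keyed by their original tag names
  // and sorted by key so the result is deterministic.
  static Map<String, String> latinScriptNames(Map<String, String> tags) {
    Map<String, String> result = new TreeMap<>();
    tags.forEach((key, value) -> {
      if (LATN_KEY.matcher(key).matches()) {
        result.put(key, value);
      }
    });
    return result;
  }

  public static void main(String[] args) {
    // Illustrative tags only; the name value is two Hangul syllables.
    Map<String, String> tags = Map.of(
        "name", "\uC11C\uC6B8",
        "name:ko-Latn", "Seoul",
        "name:etymology:wikidata", "Q217964");
    System.out.println(latinScriptNames(tags)); // {name:ko-Latn=Seoul}
  }
}
```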

Wikidata also has a variety of properties to indicate the transliteration of a place name, though the long-term approach would be to look at the transliterations stored in lexicographical data.

j9d3it · 2024-12-02T23:44:58Z

Hi, just finding this issue. For place names in Japanese I've found this Python converter to be really great. https://github.com/polm/cutlet

msbarry added the bug label Mar 8, 2022

This was referenced Mar 24, 2022

[BUG] Limit latin name extraction to valid language codes with no non-latin characters #146

Closed

Improve name:latin logic #147

Merged

msbarry removed the bug label Mar 25, 2022

1ec5 mentioned this issue Jul 2, 2022

Gloss names of places in the local language osm-americana/openstreetmap-americana#471

Closed

This was referenced Nov 30, 2022

Gloss city names in the local language osm-americana/openstreetmap-americana#592

Merged

Nix name:latin fallback from English osm-americana/openstreetmap-americana#605

Merged

1ec5 mentioned this issue Nov 3, 2024

Some Japanese place names are transliterated to their alphabetical form using Chinese phonetic readings. hyperknot/openfreemap#24

Open
