What about Han unification? #2208

Open
sommerluk opened this issue Jul 2, 2016 · 26 comments

@sommerluk
Collaborator

sommerluk commented Jul 2, 2016

[This description is regularly updated to summarize the current state of discussion in the comments.]

What is the problem of Han unification for openstreetmap-carto?

The problem of Han unification is a general problem that is independent of any specific font!

  • Unicode encodes abstract characters (“meanings of signs”). It does not encode glyphs (“specific graphical representations of an abstract character”).

  • There are three Han scripts: the Chinese Han script, the Japanese Han script and the Korean Han script. From their initials, they are abbreviated as the “CJK scripts”.

  • A wide variety of abstract characters is shared between the CJK scripts.

  • Some characters have glyphs with the same appearance in all CJK scripts. Other characters have glyphs that differ between the CJK scripts.

  • Nevertheless, native speakers expect to see the glyphs they are used to (language-specific glyphs). Unlike the Unicode Consortium, they consider that a different glyph form also makes a difference in the meaning of the sign. To them, the other glyph forms feel like a foreign language.

  • Furthermore, Chinese Han has two different script variants: simplified (People’s Republic of China) and traditional (Hong Kong, Macao, Taiwan). So it’s not enough to know the language, but you also have to know the script variant.

  • Furthermore, even Traditional Chinese Han glyphs are usually rendered differently in three different regions (Hong Kong, Macao, Taiwan). So it’s not enough to know the language and the script variant, but you also have to know the target region.

  • It is not possible with plain Unicode to distinguish these forms. (The Ideographic Variation Database, IVD, does not help with Han unification.)

  • Good CJK fonts provide all these glyphs for all language variants. Via an OpenType feature (locl, localized forms), you can access the glyph variant that you need.

  • openstreetmap-carto already uses the Noto fonts, which do support all Han target languages, script variants and target regions (except Macao).

  • The problem is how to make the choice between all the available glyph forms.

  • Web pages solve the problem by using the HTML lang attribute. It contains an IETF language tag (BCP-47), which can provide information about language, script and region. Because the rendering engine then knows the target language, script and region, it can easily choose the appropriate default glyphs (see the sketch after this list).

  • openstreetmap-carto currently has no knowledge about the target language, target script or target region of the “name” value, and no region-specific rendering rules.

  • openstreetmap-carto’s policy, however, is to display text in the native language.
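
To illustrate what such a language tag conveys (and what openstreetmap-carto currently does not know for the name value), here is a rough Python sketch. The mapping below is purely illustrative; it is not an existing API of any of the tools involved.

```python
def han_variant(bcp47_tag):
    """Map a BCP-47 language tag (as found in an HTML lang attribute)
    to the Han glyph variant a renderer should pick."""
    tag = bcp47_tag.lower()
    if tag.startswith("ja"):
        return "Japanese glyph forms"
    if tag.startswith("ko"):
        return "Korean glyph forms"
    if tag.startswith("zh"):
        # Script subtag (Hans/Hant) and region subtag (CN, TW, HK, MO)
        # decide between Simplified and the regional Traditional styles.
        if "hans" in tag or tag.endswith("-cn"):
            return "Simplified Chinese glyph forms"
        if tag.endswith("-hk") or tag.endswith("-mo"):
            return "Traditional Chinese glyph forms (Hong Kong/Macao style)"
        if "hant" in tag or tag.endswith("-tw"):
            return "Traditional Chinese glyph forms (Taiwan style)"
        return "Chinese, but the script variant is unknown without a script or region subtag"
    return "default glyph forms"

print(han_variant("ja"))          # Japanese glyph forms
print(han_variant("zh-Hant-TW"))  # Traditional Chinese glyph forms (Taiwan style)
print(han_variant("zh"))          # script variant unknown
```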

Question: How to render CJK names and other text in the native language?

Does this problem also exist in other regions of the world?

Yes. For example, the Cyrillic script (Russian versus Bulgarian, Serbian and Macedonian letter forms), the letter “eng” and the Syriac script variants listed under “Current defaults” below have similar language- or region-dependent glyph preferences.

Technically, these problems are almost identical to the problem of Han unification.

Are there other problems that have the same technical base?

  • Arabic: Pakistanis seem to prefer a specific typographic style (Nastaliq) of the Arabic script, which is in widespread use in Pakistan, for example for the Urdu language. Almost everywhere in the world, several typographic styles are used simultaneously. (Think of serif fonts, sans-serif fonts and comic fonts for the Latin alphabet.) In Pakistan, however, there seems to be only one single typographic style in use. This is, however, a different question than the ones above: with Han unification, we are talking about often completely different letter forms sharing the same Unicode code point but having the same typographic style in our map. In Pakistan, we are talking about essentially similar letter forms being rendered in a different typographic style in our map (a style whose design is quite different from the sans-serif style we use all over the rest of the world). Noto itself provides various styles for Arabic, including a Nastaliq style, each of them in different font files.

What is the current situation at openstreetmap-carto?

If we default to Chinese glyph forms, then Japanese city names will also be rendered with Chinese glyph forms, and Japanese people will feel like it is a Chinese map. If we default to Japanese glyph forms, then Korean city names will also be rendered with Japanese glyph forms, and Korean people will feel like it is a Japanese map…

Current defaults:

  • CJK: Default is Japanese. (In Korea, the Hangul script seems more common than the Han script anyway, so we would not gain much by using Korean as the default. Between Japanese and Chinese, it seems that Japanese people are more sensitive to this issue, so we go for Japanese. However, this is subjective.)
  • Cyrillic: openstreetmap-carto defaults to whatever the Noto (LGC) default is.
  • The letter “eng”: openstreetmap-carto defaults to whatever the Noto (LGC) default is.
  • Syriac: Default is Eastern Syriac Variant. (We suppose there are more speakers of Eastern Syriac dialects than of Western Syriac dialects?)
  • Arabic: Noto Sans Arabic is used.

What is necessary for a better solution?

1. Knowledge about the target language/script/region of each label

  • Selection by static polygons around Japan, China, Taiwan, Korea, Hong Kong, Macao: SQL queries get messy when they are based on polygons. CartoCSS does not support multi-polygon selection. Finally, this selection would not be directly based on the OSM data either.
  • Selection by comparing name with name:ja, name:zh, … does not work. Example: the node http://www.openstreetmap.org/node/25248662 (English: Beijing) has name=北京市, name:ja=北京市 and name:zh=北京市. The values are identical, so we cannot reliably determine the language of the name value (see the sketch after this list).
  • Selection by information in the database. The clearest solution might be to have the language information for the name value in the OSM database itself, as a separate tag that specifies the language code of the language that was used in the name value. Furthermore, for Chinese rendering, information about the region (Taiwan, Hong Kong, Macao, People’s Republic of China) is also necessary.
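
The Beijing example shows why the comparison approach cannot work. A minimal sketch of the ambiguity (plain Python, tag values copied from the node above):

```python
# Tag values of http://www.openstreetmap.org/node/25248662 (Beijing), as above.
tags = {
    "name": "北京市",
    "name:ja": "北京市",
    "name:zh": "北京市",
}

# Collect every language whose name:* value is identical to the plain name.
matching_languages = [
    key.split(":", 1)[1]
    for key, value in tags.items()
    if key.startswith("name:") and value == tags["name"]
]

print(matching_languages)
# ['ja', 'zh'] -- both match, so this comparison cannot tell us
# which language the plain "name" value is written in.
```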

2. A way to get Mapnik to actually render the correct localized fonts

  • Mapnik does not currently allow us to control localization with OpenType’s locl feature. But we could make an individual font list for each of the four main CJK variants and choose the font list individually for each label (see the sketch after this list). This is possible because Noto provides us with four different versions of the CJK fonts: each of them contains the same set of glyphs, but each defaults to a different representation (Japanese, Traditional Chinese, Simplified Chinese, Korean). So we can work around the lack of locl support in Mapnik. However, this is only possible for the Noto CJK fonts, not for the Cyrillic script and the eng letter.
  • The better solution would obviously be support for locl language settings in Mapnik, and the Mapnik team is interested in implementing this for Mapnik 3.1. Furthermore, CartoCSS (our interface to Mapnik) would need to support this new Mapnik feature as well. This would also enable us to handle the Russian, Bulgarian, Serbian and Macedonian variants of the Cyrillic alphabet and the eng letter correctly (here locl is the only way to get this rendered correctly with Noto). It will not work for Noto Sans Arabic vs. the Nastaliq style, because these are two different font files.
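
The per-language font lists mentioned in the first point could look roughly like the following sketch. The list contents are an assumption for illustration (the real style would append its usual non-CJK fallback fonts); only the selection principle matters here.

```python
# One font list per CJK variant; each list is headed by the Noto Sans CJK
# version that defaults to the desired glyph representation.
CJK_FONT_LISTS = {
    "ja":      ["Noto Sans CJK JP Regular"],
    "ko":      ["Noto Sans CJK KR Regular"],
    "zh-Hans": ["Noto Sans CJK SC Regular"],
    "zh-Hant": ["Noto Sans CJK TC Regular"],
}

def font_list_for(label_language):
    """Pick the font list for a label; fall back to the current project
    default (Japanese glyph forms) when the language is unknown."""
    return CJK_FONT_LISTS.get(label_language, CJK_FONT_LISTS["ja"])

print(font_list_for("zh-Hant"))  # ['Noto Sans CJK TC Regular']
print(font_list_for(None))       # ['Noto Sans CJK JP Regular'] (default)
```
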
@clkao

clkao commented Jul 2, 2016

Is it possible to add some sort of region-specific selector (based on country or bbox) as a CartoCSS extension? I think that's the basic facility required for using different glyph forms in their respective regions when the preferred language is unspecified.

The default rendering should use the region-specific glyph forms, assuming name is in that language. And when generating Chinese-version tiles (preferring name:zh), there should be overriding styles using the Chinese glyph forms.

@mxa

mxa commented Jul 2, 2016

I'm not sure, but the biggest issues for the map might be between Traditional Chinese and Japanese. Maybe also for traditional (Taiwan) vs. simplified (Mainland China) Chinese. The map for Korea has almost no place names with Chinese characters; they are all in Hangeul and there is no overlap with Chinese or Japanese. However, some objects in Korea have Hanja names too. These usually end up in the name:zh tag, which is wrong, but there is no better proposal. See this discussion: https://lists.openstreetmap.org/pipermail/talk-ko/2015-October/000228.html

@pnorman
Collaborator

pnorman commented Jul 2, 2016

Is it possible to add some sort of region-specific selector (based on country or bbox) as a CartoCSS extension? I think that's the basic facility required for using different glyph forms in their respective regions when the preferred language is unspecified.

Not within CartoCSS as it exists now. If it got added we could consider using it.

It's possible to do something with more complicated SQL queries, but this gets messy, and is even messier without defining functions in PostgreSQL, which we avoid.

@pnorman
Collaborator

pnorman commented Jul 2, 2016

And when generating Chinese-version tiles (preferring name:zh), there should be overriding styles using the Chinese glyph forms.

Chinese-specific tiles would require modifying the style, so modifying the font list for a better rendering is pretty easy for someone to do once they've started modifications.


fwiw, I think we're stuck with the Han unification problems with current technologies.

@sommerluk
Collaborator Author

sommerluk commented Aug 2, 2016

The Unicode 9.0.0 core specification http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf has an implementation guideline about language information in plain text, and especially about Han unification, in Chapter 5.10.

@pnorman
Collaborator

pnorman commented Aug 3, 2016

Han unification: Chapter 5.10.

I've read over it and it doesn't help much. In the situations they describe, there is implicit language information because the reader is either Japanese or Chinese and has the corresponding fonts. With server-side rendering (and most client-side rendering) the fonts are supplied, so none of the scenarios match our situation.

I do note that "plain text remains legible in the absence [of format specifications]"

@mboeringa

I do note that "plain text remains legible in the absence [of format specifications]"

And:

"The goal and methods of Han unification were to ensure that the text remained legible."

"There should never be any confusion in Unicode, because the distinctions between the unified characters are all within the range of stylistic variations that exist in each country."

@Artoria2e5

Artoria2e5 commented Apr 12, 2017

Some improvements may be possible based on 5e5fb3b by reducing JP coverage to the Noto/Source Han Sans subset font, and padding it up with an SC or TC variant that has all the glyphs. This way all characters used by Japanese would be rendered the Japanese way, while the rest can be left to be written the Chinese way.

Correction for my comment in #2608: it appears that Japanese, for example, doesn't really use the character "门": https://ja.wiktionary.org/wiki/%E9%97%A8. Since it's possibly still in the subset file (according to the Source Han Sans readme, the subset still covers all the JIS X characters), someone may have to do some font editing to kick it out.

@sommerluk
Collaborator Author

sommerluk commented Apr 22, 2017

Since it's possibly still in the subset file […], someone may have to do some font editing to kick it out.

@Artoria2e5 If I understand you correctly, the proposal with the region-specific subsets is not a solution for our problem, right?

@sommerluk
Collaborator Author

I’ve made some further investigations and updated the issue description (“first comment”).

It seems to me that the only reliable way to support this is having in the OSM database itself the information about the language that was used in the name tag.

@Artoria2e5

@sommerluk Taking subsets can be a good enough solution, as you can isolate characters not (usually) used by one region and give them a writing style from a region that commonly uses them. The region subset files can appear quite a bit too inclusive for any given region, though.

Name tagging is the ideal solution around this.

@springmeyer
Contributor

It seems to me that the only reliable way to support this is having in the OSM database itself the information about the language that was used in the name tag.

This could be paired nicely with Mapnik if Mapnik were extended to dynamically read this value in from the database and pass it to harfbuzz. I've sketched out how that could work at mapnik/mapnik#3655 (comment).

@jojo4u

jojo4u commented Apr 30, 2017

Using a name_lang=[lang code] tag or similar would solve the Han problem and also the duplication of name tags, because the name:[local lang code]=[local name] tag could then be omitted (e.g. name=北京市 plus a language code instead of a duplicated name:zh=北京市).

@sommerluk
Collaborator Author

@springmeyer Thanks!

@sommerluk
Collaborator Author

I’ve written a proposal for language information tagging at https://wiki.openstreetmap.org/wiki/Proposed_features/Language_information_for_name

Feedback is welcome.

@nebulon42
Contributor

@sommerluk I have thought about something similar, but limited to multilingual names: https://wiki.openstreetmap.org/wiki/User:Nebulon42/Multilingual_names

Maybe something there is of value for this problem. Or vice versa :)

@sommerluk
Collaborator Author

@nebulon42 Thanks! I did not know your proposal. Great work!

The syntax is essentially the same: a semicolon-separated list of the language codes that are already used for name:*=.

In addition to your proposal, I simply also admit single-language values. Would you consider my proposal a superset of yours, and would it be enough to also serve your purposes?

@nebulon42
Contributor

I did not know your proposal.

Yes, I drafted it some time ago but did not have the time to push it further.

Would you consider my proposal a superset of yours, and would it be enough to also serve your purposes?

Definitely. If you have the time and energy to push this further, I really appreciate that. If you need some help, please tell me. If anything on my wiki page suits your needs for the proposal, like the example renderings etc., please do not hesitate to use it.

I saw that there is some progress on the Mapnik side. If there is anything that needs to be done for CartoCSS, please create an issue and I will try to get it into the next release.

@sommerluk
Collaborator Author

Yes, that sounds good.

The RFC on the tagging mailing list is done. The multilingual name processing is added as a use case. Overall, the proposal is still quite short, also because my English is not so good. Hopefully that’s not an obstacle…

About Mapnik and CartoCSS: the most important part is getting support for controlling locl via a property in the stylesheet. As far as I know, that’s not done yet. Once it’s done, will it be automatically available in CartoCSS, or is it necessary to add it manually?

@nebulon42
Contributor

If it's only about the property, then adding it to https://github.com/mapnik/mapnik-reference for 3.1.0 (or whichever version it is released in) would be sufficient. Then carto needs to use that updated reference in a new version and it is available.

@sommerluk
Collaborator Author

The voting for a language tag is now open at https://wiki.openstreetmap.org/wiki/Proposed_features/Language_information_for_name

Support is welcome ;-)

@kocio-pl
Collaborator

kocio-pl commented May 6, 2018

There's a current proposal for determining the language of name tags:

https://wiki.openstreetmap.org/wiki/Key:default_language

@pnorman
Collaborator

pnorman commented May 22, 2018

If Mapnik <TextSymbolizer> gets locl support and it makes it into CartoCSS, we'd still have to think hard about it. This would add a decent amount of complexity to the SQL and bump the minimum Mapnik version to 3.1.0 (or something more recent if it takes longer)

@c933103

c933103 commented Mar 19, 2019

IIRC the "beta" character in Greek also has a similar situation? Is there any exhaustive list of all the languages/scripts/glyphs that are affected by Unicode glyph unification?

@sommerluk
Collaborator Author

IIRC the "beta" character in Greek also has a similar situation? Is there any exhaustive list of all the languages/scripts/glyphs that are affected by Unicode glyph unification?

To both questions: I don’t know, nor did I find a good answer searching the web.

@bdon

bdon commented Aug 21, 2020

Working on Step 2 above here: mapnik/mapnik#3655 (comment)

For Step 1, my feeling is that for the zh-hant locale this problem is so pervasive that an OSM-data-based approach isn't realistic: essentially any character that uses the 辶 radical is affected, including common names of linear features like 大道 (Boulevard), 步道 (Trail), etc. (Before/after using my lang fork:)

[Screenshot: rendering before using the lang fork]

[Screenshot: rendering after using the lang fork]

For the region-based approach, I'm skeptical that polygons will result in an elegant implementation; what about a raster/bitmap solution? For example, a GeoTIFF where each pixel encodes an 8-bit value corresponding to a BCP-47 language tag, which can be sampled for every symbolizer. For my own uses I would probably implement this in-memory directly in the Mapnik C++ code, but for OSM Carto I'm not sure how this would fit into the tile rendering path.

A bitmap approximation based on Z14 tiles (one pixel per tile) might be detailed enough and would be 16384×16384 px, which is a reasonable size.
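
A rough sketch of how such a per-pixel lookup could work, assuming the raster has already been produced. The grid contents, the tag table and the one-pixel-per-Z14-tile resolution are all assumptions here, not something that exists in openstreetmap-carto today.

```python
import math
import numpy as np

# One pixel per zoom-14 tile: a 16384 x 16384 uint8 grid where each value
# is an index into a small table of BCP-47 language tags (e.g. decoded
# from a GeoTIFF beforehand).
ZOOM = 14
SIZE = 2 ** ZOOM                                       # 16384 px per side
LANG_TAGS = ["und", "ja", "ko", "zh-Hans", "zh-Hant"]  # index 0 = unknown
lang_grid = np.zeros((SIZE, SIZE), dtype=np.uint8)     # placeholder data

def lang_for_point(lon, lat):
    """Return the language tag stored for the zoom-14 tile that contains
    the given WGS84 coordinate (standard slippy-map tile arithmetic)."""
    x = int((lon + 180.0) / 360.0 * SIZE)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * SIZE)
    x = min(max(x, 0), SIZE - 1)
    y = min(max(y, 0), SIZE - 1)
    return LANG_TAGS[lang_grid[y, x]]

print(lang_for_point(121.5, 25.0))  # "und" until the grid is actually filled
```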
