[zhmetrics-uptex] Enhancement of upTeX metrics #292

aminophen · 2017-07-16T23:32:45Z

Hello, this is Japanese TeX Development Community.

We are considering improving the standard upTeX metrics for simplified Chinese, bundled in https://github.com/texjporg/uptex-fonts . We found that CTeX-org has a derivative called zhmetrics-uptex, so I decided to ask you what would be most desirable.

Revisiting the currently available metrics (upschrm-{h,v}, upschgt-{h,v}) raised the following questions:

Add 〈·〉U+00B7 MIDDLE DOT to the type of 〈・〉U+30FB KATAKANA MIDDLE DOT ?
- U+00B7 is included in the common Chinese character set, GB2312 01-04 and GBK 0xA1A4. In UniGB-UTF16-H, U+00B7 is mapped to Adobe-GB1 CID+99.
- Currently U+00B7 is not classified into any specific type, simply because U+00B7 is not used in Japanese texts.
- In CLreq, U+00B7 in Chinese texts plays a role similar to U+30FB in Japanese texts, so we guess it would be desirable to add U+00B7 to TYPE 3.
Move 〈：〉U+FF1A and 〈；〉U+FF1B to the type of 〈，〉U+FF0C ?
- It seems that U+FF1A and U+FF1B are positioned left-aligned compared to adjacent Han characters in Chinese fonts like fandol.
- Currently these characters are classified into the type of 〈・〉 U+30FB, simply because they are positioned center-aligned in Japanese fonts. This results in too much space at the right of these characters.
- If these two characters are designed left-aligned in most Chinese fonts, then it might be desirable to move them to TYPE 2.
Add 〈？〉U+FF1F and 〈！〉U+FF01 to the type of 〈。〉U+3002 ?
- The reason is similar to the above, but this might be unsafe: in FandolSong, these two characters are left-aligned, whereas in FandolHei, 〈？〉 is center-aligned (that is, it does not fit in half width).

All these changes have been impossible due to the old implementation of makejvf (Japanese virtual font generator for pTeX/upTeX), whose hard-coded routine is optimized for Japanese fonts. Yesterday I added “enhanced mode” to makejvf (texlive svn r44817), which can output more suitable VF for properly-classified JFM types.

Man-Ting-Fang · 2017-07-18T17:34:07Z

〈：〉, 〈；〉, 〈！〉 and 〈？〉 should be treated differently in horizontal and vertical writing mode. Each of them has a "width" of 0.5zw in horizontal writing mode but 1zw in vertical writing mode.

〈：〉 and 〈；〉 may also have a "width" of 0.5zw in vertical writing mode in some fonts, but if we assume that they have a "width" of 1zw, our font metrics can work for both situations.
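One way to check the horizontal width that a given JFM assigns to these marks is to measure a box. The following is a minimal plain upTeX sketch, not part of the original discussion; it assumes the upschrm-h metric from uptex-fonts is installed and that the file is saved as UTF-8. The macro name \check is made up for illustration.

```tex
% Minimal plain upTeX sketch (run with `uptex`); the file must be UTF-8.
\font\schrmh=upschrm-h   % horizontal JFM for simplified Chinese
\schrmh                  % make it the current CJK font
\dimen0=0.5zw            % half of the full width (zw) of that font
% \check is a hypothetical helper: it boxes one character and
% writes its width next to 0.5zw on the terminal and in the log.
\def\check#1{\setbox0\hbox{#1}%
  \message{[width \the\wd0, 0.5zw = \the\dimen0]}}
\check{：}\check{；}\check{！}\check{？}
\bye
```

Only the horizontal metric is measured here; checking the vertical metric would need a vertical box and is left out of this sketch.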

Man-Ting-Fang · 2017-07-18T17:46:47Z

〈：〉, 〈；〉, 〈！〉 and 〈？〉 are all left-aligned in simplified Chinese, cf. GBT 15834-2011. Fandol fonts (and some other simplified Chinese fonts) are wrong. SourceHanSerif / SourceHanSans fonts are right.

aminophen · 2017-07-21T15:05:38Z

Thanks for your comments.

〈：〉, 〈；〉, 〈！〉 and 〈？〉 should be treated differently in horizontal and vertical writing mode. Each of them has a "width" of 0.5zw in horizontal writing mode but 1zw in vertical writing mode.

Actually I don't know what is the desired output for Chinese vertical writing; could you show me a sample page of Chinese books printed in vertical writing? I'd like to see how these punctuations appear in vertical writing.

aminophen · 2017-07-21T15:06:42Z

〈：〉, 〈；〉, 〈！〉 and 〈？〉 are all left-aligned in simplified Chinese, cf. GBT 15834-2011. Fandol fonts (and some other simplified Chinese fonts) are wrong. SourceHanSerif / SourceHanSans fonts are right.

This info is also helpful; Thank you so much!

Man-Ting-Fang · 2017-07-22T09:06:35Z

Many photos of old books can be found on this website:

http://www.kongfz.com/

For example, there are some photos of 《三國演義》; these books were printed in the early 1950s:

http://book.kongfz.com/item_pic_191860_617117234/
http://book.kongfz.com/item_pic_10445_645749682/
http://book.kongfz.com/item_pic_89175_676091358/
http://book.kongfz.com/item_pic_41666_553943157/

In fact, the output of Chinese punctuations depends on two "style"s: the font design style and the punctuation compression style. The font design style decides how the punctuation is positioned in the square (character box), and the punctuation compression style decides the spacing between the squares of punctuation marks and/or Han characters. Both of them have several possibilities and there are several possible combinations of them. (This is true for both traditional and simplified Chinese; the centre-aligned style of traditional Chinese in Taiwan is merely one case.)

Unfortunately, as far as I know, there is still no package in the TeX world that takes care of both styles; instead, only the punctuation compression style is taken into consideration. Clerk Ma said that he was working on this problem (with a new method), but his work has not been published yet. I'm writing a package for (e-)upTeX and ApTeX and trying to resolve this problem with the traditional method (JFM and JVF).

But on the other hand, I think there is no necessity to support all those possibilities in uptex-fonts: The most common one --- the one that we discussed and saw in the photos above --- and the centre-aligned style (for traditional Chinese in Taiwan) are enough.

aminophen · 2017-07-22T10:20:28Z

Many photos of old books can be found on this website:

Great! Now I see clearly how punctuations appear in vertical writing.

In fact, the output of Chinese punctuations depends on two "style"s: the font design style and the punctuation compression style. [...] Both of them have several possibilities and there are several possible combinations of them.

Thanks for clarification.

Unfortunately, as far as I know, there is still no package in the TeX world that takes care of both styles; instead, only the punctuation compression style is taken into consideration. [...] I'm writing a package for (e-)upTeX and ApTeX and trying to resolve this problem with the traditional method (JFM and JVF).

That'll be great. Please let me know if I can help you ;-)

But on the other hand, I think there is no necessity to support all those possibilities in uptex-fonts: The most common one --- the one that we discussed and saw in the photos above --- and the centre-aligned style (for traditional Chinese in Taiwan) are enough.

Yes, our plan is similar, except that we may include two "style"s for traditional Chinese to support two font design styles. This is because centre-aligned style (which is going to be named "uptchc*" series) is the most suitable for traditional Chinese but we have distributed the left-aligned style (existing "uptch*" series) for almost 10 years. As a result, there will be

"upsch*" series for simplified Chinese (-> update might be followed by zhmetrics-uptex ???)
"uptch*" series for traditional Chinese, left-aligned punctuations
"uptchc*" series for traditional Chinese, centre-aligned punctuations

The source files for "upsch*" and "uptch*" will be shared with each other, so all we have to do is to design two JFM/JVF sets.

Man-Ting-Fang · 2017-07-22T12:20:54Z

Note of quotation marks of Chinese:

But in practice, the Taiwan style quotation marks may also be used in vertical typesetting in Mainland. For example, this is 《十三經注疏・論語注疏》 (北京大學出版社, 2000):

Even 「…『…』…」 is used in horizontal simplified Chinese on the Internet and in a few published books in Mainland (sorry, I can't remember the name of the book).

Note of middle dot of Chinese:

There are five characters that may be used as a middle dot: U+00B7, U+2022, U+2027, U+30FB, U+FF0E, and there is no official standard. U+00B7 is mainly used in Mainland; U+FF0E was mainly used in Taiwan (note that U+FF0E is also centre-aligned in Taiwan, so it looks like a middle dot). But I don't know which character is used by most people in Hong Kong and Macau, and in Taiwan nowadays.

U+30FB KATAKANA MIDDLE DOT [・] is tightly connected to the JIS code system, it is not recommended to use this.
--- https://www.w3.org/TR/clreq/

This is only the clreq team's opinion.

Man-Ting-Fang · 2017-07-22T12:24:43Z

That'll be great. Please let me know if I can help you ;-)

Thank you very much!

aminophen · 2017-07-22T12:50:55Z

With regard to middle dot, current upjisr-h.pl (meant for left-aligned punctuations) has only U+30FB in TYPE 3

(CHARSINTYPE O 3
   ・ ： ；
   )

and uptchcr-h.pl (meant for centre-aligned punctuations) has U+30FB and U+FF0E in TYPE 3

(CHARSINTYPE O 3
   、 ， 。 ． ・ ： ；
   )

U+00B7 is mainly used in Mainland

Then, I'll add U+00B7 to TYPE 3.

I'm not sure about U+2022 and U+2027; these are marked as "bullet" and "hyphenation point", so truncating widths and applying JFM glue around these characters like TYPE 3 might be odd ...

aminophen · 2017-07-22T13:17:38Z

I'll add U+00B7 to TYPE 3.

Done in texjporg/uptex-fonts@0f919e2. (The .tfm and .vf files have not been regenerated yet.)

Man-Ting-Fang · 2017-07-22T14:00:12Z

I'm not sure about U+2022 and U+2027; these are marked as "bullet" and "hyphenation point", so truncating widths and applying JFM glue around these characters like TYPE 3 might be odd ...

Yes, my replies are only notes, so I titled them "Note of...". Actually, there are some even odder usages: using two consecutive U+2500 (U+2502) as a horizontal (vertical) long dash (破折號) and two consecutive U+22EF (U+22EE) as a horizontal (vertical) ellipsis... Certainly, this is also a note, not a suggestion.

aminophen · 2017-07-22T14:16:04Z

Very helpful! Then I will not add them as a standard.

Move 〈：〉U+FF1A and 〈；〉U+FF1B to the type of 〈，〉U+FF0C ?

Add〈？〉U+FF1F and 〈！〉U+FF01 to the type of 〈。〉U+3002 ?

Based on this comment and this comment, I moved 〈：〉 and 〈；〉 to TYPE 2, and added 〈！〉 and 〈？〉 to TYPE 4, both of which are restricted to horizontal writing of simplified Chinese (texjporg/uptex-fonts@daa2e3c).

I regenerated JFM/JVF based on the above commits; if you are interested, please test it. I appreciate any comments.

uptex-fonts-test20170722.zip

If you want to build these .vf files yourself, the makejvf -e option is required.

Man-Ting-Fang · 2017-07-26T08:05:15Z

It seems that this line should be added to ukinsoku.tex (plain TeX) and ukinsoku.dtx (LaTeX):

\prebreakpenalty"B7=10000

Typos:

The last paragraph of README_ASCII_Corp.txt:

dvip -> dvips

README_uptex_font.txt:

ambiguos -> ambiguous

lator -> later

aminophen · 2017-07-26T11:17:49Z

\prebreakpenalty"B7=10000

That change has already been planned, but I'm wondering about possible side effects. ukinsoku.tex also has \xspcode"b7=3, which is meant for 8-bit encodings (such as T1), so first we have to examine what happens when these kinsoku parameters take effect.
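To see which values are in effect before and after such a change, the parameters can be written to the log. This is a minimal plain upTeX sketch, not from the thread, assuming the format has loaded ukinsoku.tex as usual; the printed values depend on the installed version, so none are shown here.

```tex
% Minimal plain upTeX sketch (run with `uptex`).
% \prebreakpenalty is looked up by character code for CJK and
% non-CJK tokens alike (as noted in this thread); \xspcode is the
% setting for 8-bit characters.
\message{[prebreakpenalty "B7: \the\prebreakpenalty"B7]}
\message{[xspcode "B7: \the\xspcode"B7]}
\bye
```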

Typos

Thanks. I'll fix them soon in uptex-base repo.

aminophen · 2017-07-29T11:49:07Z

It seems that this line should be added to ukinsoku.tex (plain TeX) and ukinsoku.dtx (LaTeX):
\prebreakpenalty"B7=10000

That change has already been planned, but I'm wondering about possible side effects. [...] first we have to examine what happens when these kinsoku parameters take effect.

I think I found a side effect. Consider the following example, where the code point 0xB7 appears twice, from two different encodings (T1 and Unicode):

\documentclass{ujarticle}
\usepackage{lmodern}
\pagestyle{empty}
\AtBeginDvi{\special{pdf:mapline uprml-h UniJIS-UTF16-H :0:simsun.ttc}}

% \prebreakpenalty is applied to both CJK and non-CJK tokens
\prebreakpenalty"B7=10000\relax

\parindent0pt
\textwidth7zw
\begin{document}

% By default, U+00B7 is treated as non-CJK token
字字字字字字字\char"B7 abc\par  % unexpected
% By using \kchar, U+00B7 is treated as CJK token
字字字字字字字\kchar"B7 abc\par

\end{document}

Note about upTeX spec: When \char primitive is used, the character is treated as a CJK token or a non-CJK token, depending on the character code. When \kchar primitive is used, the character is always treated as a CJK token.

In this example, both line breaks before 0xB7 ("ů" and "·") are disabled, but the first one (before "ů") seems to be unexpected (though the character "ů" rarely comes at the beginning of any words).

Man-Ting-Fang · 2017-07-30T08:06:02Z

This is really a problem... If we use \char"B7, the line breaking is unexpected, but we can copy the character ů from PDF; if we use \r{u}, the line breaking is expected, but we get °u instead of ů from PDF... However, ů can be searched in both cases. (Via Adobe Reader.) So I think the real problem is to make a choice.

Incidentally, U+00B7 cannot be typeset correctly when using some fonts:

\special{pdf:mapline upstsl-h unicode SourceHanSerifSC-Regular.otf}

\font\schrmh=upschrm-h

\def\test{中文·中文·English·English·中文}

% IPAexMincho

\test

% SourceHanSerifSC-Regular

\schrmh\test

\bye

aminophen · 2017-07-30T14:24:45Z

If we use \char"B7, the line breaking is unexpected, but we can copy the character ů from PDF; if we use \r{u}, the line breaking is expected, but we get °u instead of ů from PDF... [...] So I think the real problem is to make a choice.

Unfortunately, if we add \usepackage[T1]{fontenc}, both \r{u} and \char"B7 give the same result ů, instead of °u, and the line breaking is unexpected in both cases. We cannot choose which one to use. However, as I already noted, the character ů rarely comes at the beginning of any word, so \prebreakpenalty"B7=10000 might cause no problem in practical use.

Incidentally, U+00B7 cannot be typeset correctly when using some fonts:

That is not something we can handle; (u)pTeX assumes fixed-width fonts, but the real fonts (IPAexMincho and SourceHanSerifSC-Regular) have U+00B7 in proportional width. If we are going to use such a proportional-width font, we have to prepare a specific JFM for it.

aminophen · 2017-07-30T15:48:49Z

the character ů rarely comes at the beginning of any words, so there might be no problem about \prebreakpenalty"B7=10000 for practical use.

I forgot to note that some other latin encodings might have a character other than ů in 0xB7 ...

Man-Ting-Fang · 2017-07-30T16:08:16Z

Sorry, my expression was not clear. What I meant is that the problem is to choose the lesser of two evils: \prebreakpenalty"B7=10000 is suitable for Chinese, but not for languages whose alphabet contains ů (and maybe some other characters), e.g., Czech; without \prebreakpenalty"B7=10000, the situation is reversed.
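For a document where Czech text matters more than the Chinese middle dot, the penalty could be switched off again by the user. This is a minimal upLaTeX sketch, not from the thread; it assumes that assigning 0 clears the kinsoku entry for code "B7 and that the preamble assignment takes effect after the ukinsoku settings loaded by the format.

```tex
\documentclass{ujarticle}
\usepackage[T1]{fontenc}
% Assumption: 0 clears the penalty set in ukinsoku for code "B7.
% This affects the T1 slot "B7 (ů) and the CJK token U+00B7 (·) alike,
% so the Chinese middle dot loses its protection as well.
\prebreakpenalty"B7=0\relax

\parindent0pt
\textwidth7zw
\begin{document}
% With the penalty cleared, a break before ů should be permitted again.
字字字字字字字\r{u}abc\par
\end{document}
```

The trade-off is the one described above: the same assignment cannot serve both uses of code "B7 at once.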

Thank you for your explanation!

aminophen · 2017-07-31T03:52:34Z

What I meant is that the problem is to choose the lesser of two evils

That's true... but the problem is not only about 0xB7.

Some people use Latin double quotes `` and '' (charcode 92 and 34 in OT1, 16 and 17 in T1) instead of “ and ” even in CJK text, and they requested that penalties be added for these Latin double quotes:

\documentclass{ujarticle}
\textwidth10zw

% penalties for OT1 double quotes
\postbreakpenalty92=10000
\prebreakpenalty34=10000

\begin{document}
\parindent0zw
字字字字字字字字字``字字字字''字字字。
\end{document}

However, we rejected the request, since these penalties are invalid for T1 and others. I think the current problem for middle dot (0xB7) is similar.

aminophen · 2017-08-02T09:32:13Z

I decided to add \prebreakpenalty"B7=10000 to ukinsoku.tex (in both uptex-base and uplatex), to give priority to CJK tokens in a consistent manner.

Here is the reason: we already have the following lines in ukinsoku.tex:

\postbreakpenalty`«=10000
\prebreakpenalty`»=10000

Both characters are defined in recent Japanese standard (JIS X 0213), and they are assigned to U+00AB and U+00BB (Latin-1 block) in Unicode; the situation for U+00B7 is quite similar to U+00AB and U+00BB. I guess, the only reason why we don't have U+00B7 in ukinsoku.tex is that U+00B7 is rarely used in Japan compared to U+30FB. When Chinese and Korean are taken into consideration, it seems natural to add a penalty to U+00B7 for consistency.

Man-Ting-Fang · 2017-08-02T11:14:31Z

I decided to add \prebreakpenalty"B7=10000 to ukinsoku.tex (in both uptex-base and uplatex), to give priority to CJK tokens in a consistent manner.

Good news!

Man-Ting-Fang · 2017-08-02T11:31:51Z

A personal question: What is the reason for the following lines? I think \prebreakpenalty is better.

\postbreakpenalty`\%=500
\postbreakpenalty`\&=500
\postbreakpenalty`％=200
\postbreakpenalty`＆=200

aminophen · 2017-08-02T11:39:02Z

What is the reason for the following lines? I think \prebreakpenalty is better.
\postbreakpenalty`\%=500
\postbreakpenalty`\&=500
\postbreakpenalty`％=200
\postbreakpenalty`＆=200

I don't know either ;-) These codes were originally written by ASCII Corporation in 1995 or earlier (see kinsoku.dtx in texjporg/platex) and remain unchanged. I think we should fix them ...
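For reference, \prebreakpenalty discourages a line break before the character and \postbreakpenalty discourages one after it, and in (u)pTeX a character holds only one of the two at a time. The suggested direction could be sketched as follows; this is an illustration that keeps the values quoted above, not the change actually committed upstream, which is not shown in this thread.

```tex
% Sketch only: same characters and values as the quoted lines,
% switched from \postbreakpenalty to \prebreakpenalty so that a
% break before the sign (e.g. between "50" and the percent sign)
% is discouraged instead of a break after it.
\prebreakpenalty`\%=500
\prebreakpenalty`\&=500
\prebreakpenalty`％=200
\prebreakpenalty`＆=200
```

Because a later assignment replaces the earlier kind of penalty for the same character, these lines would override the existing \postbreakpenalty settings if placed after them.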

Man-Ting-Fang · 2017-08-03T05:55:18Z

The following lines are copied from ukinsoku, and I guess there is something wrong:

%%
%% inhibitxspcode  JIS X 0213
%%
\inhibitxspcode`¡=2
\inhibitxspcode`¿=2
%%
%% inhibitxspcode  JIS X 0212
%%
%%\inhibitxspcode`¡=1
%%\inhibitxspcode`¿=1

aminophen · 2017-08-03T10:46:38Z

The \inhibitxspcode setting for JIS X 0212 (commented out) does seem wrong. Fixed in uptex-base and uplatex.

Please wait for the \postbreakpenalty -> \prebreakpenalty fixes for % and &...

aminophen · 2017-08-04T16:36:14Z

% and &

Done in texjporg/ptex-base#5 (see commit references).

aminophen · 2017-08-08T11:47:34Z

Now I hope upschr-h.pl and upschr-v.pl are ready for Chinese; if there is someone interested, please incorporate them as upzhm-h.pl and upzhm-v.pl.

When the new zhmetrics-uptex is going to be released, I recommend building the new .vf files using makejvf -e -3 -i -u gb, not makejvf -i -u gb:

The option -3 is a flag to use code points >=U+10000, that is, non-BMP characters.
- Makejvf does not output >=U+10000 by default (because few DVI drivers supported VF with >=U+10000 in the old days, especially when upTeX development started in 2007).
- These days, however, dvipdfmx and many others support VF with >=U+10000 (= in DVI language set3), so we don't have to worry about any side effects of using -3.
The option -e is a flag to set proper shift amount in VF for Chinese.
- The option -e is named "enhanced mode", which is available since makejvf 20170716 (r44817 in TeX Live 2018/dev). It reads GLUEKERN table in JFM files (= counterpart for LIGKERN table in TFM) and sets proper shift amount (= in DVI language moveright).
- Current version of makejvf will show many warnings (Conflicting MOVERIGHT value for code ...), but (I expect) it should output correct results optimized for Chinese.

aminophen · 2019-06-01T01:59:27Z

I noticed that #369 is already merged; thanks!

aminophen changed the title ~~[ahmetrics-uptex]~~ [zhmetrics-uptex] Enhancement of upTeX metrics Jul 16, 2017

leo-liu self-assigned this Jul 17, 2017

leo-liu added the feature request label Jul 17, 2017

aminophen mentioned this issue Jul 19, 2017

中韓フォントの JFM texjporg/uptex-fonts#2

Closed

aminophen mentioned this issue Jul 22, 2017

upjisr-v のコロン texjporg/uptex-fonts#5

Closed

aminophen added a commit to texjporg/uptex-fonts that referenced this issue Jul 28, 2017

README_ASCII_Corp.txt: typo (CTeX-org/ctex-kit#292)

86dbadf

aminophen added a commit to texjporg/uptex-fonts that referenced this issue Jul 28, 2017

README_uptex_font.txt: typo (CTeX-org/ctex-kit#292)

a00648d

aminophen added a commit to texjporg/uptex-base that referenced this issue Aug 2, 2017

ukinsoku.tex: Add prebreakpenalty for U+00B7 (CTeX-org/ctex-kit#292)

0dd0315

aminophen added a commit to texjporg/uplatex that referenced this issue Aug 2, 2017

ukinsoku.dtx: Add prebreakpenalty for U+00B7 (CTeX-org/ctex-kit#292)

df07888

aminophen mentioned this issue Aug 2, 2017

禁則ペナルティの設定間違い？ texjporg/ptex-base#5

Closed

aminophen mentioned this issue Aug 9, 2017

冒号和分号的位置 clerkma/ptex-ng#21

Open

aminophen mentioned this issue Jun 11, 2018

upzhm-{h,v}.pl: sync with texjporg/uptex-fonts/2018-03-28 #369

Merged

aminophen closed this as completed Jun 1, 2019
