[zhmetrics-uptex] Enhancement of upTeX metrics #292

Closed
aminophen opened this issue Jul 16, 2017 · 29 comments

@aminophen
Contributor

aminophen commented Jul 16, 2017

Hello, this is the Japanese TeX Development Community.

We are considering improving the standard upTeX metrics for simplified Chinese, bundled in https://github.com/texjporg/uptex-fonts . We found that CTeX-org has a derivative called zhmetrics-uptex, so I decided to ask you what would be most desirable.

On revisiting the currently available metrics (upschrm-{h,v}, upschgt-{h,v}), the following questions arise:

  • Add 〈·〉U+00B7 MIDDLE DOT to the type of 〈・〉U+30FB KATAKANA MIDDLE DOT?
    • U+00B7 is included in the common Chinese character sets, GB2312 01-04 and GBK 0xA1A4. In UniGB-UTF16-H, U+00B7 is mapped to Adobe-GB1 CID+99.
    • Currently U+00B7 is not classified into any specific type, simply because U+00B7 is not used in Japanese texts.
    • According to CLreq, U+00B7 in Chinese texts plays a role similar to U+30FB in Japanese texts, so we guess it would be desirable to add U+00B7 to TYPE 3.
  • Move 〈：〉U+FF1A and 〈；〉U+FF1B to the type of 〈，〉U+FF0C?
    • It seems that U+FF1A and U+FF1B are positioned left-aligned relative to adjacent Han characters in Chinese fonts like Fandol.
    • Currently these characters are classified into the type of 〈・〉U+30FB, simply because they are positioned center-aligned in Japanese fonts. This results in too much space to the right of these characters.
    • If these two characters are designed left-aligned in most Chinese fonts, then it might be desirable to move them to TYPE 2.
  • Add 〈？〉U+FF1F and 〈！〉U+FF01 to the type of 〈。〉U+3002?
    • The reason is similar to the above; however, this might be unsafe: in FandolSong, these two characters are left-aligned, whereas in FandolHei, 〈？〉 is center-aligned (that is, it does not fit in half width).

All these changes were previously impossible because of the old implementation of makejvf (the Japanese virtual font generator for pTeX/upTeX), whose hard-coded routine is optimized for Japanese fonts. Yesterday I added an “enhanced mode” to makejvf (texlive svn r44817), which can output a more suitable VF for properly classified JFM types.

aminophen changed the title: [ahmetrics-uptex] -> [zhmetrics-uptex] Enhancement of upTeX metrics (Jul 16, 2017)
leo-liu self-assigned this Jul 17, 2017

@Man-Ting-Fang

〈：〉, 〈；〉, 〈！〉 and 〈？〉 should be treated differently in horizontal and vertical writing modes. Each of them has a "width" of 0.5zw in horizontal writing mode but 1zw in vertical writing mode.

〈：〉 and 〈；〉 may also have a "width" of 0.5zw in vertical writing mode in some fonts, but if we assume that they have a "width" of 1zw, our font metrics can work for both situations.

[image: pic]
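
One way to check which horizontal width a given JFM assigns to these four characters is to measure them directly. The following is a minimal sketch, assuming upLaTeX and the upschrm-h metric mentioned earlier in this thread; the macro name \schrmh is only illustrative, and the vertical metric would need a separate check in vertical direction.

```latex
% Compile with uplatex; the widths are written to the terminal and log.
\documentclass{ujarticle}
\begin{document}
\font\schrmh=upschrm-h   % horizontal JFM for simplified Chinese
{\schrmh
 \setbox0\hbox{：}\typeout{U+FF1A: \the\wd0}%
 \setbox0\hbox{；}\typeout{U+FF1B: \the\wd0}%
 \setbox0\hbox{！}\typeout{U+FF01: \the\wd0}%
 \setbox0\hbox{？}\typeout{U+FF1F: \the\wd0}%
 % reference value: half of the full width of the selected metric
 \typeout{0.5zw = \the\dimexpr 0.5zw\relax}}
\end{document}
```

Each reported width can then be compared with the 0.5zw value printed on the last line.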

@Man-Ting-Fang

〈：〉, 〈；〉, 〈！〉 and 〈？〉 are all left-aligned in simplified Chinese, cf. GB/T 15834-2011. Fandol fonts (and some other simplified Chinese fonts) are wrong. SourceHanSerif / SourceHanSans fonts are right.

@aminophen
Contributor Author

Thanks for your comments.

> 〈：〉, 〈；〉, 〈！〉 and 〈？〉 should be treated differently in horizontal and vertical writing modes. Each of them has a "width" of 0.5zw in horizontal writing mode but 1zw in vertical writing mode.

Actually I don't know what the desired output is for Chinese vertical writing; could you show me a sample page of a Chinese book printed in vertical writing? I'd like to see how these punctuation marks appear in vertical writing.

@aminophen
Contributor Author

> 〈：〉, 〈；〉, 〈！〉 and 〈？〉 are all left-aligned in simplified Chinese, cf. GB/T 15834-2011. Fandol fonts (and some other simplified Chinese fonts) are wrong. SourceHanSerif / SourceHanSans fonts are right.

This info is also helpful; thank you so much!

@Man-Ting-Fang

Many photos of old books can be found on this website:

http://www.kongfz.com/

For example, there are some photos of 《三國演義》; these books were printed in the early 1950s:

http://book.kongfz.com/item_pic_191860_617117234/
http://book.kongfz.com/item_pic_10445_645749682/
http://book.kongfz.com/item_pic_89175_676091358/
http://book.kongfz.com/item_pic_41666_553943157/

In fact, the output of Chinese punctuations depends on two "style"s: the font design style and the punctuation compression style. The font design style decides how a punctuation mark is positioned in its square (character box), and the punctuation compression style decides the spacing between the squares of punctuation marks and/or Han characters. Both of them have several possibilities and there are several possible combinations of them. (This is true for both traditional and simplified Chinese; the centre-aligned style of traditional Chinese in Taiwan is merely one case.)

Unfortunately, as far as I know, there is still no package that takes care of both of the two styles in the TeX world; instead, only the punctuation compression style is taken into consideration. Clerk Ma said that he was working on this problem (with a new method), but his work is still not published. I'm writing a package for (e-)upTeX and ApTeX and trying to resolve this problem with the traditional method (JFM and JVF).

But on the other hand, I think there is no necessity to support all those possibilities in uptex-fonts: The most common one --- the one that we discussed and saw in the photos above --- and the centre-aligned style (for traditional Chinese in Taiwan) are enough.

@aminophen
Contributor Author

> Many photos of old books can be found on this website:

Great! Now I see clearly how punctuation marks appear in vertical writing.

> In fact, the output of Chinese punctuations depends on two "style"s: the font design style and the punctuation compression style. [...] Both of them have several possibilities and there are several possible combinations of them.

Thanks for the clarification.

> Unfortunately, as far as I know, there is still no package that takes care of both of the two styles in the TeX world; instead, only the punctuation compression style is taken into consideration. [...] I'm writing a package for (e-)upTeX and ApTeX and trying to resolve this problem with the traditional method (JFM and JVF).

That'll be great. Please let me know if I can help you ;-)

> But on the other hand, I think there is no necessity to support all those possibilities in uptex-fonts: The most common one --- the one that we discussed and saw in the photos above --- and the centre-aligned style (for traditional Chinese in Taiwan) are enough.

Yes, our plan is similar, except that we may include two "style"s for traditional Chinese to support two font design styles. This is because the centre-aligned style (which is going to be named the "uptchc*" series) is the most suitable for traditional Chinese, but we have distributed the left-aligned style (the existing "uptch*" series) for almost 10 years. As a result, there will be

  • "upsch*" series for simplified Chinese (-> update might be followed by zhmetrics-uptex ???)
  • "uptch*" series for traditional Chinese, left-aligned punctuations
  • "uptchc*" series for traditional Chinese, centre-aligned punctuations

The source files for "upsch*" and "uptch*" will be shared with each other, so all we have to do is to design two JFM/JVF sets.

@Man-Ting-Fang

Note of quotation marks of Chinese:

[image: qm]

But in practice, the Taiwan style quotation marks may also be used in vertical typesetting in Mainland. For example, this is 《十三經注疏・論語注疏》 (北京大學出版社, 2000):

[image: twvsml]

Even 「…『…』…」 is used in horizontal simplified Chinese on the Internet and in a few published books in Mainland (sorry, I can't remember the name of the book).

Note of middle dot of Chinese:

There are five characters that may be used as a middle dot: U+00B7, U+2022, U+2027, U+30FB, U+FF0E, and there is no official standard. U+00B7 is mainly used in Mainland; U+FF0E was mainly used in Taiwan (note that U+FF0E is also centre-aligned in Taiwan, so it looks like a middle dot). But I don't know which character is used by most people in Hong Kong and Macau, and in Taiwan nowadays.

> U+30FB KATAKANA MIDDLE DOT [・] is tightly connected to the JIS code system, it is not recommended to use this.
> --- https://www.w3.org/TR/clreq/

This is only the clreq team's opinion.
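
The five candidates listed above can be compared side by side with a small test file. This is only a sketch, assuming upLaTeX with the default ujarticle fonts; whether each glyph is actually present depends on the font mapped to the metric, and \kchar is used so that every code point is handled as a CJK token.

```latex
% Compile with uplatex, then convert the DVI (for example with dvipdfmx).
\documentclass{ujarticle}
\begin{document}
\parindent0pt
中文\kchar"00B7 中文\par  % U+00B7 MIDDLE DOT
中文\kchar"2022 中文\par  % U+2022 BULLET
中文\kchar"2027 中文\par  % U+2027 HYPHENATION POINT
中文\kchar"30FB 中文\par  % U+30FB KATAKANA MIDDLE DOT
中文\kchar"FF0E 中文\par  % U+FF0E FULLWIDTH FULL STOP
\end{document}
```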

@Man-Ting-Fang

> That'll be great. Please let me know if I can help you ;-)

Thank you very much!

@aminophen
Contributor Author

With regard to middle dot, current upjisr-h.pl (meant for left-aligned punctuations) has only U+30FB in TYPE 3

(CHARSINTYPE O 3
   ・ ： ；
   )

and uptchcr-h.pl (meant for centre-aligned punctuations) has U+30FB and U+FF0E in TYPE 3

(CHARSINTYPE O 3
   、 ， 。 ． ・ ： ；
   )

> U+00B7 is mainly used in Mainland

Then, I'll add U+00B7 to TYPE 3.

I'm not sure about U+2022 and U+2027; these are marked as "bullet" and "hyphenation point", so truncating widths and applying JFM glue around these characters like TYPE 3 might be odd ...

@aminophen
Contributor Author

> I'll add U+00B7 to TYPE 3.

Done in texjporg/uptex-fonts@0f919e2. (.tfm and .vf files are not yet regenerated)

@Man-Ting-Fang

Man-Ting-Fang commented Jul 22, 2017

> I'm not sure about U+2022 and U+2027; these are marked as "bullet" and "hyphenation point", so truncating widths and applying JFM glue around these characters like TYPE 3 might be odd ...

Yes, my replies are only notes, so I titled them "Note of...". Actually, there are some even odder usages: two consecutive U+2500 (U+2502) used as a horizontal (vertical) long dash (破折號), and two consecutive U+22EF (U+22EE) used as a horizontal (vertical) ellipsis... Certainly, this is also a note, not a suggestion.

@aminophen
Contributor Author

Very helpful! Then I will not add them as a standard.

> • Move 〈：〉U+FF1A and 〈；〉U+FF1B to the type of 〈，〉U+FF0C?
> • Add 〈？〉U+FF1F and 〈！〉U+FF01 to the type of 〈。〉U+3002?

Based on this comment and this comment, I moved 〈：〉 and 〈；〉 to TYPE 2, and added 〈！〉 and 〈？〉 to TYPE 4; both changes are restricted to horizontal writing of simplified Chinese (texjporg/uptex-fonts@daa2e3c).

I regenerated JFM/JVF based on the above commits; if you are interested, please test them. I would appreciate any comments.

If anyone wants to build these .vf files themselves, the makejvf -e option is required.

@Man-Ting-Fang

It seems that this line should be added to ukinsoku.tex (plain TeX) and ukinsoku.dtx (LaTeX):

\prebreakpenalty"B7=10000

Typos:

The last paragraph of README_ASCII_Corp.txt:

dvip -> dvips

README_uptex_font.txt:

ambiguos -> ambiguous

lator -> later

@aminophen
Contributor Author

> \prebreakpenalty"B7=10000

That change has already been planned, but I'm wondering about possible side effects. ukinsoku.tex also has \xspcode"b7=3, which is meant for 8-bit encodings (like T1), so first we have to examine what happens when these kinsoku parameters become effective.

> Typos

Thanks. I'll fix them soon in uptex-base repo.

aminophen added 2 commits to texjporg/uptex-fonts that referenced this issue Jul 28, 2017

@aminophen
Contributor Author

aminophen commented Jul 29, 2017

> It seems that this line should be added to ukinsoku.tex (plain TeX) and ukinsoku.dtx (LaTeX):
> \prebreakpenalty"B7=10000

> That change has already been planned, but I'm wondering about possible side effects. [...] first we have to examine what happens when these kinsoku parameters become effective.

I think I found a side effect. Consider the following example, where the code point 0xB7 appears twice, in two different encodings (T1 and Unicode):

\documentclass{ujarticle}
\usepackage{lmodern}
\pagestyle{empty}
\AtBeginDvi{\special{pdf:mapline uprml-h UniJIS-UTF16-H :0:simsun.ttc}}

% \prebreakpenalty is applied to both CJK and non-CJK tokens
\prebreakpenalty"B7=10000\relax

\parindent0pt
\textwidth7zw
\begin{document}

% By default, U+00B7 is treated as non-CJK token
字字字字字字字\char"B7 abc\par  % unexpected
% By using \kchar, U+00B7 is treated as CJK token
字字字字字字字\kchar"B7 abc\par

\end{document}

Note about the upTeX spec: When the \char primitive is used, the character is treated as a CJK token or a non-CJK token, depending on the character code. When the \kchar primitive is used, the character is always treated as a CJK token.

In this example, both line breaks before 0xB7 ("ů" and "·") are disabled, but the first one (before "ů") seems to be unexpected (though the character "ů" rarely comes at the beginning of any words).

@Man-Ting-Fang

Man-Ting-Fang commented Jul 30, 2017

This is really a problem... If we use \char"B7, the line breaking is unexpected, but we can copy the character ů from PDF; if we use \r{u}, the line breaking is expected, but we get °u instead of ů from PDF... However, ů can be searched in both cases. (Via Adobe Reader.) So I think the real problem is to make a choice.

Incidentally, U+00B7 cannot be typeset correctly when using some fonts:

\special{pdf:mapline upstsl-h unicode SourceHanSerifSC-Regular.otf}

\font\schrmh=upschrm-h

\def\test{中文·中文·English·English·中文}

% IPAexMincho

\test

% SourceHanSerifSC-Regular

\schrmh\test

\bye

@aminophen
Contributor Author

aminophen commented Jul 30, 2017

> If we use \char"B7, the line breaking is unexpected, but we can copy the character ů from PDF; if we use \r{u}, the line breaking is expected, but we get °u instead of ů from PDF... [...] So I think the real problem is to make a choice.

Unfortunately, if we add \usepackage[T1]{fontenc}, both \r{u} and \char"B7 give the same result ů, instead of °u, and the line breaking is unexpected in both cases. We cannot choose which to use. However, as I already noted, the character ů rarely comes at the beginning of a word, so \prebreakpenalty"B7=10000 might cause no problem in practical use.
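
The observation above can be checked with a variant of the earlier example. This is a minimal sketch, assuming the same ujarticle setup with T1 encoding added; the behaviour described in the comments is the one reported in this comment, not an independently verified result.

```latex
\documentclass{ujarticle}
\usepackage[T1]{fontenc}   % in T1, slot 0xB7 holds the letter ů
\usepackage{lmodern}
\pagestyle{empty}

% applies to both CJK and non-CJK tokens with code 0xB7
\prebreakpenalty"B7=10000\relax

\parindent0pt
\textwidth7zw
\begin{document}
% Reported above: both lines produce ů, and in both cases
% the line break before it is suppressed.
字字字字字字字\r{u}abc\par
字字字字字字字\char"B7 abc\par
\end{document}
```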

> Incidentally, U+00B7 cannot be typeset correctly when using some fonts:

That is not something we can handle; (u)pTeX assumes fixed-width fonts, but the real fonts (IPAexMincho and SourceHanSerifSC-Regular) have U+00B7 in proportional width. If we are going to use such a proportional-width font, we have to prepare a specific JFM for it.

@aminophen
Contributor Author

> the character ů rarely comes at the beginning of a word, so \prebreakpenalty"B7=10000 might cause no problem in practical use.

I forgot to note that some other Latin encodings might have a character other than ů at 0xB7 ...

@Man-Ting-Fang

Man-Ting-Fang commented Jul 30, 2017

Sorry, my expression was not clear. What I meant is that the problem is to choose the lesser of two evils: \prebreakpenalty"B7=10000 is suitable for Chinese, but not for languages whose alphabet contains ů (and maybe some other characters), e.g., Czech; without \prebreakpenalty"B7=10000, the situation is reversed.

Thank you for your explanation!

@aminophen
Contributor Author

aminophen commented Jul 31, 2017

> What I meant is that the problem is to choose the lesser of two evils

That's true... but the problem is not only about 0xB7.

Some people use Latin double quotes `` and '' (charcode 92 and 34 in OT1, 16 and 17 in T1) instead of the corresponding CJK quotation marks even in CJK text, and they requested that penalties be added for these Latin double quotes:

\documentclass{ujarticle}
\textwidth10zw

% penalties for OT1 double quotes
\postbreakpenalty92=10000
\prebreakpenalty34=10000

\begin{document}
\parindent0zw
字字字字字字字字字``字字字字''字字字。
\end{document}

However, we rejected the request, since these penalties are invalid for T1 and others. I think the current problem for middle dot (0xB7) is similar.
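
To illustrate why such penalties cannot be set once for all encodings, the same request under T1 would need different character codes. The following is a sketch, assuming the T1 slot numbers 16 and 17 quoted above; it has not been tested against a real installation.

```latex
\documentclass{ujarticle}
\usepackage[T1]{fontenc}
\textwidth10zw

% penalties for T1 double quotes (slots 16 and 17 instead of 92 and 34)
\postbreakpenalty16=10000
\prebreakpenalty17=10000

\begin{document}
\parindent0zw
字字字字字字字字字``字字字字''字字字。
\end{document}
```

A fixed setting in ukinsoku.tex could match only one of the two encodings, which is the reason given above for rejecting the request.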

@aminophen
Contributor Author

I decided to add \prebreakpenalty"B7=10000 to ukinsoku.tex (in both uptex-base and uplatex), to give priority to CJK tokens in a consistent manner.

Here is the reason; we already have the following lines in ukinsoku.tex:

\postbreakpenalty`«=10000
\prebreakpenalty`»=10000

Both characters are defined in a recent Japanese standard (JIS X 0213), and they are assigned to U+00AB and U+00BB (Latin-1 block) in Unicode; the situation for U+00B7 is quite similar to that of U+00AB and U+00BB. I guess the only reason why we don't have U+00B7 in ukinsoku.tex is that U+00B7 is rarely used in Japan compared to U+30FB. When Chinese and Korean are taken into consideration, it seems natural to add a penalty to U+00B7 for consistency.
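
For a document in which slot 0xB7 is needed as a Latin letter (for example ů in T1-encoded Czech), the new default could presumably be cancelled in the preamble. This is a minimal sketch, assuming that assigning 0 clears a kinsoku entry in (u)pTeX; it is an illustration only, not part of ukinsoku.tex.

```latex
\documentclass{ujarticle}
\usepackage[T1]{fontenc}
\usepackage{lmodern}

% Assumption: a value of 0 removes the penalty set by ukinsoku.tex
\prebreakpenalty"B7=0\relax

\parindent0pt
\textwidth7zw
\begin{document}
字字字字字字字\r{u}abc\par
\end{document}
```

The trade-off is the one discussed earlier in the thread: with this reset, a line break before 〈·〉U+00B7 in Chinese text is allowed again.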

@Man-Ting-Fang

> I decided to add \prebreakpenalty"B7=10000 to ukinsoku.tex (in both uptex-base and uplatex), to give priority to CJK tokens in a consistent manner.

Good news!

@Man-Ting-Fang

Man-Ting-Fang commented Aug 2, 2017

A personal question: What is the reason for the following lines? I think \prebreakpenalty is better.

\postbreakpenalty`\%=500
\postbreakpenalty`\&=500
\postbreakpenalty`％=200
\postbreakpenalty`＆=200

@aminophen
Contributor Author

> What is the reason for the following lines? I think \prebreakpenalty is better.
>
> \postbreakpenalty`\%=500
> \postbreakpenalty`\&=500
> \postbreakpenalty`％=200
> \postbreakpenalty`＆=200

I don't know either ;-) This code was originally written by ASCII Corporation in 1995 or earlier (see kinsoku.dtx in texjporg/platex) and has remained unchanged. I think we should fix it ...

@Man-Ting-Fang

Man-Ting-Fang commented Aug 3, 2017

The following lines are copied from ukinsoku, and I guess there is something wrong:

%%
%% inhibitxspcode  JIS X 0213
%%
\inhibitxspcode`¡=2
\inhibitxspcode`¿=2
%%
%% inhibitxspcode  JIS X 0212
%%
%%\inhibitxspcode`¡=1
%%\inhibitxspcode`¿=1

@aminophen
Contributor Author

aminophen commented Aug 3, 2017

\inhibitxspcode for JIS X 0212 (commented out) seems wrong. Fixed in uptex-base and uplatex.

Please wait for the \postbreakpenalty -> \prebreakpenalty fixes for % and &...

@aminophen
Contributor Author

> % and &

Done in texjporg/ptex-base#5 (see commit references).

@aminophen
Contributor Author

aminophen commented Aug 8, 2017

Now I hope upschr-h.pl and upschr-v.pl are ready for Chinese; if anyone is interested, please incorporate them as upzhm-h.pl and upzhm-v.pl.

When the new zhmetrics-uptex is released, I recommend building the new .vf files using makejvf -e -3 -i -u gb, not makejvf -i -u gb:

  • The option -3 is a flag to use code points >=U+10000, that is, non-BMP characters.
    • Makejvf does not output >=U+10000 by default (because few DVI drivers supported VF with >=U+10000 in the old days, especially when upTeX development started in 2007).
    • However, these days dvipdfmx and many others support VF with >=U+10000 (= set3 in DVI language), so we don't have to worry about any side effects of using -3.
  • The option -e is a flag to set the proper shift amount in VF for Chinese.
    • The option -e is named "enhanced mode" and has been available since makejvf 20170716 (r44817 in TeX Live 2018/dev). It reads the GLUEKERN table in JFM files (= the counterpart of the LIGKERN table in TFM) and sets the proper shift amount (= moveright in DVI language).
    • The current version of makejvf will show many warnings (Conflicting MOVERIGHT value for code ...), but (I expect) it should output correct results optimized for Chinese.

@aminophen
Contributor Author

I noticed that #369 is already merged; thanks!
