[css-text][text-spacing] Extra spacing between ideographs and non-fullwidth punctuation/symbols #9479

xfq · 2023-10-17T05:03:14Z

In the process of trying Chromium's implementation of text-autospace, some interested Chinese developers found an issue: there is no extra spacing between ideographs and non-fullwidth punctuation/symbols. In many cases, this results in unbalanced spacing around embedded non-CJK text in CJK languages.

Examples:

input[type="text"]选择器将选择所有type属性为text的input元素。

在HTML中按语言修改样式的最佳方法是使用CSS的:lang()选择器。

C#是微软公司发布的一种由C和C++衍生出来的面向对象的编程语言，运行于.NET Framework和.NET Core之上。

使用!important是一个坏习惯，应该尽量避免。

我们可以用@font-face来指定自定义字体。

可选链?.是一种访问嵌套对象属性的安全的方式。即使中间的属性不存在，也不会出现错误。

符号^和符号$在正则表达式中具有特殊的含义。

正则表达式中的\b表示词边界。

42%代表百分之四十二，1‰代表千分之一，|a|=2代表a的实际值是±2

{1,2}是{1,2,3}的子集。

在三亚15℃太冷了！ (U+2103 instead of U+00B0 + U+0043)

However, not all non-fullwidth punctuation/symbols require extra spacing. For example, footnote marks like *, †, ‡, and ◊ should not have the extra spacing.

Should we add a new value ideograph-symbol (the name and specific design can be discussed later) to cover this situation? This value may not cover all situations, but it can cover some common ones. For uncommon cases, it would be nice to have a mechanism for author customization.


frivoal · 2023-10-23T01:51:59Z

Hmm, interesting. Many of the use cases you showed above look like things that belong in a <code> element. For those, I'd suggest taking advantage of this spec requirement:

> At element boundaries, the amount of extra spacing introduced between characters is determined by and rendered within the innermost element that contains the boundary.

and doing something like this:

code {
    text-autospace: no-autospace;
    padding: 0 0.125em;
}

But not all fit in that pattern.

> 在三亚15℃太冷了！ (U+2103 instead of U+00B0 + U+0043)

This suggests that:

maybe we should operate on the NFD form, or
maybe we should include the letter-like symbols in the definition of non-ideographic letters.
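A quick check of the first option, as a sketch using Python's standard `unicodedata` module: U+2103 carries a compatibility decomposition rather than a canonical one, so NFD leaves it untouched and only a compatibility form such as NFKD splits it into U+00B0 + U+0043. The helper name `decompose_report` is made up for illustration.

```python
import unicodedata


def decompose_report(ch: str) -> dict:
    """Show how a single character behaves under NFD and NFKD."""
    return {
        "nfd": unicodedata.normalize("NFD", ch),
        "nfkd": unicodedata.normalize("NFKD", ch),
        # Raw decomposition field from the Unicode Character Database.
        "mapping": unicodedata.decomposition(ch),
    }


report = decompose_report("\u2103")  # U+2103 DEGREE CELSIUS
print(report)
```

If the goal is for ℃ to behave like °C, a compatibility normalization would be needed, but that also folds many unrelated characters (fullwidth forms, superscripts), so this is an observation rather than a recommendation.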

For the rest:

C#
C++
.NET Framework
42%
1‰

> Should we add a new value ideograph-symbol

Maybe? That could be a solution.

Is this a case of symbols that must always be autospaced (when autospacing is on)? If so, we should probably just do it.

Does it depend on something which the author is aware of, but that the user agent cannot easily infer? If so, a new value ideograph-symbol is probably the solution.

Does it depend on whether they're next to a string of non-ideographic letters/numbers? If so, it might suggest we need to treat them as some kind of ambiguous/neutral group, which gets grouped together with a string of non-ideographic letters/numbers if any is there, but doesn't introduce spacing by itself if found without non-ideographic letters/numbers.

For example, if 永 represents ideographs, a represents non-ideographic letters, + represents neutrals (like #, +, %, ., etc.), and _ represents autospacing:

永a永 would result in 永_a_永
永+永 would result in 永+永
永+a永 would result in 永_+a_永
永a+永 would result in 永_a+_永
永+a+永 would result in 永_+a+_永
永+永a+永 would result in 永+永_a+_永
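The grouping idea above can be sketched in a few lines of Python. This is only one possible reading of it, under loud assumptions: the `NEUTRALS` set is a hypothetical placeholder (the thread leaves the real list open), "ideograph" is narrowed to the CJK Unified Ideographs block, and the function names are invented for the example.

```python
import unicodedata

# Hypothetical neutral set for illustration only; the real list is undecided.
NEUTRALS = set("#+%.‰")


def classify(ch: str) -> str:
    """Return 'I' (ideograph), 'A' (letter/numeral), 'N' (neutral) or 'O' (other)."""
    if "\u4e00" <= ch <= "\u9fff":  # simplification: one Unicode block only
        return "I"
    if ch in NEUTRALS:
        return "N"
    narrow = unicodedata.east_asian_width(ch) not in ("F", "W")
    if narrow and unicodedata.category(ch)[0] in "LN":
        return "A"
    return "O"


def autospace(text: str, mark: str = "_") -> str:
    """Insert `mark` where a run containing a letter/numeral meets an ideograph."""
    classes = [classify(ch) for ch in text]
    out = []
    i, n = 0, len(text)
    while i < n:
        if classes[i] not in ("A", "N"):
            out.append(text[i])
            i += 1
            continue
        # Collect a maximal run of letters/numerals and neutrals.
        j = i
        while j < n and classes[j] in ("A", "N"):
            j += 1
        has_strong = "A" in classes[i:j]  # neutral-only runs get no spacing
        if has_strong and i > 0 and classes[i - 1] == "I":
            out.append(mark)
        out.append(text[i:j])
        if has_strong and j < n and classes[j] == "I":
            out.append(mark)
        i = j
    return "".join(out)


for sample in ["永a永", "永+永", "永+a永", "永a+永", "永+a+永", "永+永a+永"]:
    print(sample, "->", autospace(sample))
```

Note that the decision for a neutral depends on the whole run it sits in, not just on its immediate neighbour.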

Also, regardless of how we handle that category, as you mentioned that not all symbols would fit into that category, I am a little unsure about how we'd go about maintaining the list of those that do and those that don't.

kojiishi · 2023-10-23T05:17:25Z

Note this was raised to JLTF a while ago but it didn't get much attention there. I'll ping again.

xfq · 2023-10-23T07:10:19Z

> Hmm, interesting. Many of the use cases you showed above look like things that belong in a <code> element. For those, I'd suggest taking advantage of this spec requirement:
>
> > At element boundaries, the amount of extra spacing introduced between characters is determined by and rendered within the innermost element that contains the boundary.
>
> and doing something like this:
>
> code {
>     text-autospace: no-autospace;
>     padding: 0 0.125em;
> }

Why is it rendered within the innermost element that contains the boundary (i.e. padding) instead of margin? If there is a background color in the code element, I think what I would expect to see is that the background in the extra spacing is not filled with background color.

> But not all fit in that pattern.
>
> > 在三亚15℃太冷了！ (U+2103 instead of U+00B0 + U+0043)
>
> This suggests that:
>
> maybe we should operate on the NFD form, or

Maybe. I don't have a counterexample now.

> maybe we should include the letter-like symbols in the definition of non-ideographic letters.

Maybe, but I'm not quite sure about code points like U+2122 (Trade Mark Sign). I personally don't think the extra spacing is needed for it, but I would like to discuss it with the clreq group.

> For the rest:
>
> C#
> C++
> .NET Framework
> 42%
> 1‰
>
> > Should we add a new value ideograph-symbol
>
> Maybe? That could be a solution.
>
> Is this a case of symbols that must always be autospaced (when autospacing is on)? If so, we should probably just do it.
>
> Does it depend on something which the author is aware of, but that the user agent cannot easily infer? If so, a new value ideograph-symbol is probably the solution.
>
> Does it depend on whether they're next to a string of non-ideographic letters/numbers? If so, it might suggest we need to treat them as some kind of ambiguous/neutral group, which gets grouped together with a string of non-ideographic letters/numbers if any is there, but doesn't introduce spacing by itself if found without non-ideographic letters/numbers.
>
> For example, if 永 represents ideographs, a represents non-ideographic letters, + represents neutrals (like #, +, %, ., etc.), and _ represents autospacing:
>
> 永a永 would result in 永_a_永
>
> 永+永 would result in 永+永
>
> 永+a永 would result in 永_+a_永
>
> 永a+永 would result in 永_a+_永
>
> 永+a+永 would result in 永_+a+_永
>
> 永+永a+永 would result in 永+永_a+_永
>
> Also, regardless of how we handle that category, as you mentioned that not all symbols would fit into that category, I am a little unsure about how we'd go about maintaining the list of those that do and those that don't.

I agree that sometimes there is ambiguity, and I'll discuss it with the clreq group.

frivoal · 2023-10-23T07:12:39Z

> Why is it rendered within the innermost element that contains the boundary (i.e. padding) instead of margin?

No particular reason, authors could do what they prefer. I guess my choice here was influenced by the default GitHub style which includes some inline padding in <code> elements.

kojiishi · 2023-10-24T05:41:24Z

/cc @Clqsin45 @nt1m @vitorroriz

Clqsin45 · 2023-10-25T17:30:30Z

Is it possible to somewhat involve UNICODE TEXT SEGMENTATION?
I think many of the examples indicate that user perceptions can be different from predefined symbols in the real world, and it is hard to figure out a perfect solution, as it is natural language which can never have a 100% correct algorithm.

Fortunately they are usually consecutive, so I guess SEGMENTATION, or the grouping logic mentioned by #9479 (comment) , should improve the situation.

xfq · 2023-10-26T02:57:46Z

Could you provide an example of how to use UAX #29 for this use case? Are you referring to the word break algorithm or something else?

yisibl · 2023-11-09T09:18:42Z

> Is this a case of symbols that must always be autospaced (when autospacing is on)?

@frivoal Yes!

Considering that Chrome is in the process of implementing text-autospace, and in order to provide better default typography before it ships, I suggest that the specification, at the current level, only consider adding ideograph-symbol. By default, this value would cover only symbols that are common in natural language:

Temperature symbols: ℃（U+2103）, ℉（U+2109）, °
Math symbols: %, ‰, ‱ (U+2031), +, -(U+002D), −(U+2212), ±, ∓
Currency symbols
Letterlike Symbols: It may be necessary to pick only some of these symbols.

It looks like Apple's OS takes a similar approach, for example:

中心城区在-15至-20℃之间。（From here）
性能提升25%以上。
C#是一种由C和C++衍生出来的面向对象的Smashing语言，运行于.NET Framework和.NET Core之上。

@fantasai Do you know the exact rules for adding space in iOS?

In the absence of a suitable algorithm, in the future it might be worth considering using a @counter-style-like syntax to customize the rules.

@kojiishi If the specification defines a rule for this, would you prioritize implementing it?

kojiishi · 2023-11-09T17:37:06Z

Including symbols makes sense to me, but we probably don't want to include all gc=S*, do we? We'll need to review which one to include and which one not to. During that, we'll need to make sure it doesn't insert spacing to where we don't expect.

Seeing multiple pieces of feedback on the character classes coming up, I'm leaning towards moving this definition to Unicode, as I commented on PR#9503. Doing so should make discussing with Unicode experts easier, and maintaining the list should be easier too.

Regarding the syntax, as several issues are coming up and there is some uncertainty, I think it's better to step back rather than add more. One idea is to include them in both sets without adding a new value. Another idea is to defer detailed classifications of letters and numerals to future versions and start with normal only (IIUC that's what iOS/macOS does). There may be more ways, but stepping back will allow us to think about designs more after implementations ship and we hear web authors' feedback.

/cc @nt1m @vitorroriz @Clqsin45 @kidayasuo

kidayasuo · 2023-11-10T01:02:18Z

I surely believe I am missing some important points, but what is the cause of this oddity?

> some interested Chinese developers found an issue: there is no extra spacing between ideographs and non-fullwidth punctuation/symbols

With the text-autospace: normal property, I thought a small space would be generated between 'ideographs' and characters that are not. This two-state machine should prevent the imbalance that was mentioned. I apologize for the interruption, but I would greatly appreciate it if you could clarify where my misunderstanding lies.

xfq · 2023-11-10T02:47:42Z

@kidayasuo The current default behaviour is ideograph-alpha ideograph-numeric, meaning there is only extra spacing between ideographs and non-ideographic letters/numerals, but there's no extra spacing between ideographs and non-fullwidth punctuation/symbols.
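As a rough illustration of that default, here is a minimal Python sketch that looks only at adjacent character pairs. The class tests are deliberate simplifications, not the spec's definitions: `is_ideograph` covers only the CJK Unified Ideographs block, and `is_alpha_or_numeric` approximates "non-ideographic letters/numerals" with general categories and East Asian Width. All function names are invented for the example.

```python
import unicodedata


def is_ideograph(ch: str) -> bool:
    # Simplification: one Unicode block; the spec's definition is broader.
    return "\u4e00" <= ch <= "\u9fff"


def is_alpha_or_numeric(ch: str) -> bool:
    # Rough stand-in: narrow letters, marks and decimal digits.
    if is_ideograph(ch):
        return False
    cat = unicodedata.category(ch)
    narrow = unicodedata.east_asian_width(ch) not in ("F", "W")
    return narrow and (cat[0] in "LM" or cat == "Nd")


def default_autospace(text: str, mark: str = "_") -> str:
    """Mark boundaries between an ideograph and a letter/numeral only."""
    out = []
    prev = ""
    for cur in text:
        if prev and (
            (is_ideograph(prev) and is_alpha_or_numeric(cur))
            or (is_alpha_or_numeric(prev) and is_ideograph(cur))
        ):
            out.append(mark)
        out.append(cur)
        prev = cur
    return "".join(out)


print(default_autospace("用C和C++衍生"))
print(default_autospace("性能提升25%以上"))
```

In this sketch "C" gets spacing on both sides while "C++" and "25%" get it only on the left, which is the imbalance described in the issue.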

For example, there's no extra spacing for the colon (:), parentheses, "hash sign" (#) and plus signs (+) and the ideograph next to them in the picture below:

yisibl · 2023-11-10T03:39:51Z

> Another idea is to defer detailed classifications of letters and numerals to future versions and start with normal only (IIUC that's what iOS/macOS does.)

@kojiishi Apple's normal adds more symbols, such as space after % in iOS screenshots. This requires them to share the exact rules.

kidayasuo · 2023-11-10T06:06:52Z

@xfq Thank you. Got it. Do you know why ideograph-alpha and ideograph-numeric were created when "non-ideograph" might be all you need? I can't think of scenarios where you create a space only with letters, or only with numbers. They surely do create unbalanced spacing because there are words that start with one kind and end with a different kind, as we are seeing with the examples.

If they are truly useful and needed despite added complexities, I agree ideograph-symbol, or actually ideograph-everything-else would be necessary. And a definition of non-ideograph that covers all characters that are not ideographs would also be super useful.

kojiishi · 2023-11-10T07:58:35Z

> > Another idea is to defer detailed classifications of letters and numerals to future versions and start with normal only (IIUC that's what iOS/macOS does.)
>
> @kojiishi Apple's normal adds more symbols, such as space after % in iOS screenshots. This requires them to share the exact rules.

Right, thanks. Yes, I mean if Apple can disclose it. Sorry if my comment didn't read that way.

xfq · 2023-11-11T08:43:42Z

> @xfq Thank you. Got it. Do you know why ideograph-alpha and ideograph-numeric were created when "non-ideograph" might be all you need? I can't think of scenarios where you create a space only with letters, or only with numbers. They surely do create unbalanced spacing because there are words that start with one kind and end with a different kind, as we are seeing with the examples.
>
> If they are truly useful and needed despite added complexities, I agree ideograph-symbol, or actually ideograph-everything-else would be necessary. And a definition of non-ideograph that covers all characters that are not ideographs would also be super useful.

I agree that adding extra spacing only between ideographs and non-ideographic letters, or only between ideographs and non-ideographic numerals is not useful. However, there are some characters that should not have extra spacing between ideographs and them, such as:

some Chinese/Japanese punctuation, like 。、，：；！？「」（）《》——……
footnote marks like *, †, ‡, and ◊
emoji
whitespace characters

There are also some characters that I'm not sure, such as Taixuanjing symbols (like U+1D300), mahjong tiles (like U+1F000), Xiangqi symbols (like U+1FA60), copyright/copyleft signs, and so on.

kidayasuo · 2023-11-12T08:25:31Z

> However, there are some characters that should not have extra spacing between ideographs and them, such as:

I agree. So, it seems we need 'neutral'? Do we need right/left directionality? Such neutrals create unbalanced spacing if/when they are used as a part of a word or a phrase. So, we might want to limit them to some small number. If the amount of space is small like 1/8 of a fullwidth like Apple does, we might be able to say ok to create a space for some edge cases.

xfq · 2024-01-18T06:04:10Z

Based on our discussion in yesterday's clreq teleconference, we think it would be useful to make this behaviour language-dependent because of the difference in conventions between Chinese and Japanese.

For example, in Japanese, it's normal to have extra spacing before "12" but not after "%" in the phrase "永永永12%永永永". However, in Chinese there's extra spacing after "%".

kojiishi · 2024-01-18T09:43:02Z

@xfq Thanks for the info. I haven't checked with JLREQ folks, but I don't think this is language dependent. If the text is "永永永12%永永永" then I believe Japanese expects spacing after "%" too.

The complexity of handling punctuation and symbols is that it depends on the context, but supporting longer context slows down the layout engine quite severely.

Imagine "永永永12%永永永" and "永永永X%永永永" with the CSS text-autospace: ideograph-numeric. Ideally, I hope you agree, we want the spacing after "%" for the first case but not for the second. Doing this requires more context than the two adjacent characters, and this could be longer, such as "永永永minimum-maximum%永永永". They could also appear alone, such as when asking "how many % is this?" ("何%ですか?" in Japanese).
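One possible reading of the "more context" point, as a hedged Python sketch: to decide whether "%" followed by an ideograph gets spacing, the engine has to look backwards past any symbols to the nearest letter or digit. The `SYMBOLS` set, the function names, and the narrowed notion of "ideograph" (CJK Unified Ideographs plus kana) are all assumptions made for the example, not anything defined in the spec or in an implementation.

```python
SYMBOLS = set("%-")  # hypothetical set of context-dependent symbols


def is_ideograph(ch: str) -> bool:
    # Simplification: CJK Unified Ideographs plus the Hiragana/Katakana blocks.
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"


def resolve_symbol(text: str, i: int) -> tuple[str, int]:
    """Classify the symbol at text[i] from what precedes it.

    Returns (cls, steps): cls is 'numeric', 'alpha' or 'none';
    steps is the number of characters inspected.
    """
    steps = 0
    j = i - 1
    while j >= 0:
        steps += 1
        ch = text[j]
        if ch in SYMBOLS:  # keep scanning through further symbols
            j -= 1
            continue
        if is_ideograph(ch):
            break
        if ch.isdigit():
            return "numeric", steps
        if ch.isalpha():
            return "alpha", steps
        break
    return "none", steps


def space_after_symbol(text: str, i: int, enabled=frozenset({"numeric"})) -> bool:
    """True if spacing would follow text[i] under the enabled classes."""
    nxt = text[i + 1] if i + 1 < len(text) else ""
    cls, _ = resolve_symbol(text, i)
    return bool(nxt) and is_ideograph(nxt) and cls in enabled


for sample in ["永永永12%永永永", "永永永X%永永永", "何%ですか?"]:
    print(sample, space_after_symbol(sample, sample.index("%")))
```

The backward scan has no fixed bound in this sketch; its length depends on how many symbols precede the one being resolved, which is where the layout cost would come from.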

It should be a bit simpler if CSS doesn't distinguish ideograph-numeric and ideograph-alpha, but even if we unite them, there are always edge cases, just as the UAX#9 Bidi Algorithm isn't always perfect.

The discussion should move to Unicode once the proposal is accepted, and I hope we can find a good balance of desired results, complexity, and performance there.

xfq · 2024-01-22T06:59:47Z

> @xfq Thanks for the info. I haven't checked with JLREQ folks, but I don't think this is language dependent. If the text is "永永永12%永永永" then I believe Japanese expects spacing after "%" too.

I got this information from w3c/jlreq#387 :

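> Bin-sensei: On the problem of spacing becoming unbalanced: in a sentence like 「これは12%です」, it is normal in Japanese to open up space before the "12" but not after the "%". So isn't it the case that being unbalanced is not automatically bad? (Bin-sensei)

Although I'm not sure which behaviour is more common / expected.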

> The complexity of handling punctuation and symbols is that it depends on the context, but supporting longer context slows down the layout engine quite severely.
>
> Imagine "永永永12%永永永" and "永永永X%永永永" with the CSS text-autospace: ideograph-numeric. Ideally, I hope you agree, we want the spacing after "%" for the first case but not for the second. Doing this requires more context than the two adjacent characters, and this could be longer, such as "永永永minimum-maximum%永永永". They could also appear alone, such as when asking "how many % is this?" ("何%ですか?" in Japanese).
>
> It should be a bit simpler if CSS doesn't distinguish ideograph-numeric and ideograph-alpha, but even if we unite them, there are always edge cases, just as the UAX#9 Bidi Algorithm isn't always perfect.

Indeed.

> The discussion should move to Unicode once the proposal is accepted, and I hope we can find a good balance of desired results, complexity, and performance there.

If this is language-dependent, it may be difficult to solve the problem at the Unicode level only. Also, if the rule is defined in a Unicode character property, it's very difficult to change.

IIRC it's on the agenda of UTC 178 this week, so let's see what the Unicode experts think about it.

kojiishi · 2024-01-24T04:41:07Z

> > @xfq Thanks for the info. I haven't checked with JLREQ folks, but I don't think this is language dependent. If the text is "永永永12%永永永" then I believe Japanese expects spacing after "%" too.
>
> I got this information from w3c/jlreq#387 :
>
> > Bin-sensei: On the problem of spacing becoming unbalanced: in a sentence like 「これは12%です」, it is normal in Japanese to open up space before the "12" but not after the "%". So isn't it the case that being unbalanced is not automatically bad? (Bin-sensei)
>
> Although I'm not sure which behaviour is more common / expected.

Thanks for the link; I missed it. I think it's more about style, not language. Probably a difference between traditional print style and online text style.

kidayasuo · 2024-02-15T07:22:06Z

According to Bin-sensei, the spacing is intended to prevent characters from being too close together, not to highlight words like parentheses do. Such 'unbalanced' situations are actually common practice in publications.

Adding the following comment on behalf of Bin-sensei:
The above comment does not, of course, preclude using a pair of spaces to highlight a word. There is nothing wrong with doing so. It just says that such usage is not a common practice.

taroyamamoto-451 · 2024-03-20T23:56:11Z

I disagree with not applying auto-spacing between a Japanese character and a Western punctuation mark. I believe it's not a matter of visual "balance" but a matter of consistency. In fact, as far as I remember, Morisawa-Linotype's CORA5-E text composition language, designed for the Linotype CRT/laser typesetters used by Japanese professional typographers, allowed auto-spacing between a Japanese character and a Western punctuation mark. I don't mean you "must" always do it, but it is one of the widely accepted conventions in Japanese typography.

macnmm · 2024-03-21T01:33:26Z

Talking to Ned from Apple, he says their algorithm is quite involved, and allows for both compression and expansion of the default spacing, and some spacing will take the glyph ink into account as part of the logic. So, this leads me to believe that we need to approach this problem with a bit more rigor and nuance. For example:

Solve the Unicode SJIS unification issue with Variation Selectors
Verify the minimum spacing behavior variations and define spacing classes on the Unicode ranges + VSs
Define the spacing behavior patterns for each spacing class pair (what is the minimum, desired, maximum spacing amount; what are the compression and expansion conditions and priorities)
Advocate for fonts to standardize their glyph designs to conform to the Unicode + VSs, and their defined spacing behaviors
Advocate for layout engines to implement the spacing rules according to the spacing behavior variations

All this is to say that the proposal from Koji may not be sufficient to solve the Latin-to-J or Latin-to-CJK spacing issue, and that that issue is merely a single case of the generic spacing rules issue defined in JLReq or JIS X 4051 and so it should try for a higher bar from the beginning.

kidayasuo · 2024-03-28T01:45:02Z

@macnmm as repeated in the document, the proposal is not an effort to make a definite rule. It is intended to serve as a fallback default when no other information is specified by the higher level protocol.

By having a reliable base, customizations become much easier because your description can be only the diff from the default. It is a benefit of having a stable default.

kidayasuo · 2024-03-28T14:32:57Z

@macnmm I like the idea of using the variation selectors as a potential solution to the challenges posed by the unification of code points for characters that are used differently in Western texts and in Japanese, despite their inherent distinctions.

My understanding, however, is that since they are all proposed to be class "O" regardless of whether they are fullwidth or proportional, it is an orthogonal issue. Maybe I am missing something...

macnmm · 2024-04-07T03:45:52Z

It may be the proposal strayed from where I hoped it would land, but my hope is if you can define a VS with the missing SJIS width, spacing class, and vertical writing posture, you solve the Unicode unification issues for Japanese character behavior in line layout. So, I would say we push for this.

xfq added css-text-4 i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-jlreq Japanese language enablement i18n-clreq Chinese language enablement labels Oct 17, 2023

w3cbot mentioned this issue Oct 17, 2023

[css-text] Extra spacing between ideographs and non-fullwidth punctuation/symbols w3c/i18n-activity#1779

Open

aphillips added i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. and removed i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. labels Oct 26, 2023

This was referenced Nov 15, 2023

Space between Japanese and Western characters: Symbols for footnotes (*†‡◊) w3c/jlreq-d#44

Open

JLReq TF Meeting Notes - 2023-10-31 w3c/jlreq#382

Open

fantasai changed the title ~~[css-text] Extra spacing between ideographs and non-fullwidth punctuation/symbols~~ [css-text][text-spacing Extra spacing between ideographs and non-fullwidth punctuation/symbols Jan 9, 2024

fantasai changed the title ~~[css-text][text-spacing Extra spacing between ideographs and non-fullwidth punctuation/symbols~~ [css-text][text-spacing] Extra spacing between ideographs and non-fullwidth punctuation/symbols Jan 9, 2024

kojiishi mentioned this issue Feb 27, 2024

Should "%" and some symbols and punctuation characters be N? kojiishi/unicode-auto-spacing#11

Closed

This was referenced Mar 27, 2024

Should Currency Symbols be N? kojiishi/unicode-auto-spacing#12

Closed

Unicode VS and advocating fonts/layout engines kojiishi/unicode-auto-spacing#17

Closed

kojiishi mentioned this issue Jun 30, 2024

Unicode VS and advocating fonts/layout engines unicode-org/unicodetools#766

Closed

kojiishi mentioned this issue Oct 9, 2024

[css-text] Use the Unicode East Asian Auto Spacing for text-autospace #11013

Open
