character width seems to be "random" on complex Unicode sequences #5047

Clément · 2022-05-01T19:47:26Z

For the sake of clarity, here's the terminology I'm using:

scalar value: any Unicode code point except a surrogate. Each scalar value is encoded using 1 to 4 bytes in UTF-8.
grapheme cluster: what the end user calls a character, usually rendered as a single glyph. A grapheme cluster is made of at least one scalar value.
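
To make the distinction concrete, here is a minimal Python sketch that counts scalar values and UTF-8 bytes for the three strings used in the examples of this report. The `describe` helper and the `samples` dictionary are names made up for this illustration. Counting grapheme clusters is left out on purpose, because the standard library has no grapheme segmentation.

```python
# Strings from the three examples in this report.
samples = {
    "example 1": "\u0067\u0308",        # g + COMBINING DIAERESIS
    "example 2": "\u2600\ufe0f",        # BLACK SUN WITH RAYS + VARIATION SELECTOR-16
    "example 3": "\u1100\u1161\u11a8",  # three conjoining Hangul jamo
}


def describe(text: str) -> tuple:
    """Return (number of scalar values, number of UTF-8 bytes) for text."""
    # len() on a Python str counts code points, which are scalar values
    # here because none of the samples contain lone surrogates.
    return len(text), len(text.encode("utf-8"))


for label, text in samples.items():
    scalars, nbytes = describe(text)
    print(f"{label}: {scalars} scalar values, {nbytes} UTF-8 bytes")
```

Each sample is a single grapheme cluster from the user's point of view, even though the scalar and byte counts differ.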

Describe the bug
Some grapheme clusters are rendered in a single cell while others take multiple cells, sometimes leaving a large blank gap.

To Reproduce
example 1:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 1
echo -e "0123456789\n>\u0067\u0308<"

This is what is expected to happen: 1 grapheme cluster = 1 cell.

example 2:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 2
echo -e "0123456789\n>\u2600\ufe0f<"

This is not what I would expect (1 grapheme cluster = 1 cell), but maybe it's normal behavior. If it is, is there a set of rules I can use to determine the number of cells a grapheme cluster will take when rendered?

example 3:
nb of grapheme clusters: 1
nb of scalar values: 3
nb of cells used for rendering: 4
echo -e "0123456789\n>\u1100\u1161\u11A8<"

This doesn't make any sense.

Environment details

kitty 0.24.4 created by Kovid Goyal
Linux workstation 5.15.32-1-MANJARO #1 SMP PREEMPT Mon Mar 28 09:16:36 UTC 2022 x86_64
Manjaro Linux 5.15.32-1-MANJARO  (workstation) (/dev/tty)


DISTRIB_ID=ManjaroLinux
DISTRIB_RELEASE=21.2.6
DISTRIB_CODENAME=Qonos
DISTRIB_DESCRIPTION="Manjaro Linux"
Running under: X11
Frozen: False
Paths:
  kitty: /usr/bin/kitty
  base dir: /usr/lib/kitty
  extensions dir: /usr/lib/kitty/kitty
  system shell: /bin/zsh

Config options different from defaults:

Important environment variables seen by the kitty process:
	PATH                                /home/clement/.local/bin:/usr/local/bin:/usr/bin:/var/lib/snapd/snap/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/clement/.cargo/bin
	LANG                                en_US.UTF-8
	EDITOR                              nvim
	SHELL                               /bin/zsh
	DISPLAY                             :0
	USER                                clement
	XDG_MENU_PREFIX                     gnome-
	LC_ADDRESS                          fr_FR.UTF-8
	LC_NAME                             fr_FR.UTF-8
	LC_MONETARY                         fr_FR.UTF-8
	XDG_SESSION_DESKTOP                 gnome-xorg
	XDG_SESSION_TYPE                    x11
	LC_PAPER                            fr_FR.UTF-8
	XDG_CURRENT_DESKTOP                 GNOME
	XDG_SESSION_CLASS                   user
	LC_IDENTIFICATION                   fr_FR.UTF-8
	LC_TELEPHONE                        fr_FR.UTF-8
	LC_MEASUREMENT                      fr_FR.UTF-8
	XDG_RUNTIME_DIR                     /run/user/1000
	LC_TIME                             fr_FR.UTF-8
	XDG_DATA_DIRS                       /home/clement/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop
	LC_NUMERIC                          fr_FR.UTF-8


kovidgoyal · 2022-05-02T01:45:57Z

The width in cells of a grapheme cluster comes from the Unicode standard. Your example 2 contains a variation selector that changes the presentation of the preceding code point from text to emoji, and emoji are rendered in two cells in terminals. I have no clue about Hangul, so I can't explain your last example to you; you would need to ask the makers of the Unicode standard. See gen-wcwidth.py in kitty for how the functions that determine width are generated from the standard.
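
The variation selector behavior can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the logic in gen-wcwidth.py: `width_with_vs16` is a hypothetical helper, it widens any base followed by U+FE0F without checking that the pair is a valid emoji variation sequence, and it ignores combining marks.

```python
import unicodedata

VS16 = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation


def width_with_vs16(text: str) -> int:
    """Hypothetical helper: count cells, widening a base followed by VS16 to 2."""
    total = 0
    for i, ch in enumerate(text):
        if ch == VS16:
            continue  # the selector itself occupies no cell
        followed_by_vs16 = text[i + 1 : i + 2] == VS16
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if (wide or followed_by_vs16) else 1
    return total


print(width_with_vs16("\u2600"))        # text presentation
print(width_with_vs16("\u2600\ufe0f"))  # emoji presentation
```

Under these assumptions the bare U+2600 takes one cell and the same code point followed by U+FE0F takes two, which matches the behavior reported in example 2.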

Clément · 2022-05-02T06:15:20Z

Thanks for the clarifications.

Clément · 2022-05-02T12:47:58Z

> The width in cells of grapheme clusters come from the unicode standard.

If I'm not mistaken, Unicode doesn't specify the width of grapheme clusters; it's up to the implementation to define it. UAX #11 differentiates narrow and wide characters in East Asian text, but that's all.

So, are you sure that the third example (see below) is an issue with the Unicode standard and not a kitty bug?

echo -e "0123456789\n>\u1100\u1161\u11A8<"

I understand that gen-wcwidth.py assigns a cell width to each scalar value, but I don't see where this comes directly from the Unicode standard (apart from East Asian scalar values).

kovidgoyal · 2022-05-02T13:46:14Z

They are width one unless they are emoji, which are width two, or combining marks, which are width zero (with some slight subtleties; read the source of gen-wcwidth.py). It's a pretty simple rule.
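
As a rough illustration of that rule, the Python sketch below assigns a width to each code point and sums the results. It is an approximation built on the standard `unicodedata` module, not the table generated by gen-wcwidth.py: `naive_cell_width` and `naive_string_width` are hypothetical names, and "emoji" is approximated here by the East Asian Width classes W and F.

```python
import unicodedata


def naive_cell_width(ch: str) -> int:
    """Rough per-code-point width; NOT the table generated by gen-wcwidth.py."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # nonspacing and enclosing combining marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters, including most emoji
    return 1


def naive_string_width(text: str) -> int:
    """Sum the per-code-point widths, ignoring grapheme cluster boundaries."""
    return sum(naive_cell_width(ch) for ch in text)


print(naive_string_width("g\u0308"))             # g + combining diaeresis
print(naive_string_width("\u1100\u1161\u11a8"))  # three conjoining Hangul jamo
```

Summing per code point gives 1 for example 1 and 4 for example 3, which is consistent with the cell counts reported at the top of this issue.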

Clément · 2022-05-02T14:31:14Z

Yes, but the rules defined in gen-wcwidth.py don't seem to work everywhere. The Hangul alphabet is an example of an edge case.

Maybe Hangul initial consonants should be given a size of 2, and medial vowels as well as final consonants should be given a size of 0. That's only a suggestion, as I've no idea of how Hangul works.

Clément · 2022-05-02T14:56:54Z

OK I think that I understand what's happening.

UAX #11 classifies Hangul initial consonants (HIC) as Wide, which gives them a size of 2, while Hangul medial vowels (HMV) and final consonants (HFC) are Neutral and end up with a size of 1.

The problem is that, when one HIC is followed by one HMV and optionally one HFC, they merge together to form a single grapheme cluster. The widths are added together (2 + 1 = 3 or 2 + 1 + 1 = 4) instead of using a size of 2.
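
One possible way to avoid the addition is sketched below in Python: jamo that extend the current syllable contribute no extra cells. This is an assumption-laden illustration, not a published standard and not kitty's code. `jamo_kind` and `jamo_aware_width` are hypothetical names, only the U+1100 to U+11FF Hangul Jamo block is handled (the extended jamo blocks and precomposed syllables followed by jamo are ignored), and the joining conditions loosely follow rules GB6 to GB8 of UAX #29.

```python
import unicodedata


def jamo_kind(ch: str) -> str:
    """Classify a code point of the Hangul Jamo block as L, V or T, else ''."""
    if "\u1100" <= ch <= "\u115f":
        return "L"  # initial consonant
    if "\u1160" <= ch <= "\u11a7":
        return "V"  # medial vowel
    if "\u11a8" <= ch <= "\u11ff":
        return "T"  # final consonant
    return ""


def jamo_aware_width(text: str) -> int:
    """Count cells, letting jamo that extend a syllable add no width."""
    total = 0
    prev = ""  # kind of the previous code point
    for ch in text:
        kind = jamo_kind(ch)
        joins = (
            (kind == "L" and prev == "L")
            or (kind == "V" and prev in ("L", "V"))
            or (kind == "T" and prev in ("V", "T"))
        )
        if not joins:
            wide = unicodedata.east_asian_width(ch) in ("W", "F")
            total += 2 if wide else 1
        prev = kind
    return total


decomposed = "\u1100\u1161\u11a8"
precomposed = unicodedata.normalize("NFC", decomposed)  # single syllable U+AC01
print(jamo_aware_width(decomposed), jamo_aware_width(precomposed))
```

With this approach the decomposed sequence from example 3 and its NFC-composed form both come out at two cells.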

kovidgoyal · 2022-05-02T15:00:15Z

Someone will need to codify that, then, and publish it as a standard so terminal programs can rely on it.

Clément · 2022-05-02T15:12:47Z

I fully agree.

As a side note, this issue also affects emoji combinations that use a zero width joiner.
Example with "face in clouds":

echo -e "0123456789\n>\U1F636\U1F32B\uFE0F<"
echo -e "0123456789\n>\U1F636\u200D\U1F32B\uFE0F<"
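
To see exactly which scalar values the two commands emit, a short Python sketch can list them with their Unicode names. `scalar_names` is a hypothetical helper written for this illustration; note that Python string literals need the 8-digit `\U0001F636` form, unlike the shell's `echo -e`.

```python
import unicodedata

sequences = {
    "without ZWJ": "\U0001F636\U0001F32B\uFE0F",
    "with ZWJ": "\U0001F636\u200D\U0001F32B\uFE0F",
}


def scalar_names(text: str) -> list:
    """Return 'U+XXXX NAME' for every scalar value in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


for label, text in sequences.items():
    print(label)
    for entry in scalar_names(text):
        print("  " + entry)
```

The only difference between the two sequences is the ZERO WIDTH JOINER between the two emoji, which is what asks the renderer to merge them into one glyph.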

kovidgoyal · 2022-05-02T16:02:19Z

That is a bug; look at the open issue about it.

Clément · 2022-05-02T17:07:39Z

You're right, the size problem with combining emojis is a duplicate of #1978.

But beyond that, I think the rendering issue with Hangul grapheme clusters is the same bug. Even though they aren't built with a zero width joiner like emoji combinations, the underlying problem is the same: grapheme clusters that each render at a specific size on their own merge, when put together, into a single grapheme cluster that takes less space than the sum of its parts.

kovidgoyal · 2022-05-03T02:27:12Z

The difference is that for ZWJ + emoji there are well-defined rules accessible to me in the Unicode standard. For Hangul I have no clue. As I said, someone who understands Hangul will either need to codify those rules and publish them, or point out where in the standard they already exist in a form that can be converted into a wcswidth() implementation.

Clément added the bug label May 1, 2022

kovidgoyal closed this as completed May 2, 2022
