Character width seems to be "random" on complex Unicode sequences #5047

Closed
Cl00e9ment opened this issue May 1, 2022 · 11 comments
@Cl00e9ment

For the sake of clarity, here is the nomenclature that I'm using:

  • scalar value: A Unicode unit. Each scalar value is assigned a code point and is encoded using 1 to 4 bytes in UTF-8.
  • grapheme cluster: What the end user calls a character, which is rendered as a single glyph. A grapheme cluster is made of at least one scalar value.
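
To make the two terms concrete, here is a small Python sketch (not part of the original report) that counts scalar values and UTF-8 bytes for the three strings used in the examples below. It assumes that iterating a Python `str` yields one item per scalar value, which holds for these strings. Counting grapheme clusters needs a segmentation library and is not attempted here; the `samples` dictionary and the `describe` helper are names made up for illustration.

```python
# Each string below is a single grapheme cluster made of several scalar values.
samples = {
    "g + combining diaeresis": "\u0067\u0308",
    "sun + variation selector-16": "\u2600\ufe0f",
    "Hangul jamo (initial + vowel + final)": "\u1100\u1161\u11a8",
}


def describe(text):
    """Return (number of scalar values, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


for name, text in samples.items():
    scalars, size = describe(text)
    print(f"{name}: {scalars} scalar values, {size} UTF-8 bytes")
```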

Describe the bug
Some grapheme clusters are rendered in a single cell while others take multiple cells, sometimes leaving a large blank.

To Reproduce
example 1:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 1
echo -e "0123456789\n>\u0067\u0308<"
Screenshot from 2022-05-01 21-10-42
This is what is expected to happen: 1 grapheme cluster = 1 cell.

example 2:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 2
echo -e "0123456789\n>\u2600\ufe0f<"
Screenshot from 2022-05-01 21-13-49
This is not what I would expect (1 grapheme cluster = 1 cell), but maybe it's normal behavior. If it is, is there a set of rules that I can use to determine the number of cells that a grapheme cluster will take when rendered?

example 3:
nb of grapheme clusters: 1
nb of scalar values: 3
nb of cells used for rendering: 4
echo -e "0123456789\n>\u1100\u1161\u11A8<"
Screenshot from 2022-05-01 21-18-15
This doesn't make any sense.

Environment details

kitty 0.24.4 created by Kovid Goyal
Linux workstation 5.15.32-1-MANJARO #1 SMP PREEMPT Mon Mar 28 09:16:36 UTC 2022 x86_64
Manjaro Linux 5.15.32-1-MANJARO  (workstation) (/dev/tty)


DISTRIB_ID=ManjaroLinux
DISTRIB_RELEASE=21.2.6
DISTRIB_CODENAME=Qonos
DISTRIB_DESCRIPTION="Manjaro Linux"
Running under: X11
Frozen: False
Paths:
  kitty: /usr/bin/kitty
  base dir: /usr/lib/kitty
  extensions dir: /usr/lib/kitty/kitty
  system shell: /bin/zsh

Config options different from defaults:

Important environment variables seen by the kitty process:
	PATH                                /home/clement/.local/bin:/usr/local/bin:/usr/bin:/var/lib/snapd/snap/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/clement/.cargo/bin
	LANG                                en_US.UTF-8
	EDITOR                              nvim
	SHELL                               /bin/zsh
	DISPLAY                             :0
	USER                                clement
	XDG_MENU_PREFIX                     gnome-
	LC_ADDRESS                          fr_FR.UTF-8
	LC_NAME                             fr_FR.UTF-8
	LC_MONETARY                         fr_FR.UTF-8
	XDG_SESSION_DESKTOP                 gnome-xorg
	XDG_SESSION_TYPE                    x11
	LC_PAPER                            fr_FR.UTF-8
	XDG_CURRENT_DESKTOP                 GNOME
	XDG_SESSION_CLASS                   user
	LC_IDENTIFICATION                   fr_FR.UTF-8
	LC_TELEPHONE                        fr_FR.UTF-8
	LC_MEASUREMENT                      fr_FR.UTF-8
	XDG_RUNTIME_DIR                     /run/user/1000
	LC_TIME                             fr_FR.UTF-8
	XDG_DATA_DIRS                       /home/clement/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop
	LC_NUMERIC                          fr_FR.UTF-8
@Cl00e9ment Cl00e9ment added the bug label May 1, 2022
@kovidgoyal
Owner

The width in cells of grapheme clusters comes from the Unicode standard. Your example 2 is a variation selector changing the presentation of the preceding codepoint from text to emoji. Emoji are rendered in two cells in terminals. I have no clue about Hangul, so I can't explain your last example to you. You would need to ask the makers of the Unicode standard. See gen-wcwidth.py in kitty for how the functions to determine width are generated from the standard.
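
As a rough illustration of the variation-selector behavior described above, the following Python sketch widens a narrow base character to two cells when it is followed by U+FE0F. This is an assumption-laden simplification, not kitty's code: the function name `width_with_vs16` is hypothetical, any narrow base is widened (a real implementation would presumably restrict this to characters that have an emoji variation sequence), and combining marks are not handled.

```python
import unicodedata

VS16 = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation


def width_with_vs16(text):
    """Sum cell widths, widening a narrow base that is followed by VS16."""
    total = 0
    for i, ch in enumerate(text):
        if ch == VS16:
            continue  # the selector itself occupies no cell
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        followed_by_vs16 = text[i + 1:i + 2] == VS16
        total += 2 if (wide or followed_by_vs16) else 1
    return total


print(width_with_vs16("\u2600"))        # 1: text presentation
print(width_with_vs16("\u2600\ufe0f"))  # 2: emoji presentation, as in example 2
```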

@Cl00e9ment
Author

Thanks for the clarifications.

@Cl00e9ment
Author

"The width in cells of grapheme clusters come from the unicode standard."

If I'm not mistaken, Unicode doesn't specify the width of grapheme clusters. It's up to the implementation to define this. UAX #11 differentiates narrow and wide characters in East Asian text, but that's all.

So, are you sure that the third example (see below) is an issue with the Unicode standard and not a kitty bug?

echo -e "0123456789\n>\u1100\u1161\u11A8<"

I understand that gen-wcwidth.py assigns a cell width to each scalar value, but I don't see where this comes directly from the Unicode standard (apart from East Asian scalar values).

@kovidgoyal
Owner

They are width one unless they are emoji, which are width two, or combining marks, which are width zero (with some slight subtleties; read the source of gen-wcwidth.py). It's a pretty simple rule.
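
A per-scalar rule of this kind can be sketched in Python with the standard `unicodedata` module. This is only an approximation under stated assumptions, not what gen-wcwidth.py generates: marks and format characters are taken as width zero, East Asian Wide and Fullwidth characters as width two, and everything else as width one. The names `scalar_width` and `string_width` are invented for this sketch.

```python
import unicodedata


def scalar_width(ch):
    """Cells for one scalar value under the simplified rule."""
    # Nonspacing marks, enclosing marks and format characters take no cell.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_width(text):
    """wcswidth-style total: the per-scalar widths are simply added."""
    return sum(scalar_width(ch) for ch in text)


print(string_width("g\u0308"))              # 1, matches example 1
print(string_width("\u1100\u1161\u11a8"))   # 4, matches example 3
```

Because the total is a plain sum, any cluster whose parts are individually visible ends up wider than the glyph that is actually drawn.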

@Cl00e9ment
Author

Yes, but the rules defined in gen-wcwidth.py don't seem to work everywhere. The Hangul alphabet is an example of an edge case.

Maybe Hangul initial consonants should be given a width of 2, and medial vowels as well as final consonants a width of 0. That's only a suggestion, as I have no idea how Hangul works.
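
The suggestion could look roughly like the Python sketch below. It assumes that the conjoining medial vowels and final consonants are exactly the range U+1160..U+11FF of the Hangul Jamo block (the extended jamo blocks are ignored), and the helper names are hypothetical. As far as I know, Markus Kuhn's widely copied wcwidth.c treats the same range as width zero, so the idea is not new.

```python
import unicodedata


def is_jamo_vowel_or_final(ch):
    """True for conjoining medial vowels and final consonants (U+1160..U+11FF)."""
    return 0x1160 <= ord(ch) <= 0x11FF


def width_with_hangul_rule(text):
    """Per-scalar sum where trailing jamo are assumed to add no cells."""
    total = 0
    for ch in text:
        if is_jamo_vowel_or_final(ch):
            continue  # assumed to be drawn inside the initial consonant's cells
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # marks and format characters take no cell
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


print(width_with_hangul_rule("\u1100\u1161\u11a8"))  # 2
```

One known weakness of this sketch: a medial vowel or final consonant that appears without a preceding initial consonant would also be counted as zero cells.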

@Cl00e9ment
Author

OK I think that I understand what's happening.

UAX #11 classifies Hangul initial consonants (HIC) as Wide, which terminals render with a size of 2, while Hangul medial vowels (HMV) and final consonants (HFC) are not Wide and so get a size of 1 each.

The problem is that, when one HIC is followed by one HMV and optionally one HFC, they merge together to form a single grapheme cluster. The widths are added together (2 + 1 = 3 or 2 + 1 + 1 = 4) instead of using a size of 2.
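
One way to see that the merged cluster should occupy two cells is to normalize it. In the Python sketch below, NFC composes the three jamo from example 3 into a single precomposed syllable, and that syllable is East Asian Wide. This only illustrates the expected width; it is not a claim about how kitty or any terminal should compute it.

```python
import unicodedata

jamo = "\u1100\u1161\u11a8"  # initial consonant + medial vowel + final consonant
syllable = unicodedata.normalize("NFC", jamo)

print(len(jamo), len(syllable))                # 3 1
print(f"U+{ord(syllable):04X}")                # U+AC01
print(unicodedata.east_asian_width(syllable))  # W, i.e. two cells
```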

@kovidgoyal
Owner

Someone will need to codify that, then, and publish it as a standard so terminal programs can rely on it.

@Cl00e9ment
Author

I fully agree.

As a side note, this issue also affects emoji combinations that use a zero-width joiner.
Example with "face in clouds":

echo -e "0123456789\n>\U1F636\U1F32B\uFE0F<"
echo -e "0123456789\n>\U1F636\u200D\U1F32B\uFE0F<"

Screenshot from 2022-05-02 17-02-56

@kovidgoyal
Owner

That is a bug; look at the open issue about it.

@Cl00e9ment
Author

You're right, the size problem with combining emojis is a duplicate of #1978.

But beyond that, I think the rendering issue with Hangul grapheme clusters is the same bug. Even though they aren't built with a zero-width joiner like emoji combinations, the underlying problem is the same: several grapheme clusters that each render at a specific width merge, when put together, into a single grapheme cluster that takes less space than the sum of its parts.
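
For the zero-width joiner case, the difference between summing and merging can be sketched in Python as follows. The rule used here is a deliberate simplification and an assumption of this sketch: the scalar value that follows a ZWJ is taken to share the cells of what precedes it. Proper handling would need full grapheme cluster segmentation, and the function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def scalar_width(ch):
    """Simplified per-scalar width: 0 for marks/format, 2 for wide, else 1."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def width_merging_zwj(text):
    """Add widths, but let a scalar that follows a ZWJ share the previous cells."""
    total = 0
    joined = False
    for ch in text:
        if ch == ZWJ:
            joined = True
            continue
        if joined and scalar_width(ch) > 0:
            joined = False  # this scalar merges into the preceding glyph
            continue
        total += scalar_width(ch)
    return total


face_in_clouds = "\U0001F636\u200D\U0001F32B\uFE0F"
print(sum(scalar_width(ch) for ch in face_in_clouds))  # plain per-scalar sum
print(width_merging_zwj(face_in_clouds))               # 2
```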

@kovidgoyal
Owner

The difference is that for ZWJ + emoji there are well-defined rules accessible to me in the Unicode standard. For Hangul I have no clue. As I said, someone who understands Hangul will need to either codify those rules and publish them, or point out where in the standard they already exist in a form that can be converted to a wcswidth() implementation.
