character width seems to be "random" on complex Unicode sequences #5047
Comments
The width in cells of a grapheme cluster comes from the Unicode standard. Your example 2 is a variation selector changing the presentation of the preceding code point from text to emoji. Emoji are rendered in two cells in terminals. I have no clue about Hangul, so I can't explain your last example to you. You would need to ask the makers of the Unicode standard. See gen-wcwidth.py in kitty for how the functions that determine width are generated from the standard.
Thanks for the clarifications.
If I'm not mistaken, Unicode doesn't specify the width of grapheme clusters. It's up to the implementation to define this. UAX #11 differentiates narrow and wide characters in East Asian text, but that's all. So, are you sure that the third example (see below) is an issue with the Unicode standard rather than a Kitty bug?
I understand that gen-wcwidth.py assigns a cell width to each scalar value, but I don't see where this comes directly from the Unicode standard (apart from East Asian scalar values).
They are width one unless they are emoji or combining marks which are
Yes, but the rules defined in gen-wcwidth.py don't seem to work everywhere. The Hangul alphabet is an example of an edge case. Maybe Hangul initial consonants should be given a size of 2, and medial vowels as well as final consonants a size of 0. That's only a suggestion, as I have no idea how Hangul works.
OK, I think I understand what's happening. UAX #11 gives a size of 2 to Hangul initial consonants (HIC) and a size of 1 to both Hangul medial vowels (HMV) and final consonants (HFC). The problem is that when one HIC is followed by one HMV and optionally one HFC, they merge to form a single grapheme cluster. The widths are added together (2 + 1 = 3 or 2 + 1 + 1 = 4) instead of the cluster being given a size of 2.
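The arithmetic above can be reproduced with a minimal sketch that sums a width per code point. This is not kitty's generated table: `naive_width` is a hypothetical helper that only assumes East Asian Width classes W and F take two cells, combining marks take none, and everything else takes one.

```python
import unicodedata


def naive_width(text: str) -> int:
    """Sum a per-code-point cell width (an illustrative rule, not kitty's)."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks add no cell of their own
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total


print(naive_width("\u1100\u1161"))        # initial consonant + medial vowel
print(naive_width("\u1100\u1161\u11a8"))  # ... + final consonant
```

Under these assumptions the two calls give 3 and 4, because the medial vowel and final consonant are neither wide nor combining marks, so each one adds a cell.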
Someone will need to codify that then. And publish it as a standard so
I fully agree. As a side note, this issue also affects emoji combinations built with a zero-width joiner.
That is a bug; look at the open issue about it.
You're right, the size problem with combined emoji is a duplicate of #1978. But beyond that, I think the rendering issue with Hangul grapheme clusters is the same bug. Even though they aren't built with a zero-width joiner like emoji combinations, the underlying problem is the same: several grapheme clusters that are each rendered with a specific size merge, when put together, into a single grapheme cluster that takes less space than the sum of its parts.
The difference is that for zwj + emoji there are well-defined rules accessible to me in the Unicode standard. For Hangul I have no clue. As I said, someone who understands Hangul will either need to codify those rules and publish them, or point out where in the standard they already exist in a form that can be converted to a wcswidth() implementation.
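For illustration only, the suggestion made earlier in the thread (initial consonants at 2, medial vowels and final consonants at 0) could be sketched as follows. This is one possible rule, not something taken from kitty or from the Unicode standard; `cell_width` and `is_trailing_jamo` are hypothetical names, and the range U+1160..U+11FF is assumed to cover the medial vowels and final consonants of the Hangul Jamo block.

```python
import unicodedata


def is_trailing_jamo(ch: str) -> bool:
    """True for Hangul Jamo medial vowels and final consonants (assumed range)."""
    return 0x1160 <= ord(ch) <= 0x11FF


def cell_width(text: str) -> int:
    """Per-code-point width where trailing jamo contribute no cells."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch) or is_trailing_jamo(ch):
            continue  # rendered inside the cell(s) of the preceding character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total


print(cell_width("\u1100\u1161\u11a8"))  # example 3 from the report
```

With this rule, example 3 measures 2 cells. The sketch has known gaps: a medial vowel with no preceding initial consonant would measure 0, and the additional jamo in the Hangul Jamo Extended blocks are not handled.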
For the sake of understanding, here's the nomenclature that I'm using:
Describe the bug
Some grapheme clusters are rendered in a single cell while others take multiple cells, sometimes leaving a large blank gap.
To Reproduce
example 1:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 1
echo -e "0123456789\n>\u0067\u0308<"
This is what is expected to happen: 1 grapheme cluster = 1 cell.
example 2:
nb of grapheme clusters: 1
nb of scalar values: 2
nb of cells used for rendering: 2
echo -e "0123456789\n>\u2600\ufe0f<"
This is not what I would expect (1 grapheme cluster = 1 cell), but maybe it's normal behavior. If it is, is there a set of rules I can use to determine the number of cells that a grapheme cluster will take when rendered?
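As a rough illustration of the rule described in the first reply (U+FE0F switches the preceding code point to emoji presentation, and emoji take two cells), here is a minimal sketch. `width_with_vs16` is a hypothetical helper, not kitty's implementation, and it widens any one-cell character followed by U+FE0F, whereas a real implementation would presumably check that the pair is a valid emoji variation sequence.

```python
import unicodedata

VS16 = "\ufe0f"  # VARIATION SELECTOR-16: requests emoji presentation


def width_with_vs16(text: str) -> int:
    """Per-code-point width with a simplified VS16 widening rule."""
    total = 0
    prev = 0  # cells contributed by the most recent base character
    for ch in text:
        if ch == VS16:
            if prev == 1:
                total += 1  # widen the preceding narrow character to 2 cells
                prev = 2
            continue  # the selector itself occupies no cell
        if unicodedata.combining(ch):
            continue
        prev = 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        total += prev
    return total


print(width_with_vs16("\u2600"))        # text presentation
print(width_with_vs16("\u2600\ufe0f"))  # emoji presentation, as in example 2
```

Under these assumptions the bare U+2600 measures 1 cell and the sequence from example 2 measures 2.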
example 3:
nb of grapheme clusters: 1
nb of scalar values: 3
nb of cells used for rendering: 4
echo -e "0123456789\n>\u1100\u1161\u11A8<"
This doesn't make any sense.
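To see what each of the three sequences contains, the following sketch lists every code point with its name and East Asian Width class using Python's `unicodedata` module, then reports the length after NFC normalization. The `examples` dictionary and `describe` helper are illustrative names; the output depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

examples = {
    "example 1": "\u0067\u0308",
    "example 2": "\u2600\ufe0f",
    "example 3": "\u1100\u1161\u11a8",
}


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, name, East Asian Width class) for each scalar value."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.east_asian_width(ch),
        )
        for ch in text
    ]


for label, text in examples.items():
    print(label)
    for code, name, eaw in describe(text):
        print(f"  {code}  {eaw:<2} {name}")
    composed = unicodedata.normalize("NFC", text)
    print(f"  scalar values after NFC: {len(composed)}")
```

For example 3, NFC composes the three jamo into the single precomposed syllable U+AC01, whose East Asian Width class is W. That is consistent with the expectation that the cluster should occupy 2 cells rather than 4.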
Environment details