character joiner api unsuitable for graphemes #4513
Comments
Wouldn't we be over extending our hand by changing how wrapping works to accommodate graphemes? Yes it will lead to a less than optimal reading experience, but it will be good enough imo without introducing all this complexity. For my own info, I see a bunch of Hangul handling, I wouldn't have thought graphemes would have anything to do with hangul as they are always double width? |
Take a look at this DomTerm screenshot (2nd screenshot on the page). Consider 'woman+zwj+woman+zwj+boy'. If you calculate the number of columns without being grapheme-aware, you get 6 columns. However, it's supposed to be 2 columns. If you do layout and line-breaking assuming it's 6 columns and then render it as 2 columns, that's not a "less than optimal reading experience" - it's wrong.
Hangul can be encoded using either the Hangul Syllables block or the Hangul Jamo block. The former are "pre-composed" but only support the common/modern syllables. Jamo are like building blocks and you can use them to construct many more forms. Similar to how Both forms of the same text should use 2 columns - but using the EastAsianWidth.txt table without grapheme handling gives 5 columns for the example shown. I suspect the expectation for "proper Korean text handling" these days means getting this right. I notice that both Gnome Terminal and KDE Konsole (but not xterm) seem to handle the Hangul example correctly - though most of the other tests are wrong. In my fork the grapheme tables and lookup are in a separate I haven't found a Unicode specification for how grapheme clusters should be mapped to narrow or wide characters, but I use the general rule that the cluster is the width of the widest character in it. Regional Indicators are special: an RI by itself is 1 column, but a pair is 2 columns. |
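To make the width rule above concrete, here is a minimal sketch. It assumes the text has already been split into grapheme clusters (segmentation is not shown), and `codePointWidth` is a hypothetical toy table with a few hand-picked ranges, not the real EastAsianWidth.txt data or the tables from the fork.

```typescript
// Toy per-code-point widths: a few hand-picked ranges, NOT real Unicode data.
function codePointWidth(cp: number): number {
  if (cp === 0x200d) return 0;                   // ZERO WIDTH JOINER
  if (cp >= 0x1100 && cp <= 0x115f) return 2;    // Hangul Jamo leading consonants
  if (cp >= 0xac00 && cp <= 0xd7a3) return 2;    // precomposed Hangul syllables
  if (isRegionalIndicator(cp)) return 1;         // a lone RI is 1 column
  if (cp >= 0x1f300 && cp <= 0x1faff) return 2;  // rough emoji range (assumption)
  return 1;                                      // everything else, incl. Jamo vowels/finals
}

function isRegionalIndicator(cp: number): boolean {
  return cp >= 0x1f1e6 && cp <= 0x1f1ff;
}

// Width of ONE grapheme cluster, given as its code points:
// the widest member wins, except that a pair of Regional Indicators is 2 columns.
function clusterWidth(cluster: number[]): number {
  if (cluster.length === 2 && cluster.every(isRegionalIndicator)) return 2;
  let width = 0;
  for (const cp of cluster) width = Math.max(width, codePointWidth(cp));
  return width;
}

// Non-cluster-aware width: simply add up every code point.
function naiveWidth(codePoints: number[]): number {
  let total = 0;
  for (const cp of codePoints) total += codePointWidth(cp);
  return total;
}

const family = [0x1f469, 0x200d, 0x1f469, 0x200d, 0x1f466]; // woman + ZWJ + woman + ZWJ + boy
console.log(naiveWidth(family), clusterWidth(family)); // 6 2
```

The same function gives 2 columns for a Hangul syllable written as three Jamo, because the leading consonant is the widest member of the cluster.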
I was under the impression that emoji composition would be handled entirely in the parser: it would compose the characters as they come in and then hand off the combined char to the buffer line? One of my concerns is that, due to the size of your branch, it will probably take a long time to get in, as my time for reviewing/testing this sort of thing is quite limited these days. Also you seem to have ripped out character joiners altogether; these are how ligatures are implemented?
I actually know a bit of Korean, I'm not sure I understand still. We seem to handle it just fine as it's the IME's job to take care of composing the character before sending it to the shell: |
That is impossible - in many cases composed characters don't exist as separate Unicode characters. Consider the regional indicators - or "woman+zwj+woman+zwj+boy" from the screenshot linked in my first message.
If I wasn't clear: My fork is nowhere ready for use. Some simple things work, but a bunch of things don't work, and some that work need to be optimized. Grapheme clusters don't quite work yet, but it's close - most of the pieces are ready. No expectation for a review anytime soon. I also want to emphasize: My changes to BufferLine aren't solely motivated by grapheme clusters: By giving up guaranteed O(1) column indexing we gain a lot of flexibility. (I can talk more about this when more is implemented.) We also get a more compact data structure.
I've ripped them out for now while getting things working. I think the existing logic (i.e. ligatures as a render-time-only functionality) can probably be re-enabled with minor changes. The main difference is that JoinedCellData would no longer be recommended for grapheme clusters, but would be suitable for width-preserving substitutions including ligatures. Another option is that ligatures can be stored in the BufferLine, which could be a lot faster, as well as supporting non-width-preserving ligatures. The more flexible BufferLine data structure could store in the buffer both the "logical character(s)" and also the "ligature to render", with flags to control which data is used in which context. About Korean: IME only handles user input - we also need to handle output. Wcwidth returns a width of 2 for base characters (0x1100 to 0x1160) and a width of 0 for "combining characters". The seems to do the right thing for the few cases I've tried, and may be enough to handle most or all cases that will appear in practice. I have no idea how robust or general it is. (I don't know Korean.) |
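A small sketch of the wcwidth-style approach described above, assuming the quoted range is read as 0x1100 up to (not including) 0x1160 and that the remaining conjoining Jamo up to 0x11FF count as "combining". `wcwidthLike` is a hypothetical stand-in that special-cases only Hangul; it is not the real wcwidth table.

```typescript
// wcwidth-style width for one BMP code unit (toy: only Hangul is special-cased).
function wcwidthLike(cu: number): number {
  if (cu >= 0x1100 && cu < 0x1160) return 2;   // Jamo leading consonants ("base characters")
  if (cu >= 0x1160 && cu <= 0x11ff) return 0;  // Jamo vowels / trailing consonants ("combining")
  if (cu >= 0xac00 && cu <= 0xd7a3) return 2;  // precomposed Hangul syllables
  return 1;
}

// Column count as a plain sum over code units (BMP-only input assumed).
function columns(text: string): number {
  let total = 0;
  for (let i = 0; i < text.length; i++) total += wcwidthLike(text.charCodeAt(i));
  return total;
}

const precomposed = "\ud55c\uae00";                    // two precomposed Hangul syllables
const jamo = "\u1112\u1161\u11ab\u1100\u1173\u11af";   // the same text as conjoining Jamo
console.log(columns(precomposed), columns(jamo));      // 4 4
```

With these widths both encodings occupy the same number of columns without any cluster detection, which is why this approach can look correct for simple cases.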
My purpose with these changes is to possibly replace the DomTerm DOM-based layout with one based on xterm.js. The idea is to combine the advantages of xterm.js (primarily performance) with as much as possible of the existing and future DomTerm functionality. That will require some major changes to xterm.js, starting with the BufferLine implementation. I expect I will have to maintain some of the changes in a fork, but the fewer changes I have relative to upstream the better for everyone. (I imagine that some of the features I'm hoping to add will be useful to both Jupyter and VsCode.) Some features that I'm hoping will be eased by the BufferLine changes:
This is all a long-term dream. I'll work a bit at a time. |
👌 I just skimmed it, I want to set some expectations that it may be tough to get some of this stuff merged, don't want to waste your time.
This is what the
I'm curious about this; it seems like it's basically as compact as it can get currently via the Uint32Array for data. We intentionally optimized for the common case where the data would fit into a Uint32. Any changes to buffer line would be best discussed in a separate issue as it seems like its own distinct addition.
Ligatures should be done purely at the renderer level since whether a set of characters are ligatures depends on the current font family.
I haven't used any CLIs that use Korean, I suspect you may be trying to support something in Korean that never actually happens in practice on the command line.
This sort of thing is why the decorations API was introduced. We use this API to implement find highlighting and line overlays like vscode's shell integration indicator and command navigation:
FYI VS Code has a pretty robust shell integration setup now, we intentionally kept it separate from xterm.js as it's custom/opinionated and relies on a script actually running. This is one of the main reasons the |
If it has to be private fork, it can still be winning for DomTerm. But of course it is always better if it can be upstreamed.
I figured that out. The
A line consisting of 80 single-width BMP characters with default attributes requires the
Absolutely. I will create a separate issue before I create a pull request. I guess I can do so now. It might make sense to wait until more of the implementation is done, though early feedback might be helpful.
I have to study the decorator API. However, I suspect my use cases aren't exactly a match for decorators (though they could build on top of it). Images/SVG/DOM as part of the output should be part of the data model: They should be included in serialization and selections (including selecting just part of HTML rich text). Navigable in view mode. Preferably, height not constrained to an integer number of rows. |
For decorations, we could relatively easily add the selections/copy by adding a delegate to fetch selection data to |
Sorry for chiming in late and not having read any of the sources mentioned here yet - I just want to point out how I intended to solve the grapheme cluster issue in a much earlier draft PR (in 2019 or 2020):
This was what I found to be in line with what most other cluster-aware TEs do. This is still very flawed in itself, but would resemble what most others do. The reason I did not go forward with that approach was mostly because of:
which I found not to be justified given the big ambiguity in the Unicode specs about how to treat a given constellation. The ambiguity mainly arises from the weak "East Asian Width" property handling (def. discouraged by Unicode), which gets worse for clusters in non-LCG script systems with dependencies on fonts and their particular glyph metrics (yes, the Unicode spec left that to font devs on purpose, which creates a nightmare for us terminals). Idk how to solve that, besides either restricting xterm.js' fonts to those that were tested to work as intended (basically every glyph would have to be tested for its width), or calling back to the Unicode Consortium to spec things more properly for monospace envs. That's the point where I gave up trying to solve that issue... I still don't know a proper solution to that issue. Unicode as specced atm cannot be supported to a satisfying level for terminals. We are in a wicked spec state here... |
That's my plan - and basically what I'm currently doing in DomTerm.
Currently planning on sticking to 1 or 2 as I don't know a need for 3 or more. (I have been thinking about non-monospace text in a TE but that is not relevant at this point.)
I'm hoping that by combining width lookup with clustering property and doing a single lookup in an efficient trie data structure the overhead will be modest - but we'll find out once it is implemented and tuned. In my prototype, the actual lookup will be opt-in, in an addon.
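The idea of a single combined lookup can be sketched roughly as follows. This is an illustrative two-stage table, not DomTerm's actual trie: the break classes, the sample ranges, the bit packing and the 256-entry block size are all assumptions chosen for the example.

```typescript
// Hypothetical subset of cluster-break classes (not the full UAX #29 list).
const BreakClass = { Other: 0, Extend: 1, ZWJ: 2, RegionalIndicator: 3, ExtPict: 4 } as const;

interface PropRange { first: number; last: number; width: number; cls: number; }
interface Trie { index: Uint16Array; blocks: Uint8Array[]; }

const BLOCK_BITS = 8;
const BLOCK_SIZE = 1 << BLOCK_BITS;
const MAX_CP = 0x110000;

// Pack column width (low 2 bits) and break class (upper bits) into one byte.
function pack(width: number, cls: number): number { return (cls << 2) | width; }

// Build a two-stage table: index[cp >> 8] selects a block; identical blocks are shared.
function buildTrie(ranges: PropRange[]): Trie {
  const flat = new Uint8Array(MAX_CP).fill(pack(1, BreakClass.Other));
  for (const r of ranges) {
    for (let cp = r.first; cp <= r.last; cp++) flat[cp] = pack(r.width, r.cls);
  }
  const index = new Uint16Array(MAX_CP >> BLOCK_BITS);
  const blocks: Uint8Array[] = [];
  const seen: { [key: string]: number | undefined } = {};
  for (let b = 0; b < index.length; b++) {
    const block = flat.slice(b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE);
    const key = block.join(",");
    let id = seen[key];
    if (id === undefined) {        // first time we see this block content
      id = blocks.length;
      seen[key] = id;
      blocks.push(block);
    }
    index[b] = id;
  }
  return { index, blocks };
}

// One lookup yields both the width and the cluster-break class.
function lookup(trie: Trie, cp: number): { width: number; cls: number } {
  const packed = trie.blocks[trie.index[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)];
  return { width: packed & 3, cls: packed >> 2 };
}

// Sample ranges for illustration only.
const trie = buildTrie([
  { first: 0x0300, last: 0x036f, width: 0, cls: BreakClass.Extend },
  { first: 0x200d, last: 0x200d, width: 0, cls: BreakClass.ZWJ },
  { first: 0x1100, last: 0x115f, width: 2, cls: BreakClass.Other },
  { first: 0x1f1e6, last: 0x1f1ff, width: 1, cls: BreakClass.RegionalIndicator },
  { first: 0x1f300, last: 0x1faff, width: 2, cls: BreakClass.ExtPict },
]);
console.log(lookup(trie, 0x1f469), trie.blocks.length);
```

Because most blocks of the code space share the same default content, only a handful of distinct blocks are stored, which is what keeps tables of this kind small.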
Is 8kB really a serious amount of space? Especially if it will be opt-in (in an addon)? |
It's fine if it's an addon imo, the main thing is we want it to be optional/lazy loadable. Similar to what we're doing with unicode. |
We had a discussion somewhere about 🗺 (U+1F5FA, world map), which renders in fonts in 2, 3 or 4 cells (yes, font dependent, really annoying). For single codepoints that's the only one I've seen so far breaking with the 1 vs 2 "convention". With clusters, widths get funny in some Indian script systems, but Idk enough about those. Ofc RTL scripts are a problem class of their own with their bidi mechs, and I think there are also combining constellations that would extend widths.
No 8kb is not a big deal at all, but still not very helpful, if we cannot lower the ambiguity to a sane level. Without that it just makes things worse (including runtime + space penalty). So that was a tradeoff decision back then. |
Mostly actioned with #4519, another issue will be open wrt the proposed buffer changes |
See issue #4800 for a discussion on changing the |
The character joiner API is invoked at render time. However, grapheme clusters need to be known no later than line breaking/overflow time, including reflow on window resize. The logical place to detect clusters would seem to be in the InputHandler print function. This can be combined with wide-character detection, using a single lookup. DomTerm uses an efficient combined trie lookup, which I have ported/converted to TypeScript (not yet checked in).
As a "heads up": I'm working on a solution for this. Unfortunately, this involves a re-design of the BufferLine implementation. On the other hand, the re-write has other benefits, which I will discuss later.
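To illustrate why the widths must be known by line-breaking time, here is a minimal wrapping sketch. `wrapClusters` is a hypothetical helper, not xterm.js code: it takes the column width of each cluster and the terminal width, and returns which clusters land on which row.

```typescript
// Assign clusters to rows given each cluster's column width and the terminal width.
// Returns, for each row, the indices of the clusters placed on it.
function wrapClusters(widths: number[], cols: number): number[][] {
  const rows: number[][] = [[]];
  let used = 0;
  for (let i = 0; i < widths.length; i++) {
    // Start a new row when the next cluster would overflow a non-empty row.
    if (used + widths[i] > cols && used > 0) {
      rows.push([]);
      used = 0;
    }
    rows[rows.length - 1].push(i);
    used += widths[i];
  }
  return rows;
}

// "a", woman+ZWJ+woman+ZWJ+boy, "b", "c" on an 8-column terminal.
const naive = wrapClusters([1, 6, 1, 1], 8);  // emoji sequence counted per code point
const aware = wrapClusters([1, 2, 1, 1], 8);  // emoji sequence counted as one cluster
console.log(naive.length, aware.length);      // 2 1
```

With per-code-point widths the last character is pushed onto a second row; with cluster widths everything fits on one. A renderer that joins the sequence afterwards cannot undo that wrap, because the break position was already decided when the text was written to the buffer.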