New unicode-graphemes addon. #4519

Merged
merged 62 commits into from
Sep 12, 2023

Conversation

PerBothner
Contributor

Handle grapheme cluster lookup as well as wcwidth.

This is based on Unicode 15, and could replace the unicode11 addon.

See the discussion in issue #4513 - but note that this pull request does not change the BufferLine implementation (except for an extra width parameter for addCodepointToCell).

Note this combines the lookup for wcwidth and grapheme-break-property in a single trie lookup, so I'm hoping there won't be a noticeable performance cost. Please let me know what you think.
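
For illustration, here is a minimal sketch of the packed-lookup idea, assuming a unicode-trie style get() call and an illustrative bit layout (neither is necessarily the PR's actual encoding):

    // One trie lookup yields a value packing both the cell width and the
    // grapheme-break property, so wcwidth and segmentation share a single lookup.
    const WIDTH_MASK = 0x03;       // assumed: bits 0-1 hold the width (0, 1 or 2)
    const BREAK_PROP_SHIFT = 2;    // assumed: remaining bits hold the break property

    function charProperties(trie: { get(cp: number): number }, codepoint: number): number {
      return trie.get(codepoint);  // single trie access serves both consumers
    }

    function wcwidth(props: number): number {
      return props & WIDTH_MASK;
    }

    function graphemeBreakProperty(props: number): number {
      return props >> BREAK_PROP_SHIFT;
    }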

Here is a screenshot of DomTerm with the xterm.js emulator. This is in a Tauri/Wry window (WebKit-based). Firefox also works. Chrome-based browsers (including Electron) mostly work - except they don't seem to handle Korean Hangul properly. (This might be fixable by changing fonts.)
[Screenshot: xtermjs-graphemes]

@PerBothner
Contributor Author

There are some yarn test failures. I will look at them tomorrow.

@PerBothner
Contributor Author

I fixed a bunch of tests.

There are 6 failures like the following. In spite of Googling, I haven't yet figured out how to fix these.

/home/bothner/Software/xterm.js-clusters/addons/xterm-addon-unicode-graphemes/src/UnicodeGraphemesAddon.ts
  0:0  error  Parsing error: "parserOptions.project" has been set for @typescript-eslint/parser.
The file does not match your project config: addons/xterm-addon-unicode-graphemes/src/UnicodeGraphemesAddon.ts.
The file must be included in at least one of the projects provided

Except for these, yarn test is working.

Still 2 errors in yarn test-api - plus I plan to add some tests to addons/xterm-addon-unicode-graphemes/test/UnicodeGraphemesAddon.api.ts once I have that working.

@Tyriar
Member

Tyriar commented May 18, 2023

0:0 error Parsing error: "parserOptions.project" has been set for @typescript-eslint/parser.

You need to set the tsconfigs in .eslintrc.json:

xterm.js/.eslintrc.json

Lines 8 to 33 in b7d73f6

"parserOptions": {
"project": [
"demo/tsconfig.json",
"src/browser/tsconfig.json",
"src/common/tsconfig.json",
"src/headless/tsconfig.json",
"test/api/tsconfig.json",
"test/benchmark/tsconfig.json",
"addons/xterm-addon-attach/src/tsconfig.json",
"addons/xterm-addon-attach/test/tsconfig.json",
"addons/xterm-addon-canvas/src/tsconfig.json",
"addons/xterm-addon-fit/src/tsconfig.json",
"addons/xterm-addon-fit/test/tsconfig.json",
"addons/xterm-addon-ligatures/src/tsconfig.json",
"addons/xterm-addon-search/src/tsconfig.json",
"addons/xterm-addon-search/test/tsconfig.json",
"addons/xterm-addon-serialize/src/tsconfig.json",
"addons/xterm-addon-serialize/test/tsconfig.json",
"addons/xterm-addon-serialize/benchmark/tsconfig.json",
"addons/xterm-addon-unicode11/src/tsconfig.json",
"addons/xterm-addon-unicode11/test/tsconfig.json",
"addons/xterm-addon-web-links/src/tsconfig.json",
"addons/xterm-addon-web-links/test/tsconfig.json",
"addons/xterm-addon-webgl/src/tsconfig.json",
"addons/xterm-addon-webgl/test/tsconfig.json"
],
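
Presumably the new addon's tsconfig paths need to be added to that list as well, along these lines (exact entries assumed by analogy with the other addons):

    "addons/xterm-addon-unicode-graphemes/src/tsconfig.json",
    "addons/xterm-addon-unicode-graphemes/test/tsconfig.json",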

@PerBothner
Contributor Author

PerBothner commented May 18, 2023

You need to set the tsconfigs in .eslintrc.json

Thanks. yarn test now works.

I added the following to "ignorePatterns", because these files come from a separate unicode-properties project, which is in turn based on external projects:

    "addons/xterm-addon-unicode-graphemes/src/tiny-inflate.ts",
    "addons/xterm-addon-unicode-graphemes/src/unicode-trie.ts",
    "addons/xterm-addon-unicode-graphemes/src/UnicodeProperties.ts",

@PerBothner
Contributor Author

Next problem: yarn test-api emits this:

1) UnicodeGraphemesAddon
       wcwidth V15 emoji test:
     page.evaluate: ReferenceError: UnicodeGraphemesAddon is not defined
    at eval (eval at evaluate (:178:30), <anonymous>:1:22)
    at eval (<anonymous>)
    at UtilityScript.evaluate (<anonymous>:178:30)
    at UtilityScript.<anonymous> (<anonymous>:1:44)
      at /home/bothner/Software/xterm.js-clusters/addons/xterm-addon-unicode-graphemes/test/UnicodeGraphemesAddon.api.ts:34:16
      at /home/bothner/Software/xterm.js-clusters/addons/xterm-addon-unicode-graphemes/out-test/UnicodeGraphemesAddon.api.js:8:71
      at __awaiter (addons/xterm-addon-unicode-graphemes/out-test/UnicodeGraphemesAddon.api.js:4:12)
      at Context.<anonymous> (addons/xterm-addon-unicode-graphemes/test/UnicodeGraphemesAddon.api.ts:33:43)

I might have to figure this out by myself - but if someone has an insight it would be appreciated.

(I'm also seeing some registerLinkProvider failures, but they also fail under an unmodified master clone.)

@Tyriar
Member

Tyriar commented May 18, 2023

page.evaluate: ReferenceError: UnicodeGraphemesAddon is not defined

The playwright tests use the demo; this is about exposing the addon in client.ts:

xterm.js/demo/client.ts

Lines 204 to 214 in 913cb25

if (document.location.pathname === '/test') {
  window.Terminal = Terminal;
  window.AttachAddon = AttachAddon;
  window.FitAddon = FitAddon;
  window.SearchAddon = SearchAddon;
  window.SerializeAddon = SerializeAddon;
  window.Unicode11Addon = Unicode11Addon;
  window.LigaturesAddon = LigaturesAddon;
  window.WebLinksAddon = WebLinksAddon;
  window.WebglAddon = WebglAddon;
} else {
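
Presumably the new addon has to be imported and exposed there the same way, roughly (import path assumed from the addon's package name):

    import { UnicodeGraphemesAddon } from 'xterm-addon-unicode-graphemes';
    // ...
    window.UnicodeGraphemesAddon = UnicodeGraphemesAddon;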

@PerBothner
Contributor Author

At this point I think this is ready for merge - or at least feedback.
It works with the demo and yarn test-api, and I implemented some tests in UnicodeGraphemesAddon.api.ts.

I'd be curious to run a performance comparison. What is the recommended way to do that?

@Tyriar
Member

Tyriar commented May 18, 2023

Good to see it passing 👍. It's a pretty beefy PR so it might take me a couple of weeks to get to.

@jerch would know the best way to benchmark

@PerBothner
Contributor Author

No hurry. It does mean I may want to put the BufferLine changes on hold since there is some overlap, mainly in the InputHandler logic. Though maybe not - the overlap is likely to be limited to a small area.

@jerch
Member

jerch commented May 19, 2023

@PerBothner For a perf test you can use/alter https://github.com/xtermjs/xterm.js/blob/master/test/benchmark/Terminal.benchmark.ts, and then run it with

$> yarn benchmark out-test/benchmark/Terminal.benchmark.js

You can also alter the command to run above in before --> spawn (I used that to test different stress patterns in the past).
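
A rough sketch of that kind of alteration, assuming the benchmark spawns its command via node-pty in a before hook (file name and stress content invented for illustration):

    import * as pty from 'node-pty';

    let ptyProcess: pty.IPty;
    before(async () => {
      // Replace `ls -lR /usr/lib` with a grapheme-heavy stress pattern,
      // e.g. dumping a file full of emoji / Hangul / combining marks.
      ptyProcess = pty.spawn('cat', ['grapheme-stress.txt'], { cols: 80, rows: 25 });
    });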

@jerch
Member

jerch commented May 19, 2023

@PerBothner
This is a quite impressive PR, thanks for looking into that nasty unicode stuff. It def. will need some time to go over it.

Imho the idea to have this first as an addon is the right choice, which gives us time to level out rough edges (if there are any, no clue yet). I was really skeptical whether this could be put into an addon with a reasonable effort, so thanks for doing the hard work. ❤️
Still in the long run I'd love to see this as a core addition, once we are done polishing things.

@jerch
Member

jerch commented May 19, 2023

@PerBothner The perf test won't run by just altering the script; it is in fact quite involved to get it working (this is mainly because of the repo structure, where addons are not seen by the base path lookups...). Still got it running - if you want, I can commit the needed changes into your branch.

Overall the grapheme segmentation performs really well 🚀 :

  • current master
     Context "out-test/benchmark/Terminal.benchmark.js"
        Context "Terminal: ls -lR /usr/lib"
           Context "write/string/async"
              Case "#1" : 5 runs - average throughput: 26.67 MB/s
           Context "write/Utf8/async"
              Case "#1" : 5 runs - average throughput: 29.92 MB/s
    
  • with graphemes enabled
     Context "out-benchmark/benchmark/SerializeAddon.benchmark.js"
        Context "Terminal: ls -lR /usr/lib"
           Context "write/string/async"
              Case "#1" : 5 runs - average throughput: 16.46 MB/s
           Context "write/Utf8/async"
              Case "#1" : 5 runs - average throughput: 21.77 MB/s
    

so overall perf degrades by ~30% for UTF8 and ~40% for JS string input. That's def. not as bad as I feared, so I won't block this just for perf reasons (yes, I would have done that if the perf penalty was >2 times). There are prolly still some optimizations possible, and I hope we can get these numbers below 30%.

So let's not get sidetracked too much by perf ideas here; it is imho already good enough to follow your impl route and polish interfaces and test cases first.


On a sidenote (@Tyriar) - it seems we already lost throughput over time for some reason. When I did the initial buffer rework, the perf test was at ~38 MB/s for UTF8 and ~32 MB/s for JS strings, as far as I remember.

@PerBothner
Contributor Author

It seems plausible that the table lookup in unicode-trie could be sped up using WASM. I don't know enough about WASM to know if the JavaScript/WASM transition adds significant overhead; if so, one could perhaps batch the property lookups into a single WASM call for each print call.

Another possibly-worthwhile simple optimization might be to bypass the trie lookup for ASCII (0020-007E) or below 0300 (with a few exceptions, just to complicate things).
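
A minimal sketch of such a fast path, using the threshold from the comment and illustrative names (ASCII_PROPS and trie are assumed to exist):

    function charPropertiesFast(codepoint: number): number {
      // Printable ASCII (0020-007E): width 1, no interesting break property.
      if (codepoint >= 0x20 && codepoint < 0x7f) {
        return ASCII_PROPS;          // assumed precomputed constant
      }
      return trie.get(codepoint);    // fall back to the full trie lookup
    }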

@jerch
Member

jerch commented May 19, 2023

@PerBothner Yeah, the wasm speedup will most likely be limited by the granularity we can call it with, and by what additional data moving is needed per step.
From my experience with wasm in the image addon - I gained most of the speedup there by calling into wasm rarely with bigger batches, and by avoiding data copies where possible (almost all my wasm impls there use borrow mechanisms at the output stage). With this I typically get a 2 to 4 times speedup compared to a highly optimized JS implementation - the sixel decoder is ~3.5 times faster, the base64 decoder ~2.5 times (all non-SIMD code). With data copies on both wasm IO ends you can roughly subtract 1 from these numbers, thus wasm no longer looks so good if a lot of data juggling is involved.

So while this is def. worth a shot, I would not prioritize it yet, before we have settled the inner workings and have a reliable test base (wasm is quite cumbersome when it comes to testing).

@PerBothner
Contributor Author

The data for width+category fits in 8 bits. I also note that UnicodeV6 and UnicodeV11 use a 65536-element Uint8Array for the BMP. There is no reason UnicodeGraphemeProvider can't do the same, making charProperties basically a simple O(1) lookup in the Uint8Array.

Beyond the BMP we could use unicode-trie or perhaps a binary search in a Uint32Array (23 bits character range start + 8 bits value).
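
A sketch of that two-tier lookup, with illustrative names and packing (range start in the upper bits, 8-bit value in the lower bits):

    const bmpProps = new Uint8Array(65536);        // filled from the Unicode tables
    let highRanges: Uint32Array;                   // sorted: (range start << 8) | value

    function charProperties(codepoint: number): number {
      if (codepoint < 0x10000) {
        return bmpProps[codepoint];                // O(1) for the BMP
      }
      // Binary search for the last range starting at or below the codepoint.
      let lo = 0, hi = highRanges.length - 1;
      while (lo < hi) {
        const mid = (lo + hi + 1) >> 1;
        if ((highRanges[mid] >>> 8) <= codepoint) lo = mid; else hi = mid - 1;
      }
      return highRanges[lo] & 0xff;
    }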

But I agree: let's nail down the API before worrying about further optimization.

@jerch
Member

jerch commented May 20, 2023

To move things forward I suggest we check if and how we can activate the addon and grapheme handling during the browser API tests. This would give us at least basic regression testing against the top-level API. This should be possible if we rework the API tests to bootstrap their terminal instance from a common module, where the grapheme addon is already injected.

It will still miss several core tests done at the unit-testing level on the nodeJS side, but I am unsure if we can get it hooked in there with a reasonable effort. This is a bit of a downside of the addon abstraction - addons get mixed in late, so they cannot easily be injected into existing tests at the unit level (it still might be possible for the Terminal.test.ts module).

Why bother with already existing tests? Mainly for 3 reasons:

  • spot regressions - if we want this to be the normal operation mode at some point in the future, we basically have to make sure that it does not break things at other ends
  • avoid writing tests all over again - we have a quite sophisticated test battery, but due to the addon nature, it currently runs without grapheme handling being active (the tests currently just run as before)
  • get to working tests faster - I hope that by injecting grapheme handling into existing tests we get a reliable test battery much faster, and thus get back to API shaping and even perf considerations much faster (e.g. wasm testing gets much easier once regressions are covered by tests)

@jerch
Member

jerch commented May 20, 2023

About the API tests - for tests using openTerminal, grapheme handling can easily be injected:

diff --git a/test/api/TestUtils.ts b/test/api/TestUtils.ts
index 3fd0648e..904c0d12 100644
--- a/test/api/TestUtils.ts
+++ b/test/api/TestUtils.ts
@@ -46,6 +46,11 @@ export async function timeout(ms: number): Promise<void> {
 export async function openTerminal(page: playwright.Page, options: ITerminalOptions & ITerminalInitOnlyOptions = {}): Promise<void> {
   await page.evaluate(`window.term = new Terminal(${JSON.stringify({ allowProposedApi: true, ...options })})`);
   await page.evaluate(`window.term.open(document.querySelector('#terminal-container'))`);
+  await page.evaluate(`
+      window.unicode = new UnicodeGraphemesAddon();
+      window.term.loadAddon(window.unicode);
+      window.term.unicode.activeVersion = '15-graphemes';
+    `);
   await page.waitForSelector('.xterm-rows');
 }

This is just quickly hacked in to see how it could be done. For better integration testing it prolly will need a conditional here, whether to run the API tests with or without graphemes (this could be done in the CI file as an env var).
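
For instance, the injection above could be gated on an env var set in the CI file (variable name invented for illustration):

    // In TestUtils.openTerminal: only inject the addon when requested.
    if (process.env.TEST_GRAPHEMES === '1') {
      await page.evaluate(`
        window.unicode = new UnicodeGraphemesAddon();
        window.term.loadAddon(window.unicode);
        window.term.unicode.activeVersion = '15-graphemes';
      `);
    }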

Also I have not checked yet whether all API tests call TestUtils.openTerminal before running; still, only one test fails here:

  1) Unicode11Addon
       wcwidth V11 emoji test:

      AssertionError: expected [ '6', '11', '15-graphemes' ] to deeply equal [ '6', '11' ]
      + expected - actual

       [
         "6"
         "11"
      -  "15-graphemes"
       ]

(It tests unicode provider registration, but does not know about the new provider yet, thus it should have failed, hurray 😸)

@jerch
Member

jerch commented May 20, 2023

@PerBothner Should I do a PR against your branch, so you can add the perf test and the API test changes?

@PerBothner
Contributor Author

No opinion - whatever you think is reasonable.

@jerch
Member

jerch commented May 20, 2023

Created PR: PerBothner#2

@PerBothner
Contributor Author

How much longer do we wait for @DHowett? (I also sent him an email, which he hasn't responded to. Perhaps @Tyriar could make another attempt to contact him?)

I updated the branch from upstream. GitHub reports some weird test-suite failures, including lint running out of memory. (yarn lint works fine locally.)

@Tyriar
Member

Tyriar commented Sep 11, 2023

You're right, I can have that discussion internally and unblock merging this if we just mark it as experimental in the readme.

@Tyriar
Member

Tyriar commented Sep 11, 2023

I'll have a look at the CI failures, OOM is quite surprising.

@Tyriar
Member

Tyriar commented Sep 11, 2023

@jerch I'll do a once-over of the core files. Any concerns on your end about getting this merged, given we have a warning in the readme now?

@jerch
Member

jerch commented Sep 11, 2023

@Tyriar If there are no surprising changes in the newer commits (I only did a quick peek), I think it is good to go as it is.

Edit: It would be really nice to get some feedback from @DHowett, esp. if he is in contact with the consortium or working closely with Unicode authorities.

@Tyriar
Member

Tyriar commented Sep 11, 2023

@jerch I'll bring it up in our next monthly sync

@Tyriar Tyriar added this to the 5.4.0 milestone Sep 11, 2023
@Tyriar
Member

Tyriar commented Sep 11, 2023

@PerBothner fixed the CI issues:

  • I guess eslint just takes a lot of memory and we hit the threshold; that was fixed by increasing node's memory limit a003034
  • The API/playwright test failures were just because we weren't uploading the build artifacts of the new addon 1b5756c

Sorry about all the conflicts, we have a pretty sweet CI setup now though 😅

@Tyriar Tyriar dismissed jerch’s stale review September 12, 2023 20:21

Has approval to merge from @jerch

@Tyriar
Member

Tyriar commented Sep 12, 2023

Merging! We'll set up publishing #4797

@Tyriar Tyriar merged commit bdb25c2 into xtermjs:master Sep 12, 2023
18 checks passed
@PerBothner
Contributor Author

Thanks! Next I will "rebase" the BufferLine re-write to use the grapheme-cluster API.

@Tyriar
Member

Tyriar commented Sep 12, 2023

@PerBothner 👌, I think I mentioned it before, but we'll want to chat about the high-level ideas in the proposed buffer line rewrite before investing too much time, to make sure we're all on the same page.

@PerBothner
Contributor Author

@Tyriar "we'll want to chat about the high level ideas in the proposed buffer line rewrite"

I agree. There was some discussion in issue #4513, but it was mixed in with the clustering issue.
I suggest closing #4513 and either creating a new Issue or a "work-in-progress" Pull Request. The former probably makes more sense, as the current state of my branch needs quite a bit of work to be usable.

This issue is pretty important to me. I am seriously considering switching the DomTerm engine to xterm.js for various reasons (mainly performance and community). However, DomTerm has a very ambitious vision and already has a lot of features that xterm.js doesn't have. (Hence a lot of DomTerm features would have to be ported or re-implemented.) The existing BufferLine representation is simple but quite limited, and not a good match for some of the things I want to do, including things that DomTerm currently supports.

@Tyriar
Member

Tyriar commented Sep 12, 2023

@PerBothner closed, yes please create the issue and we can chat there. It's a bit tough to filter through all this info quickly so it would be good to separate the discussions.

The main goals on my end for the buffer system are fast access and tight packing of the data. The major things I'd like for the buffer that come to mind are deferred clean up/reflow of the arrays (#4232) and leveraging fewer, larger array buffers to store the data (I think this might be something you were after too). We also have the decoration system, which adds a lot of extensibility; it is slower but very flexible.

@PerBothner
Contributor Author

FWIW DomTerm already does deferred re-flow. (Partly because re-flow is fairly expensive in the general case, which may include non-trivial pretty-printing and variable-width fonts.) I will cover re-flow in my proposal.

I'll write up the design, goals, and features in a new Issue, hopefully later today or possibly tomorrow.

@PerBothner
Contributor Author

See new issue #4800 about a BufferLine re-design.
