
Optimizations wrt. long documents #92

Open
TinoDidriksen opened this issue Aug 28, 2024 · 0 comments

TinoDidriksen commented Aug 28, 2024

As discussed with @snomos and @Trondtr today, here are some optimizations that my version has, and that are needed:

1) Ability to only check the selected paragraphs


This is important if you're working on a particular part of a long document, and don't want to see markings for other parts.

Technically, it looks at the current selection (or the cursor position if there is no selection) and expands that range to encompass all paragraphs touched by it. It works on paragraphs because corrections need context, and since there is no way to determine sentence boundaries without a full analysis anyway, the paragraph is the smallest usable chunk.
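The expansion step above can be sketched roughly like this. This is a minimal illustration assuming a plain-text model where paragraphs are separated by "\n"; a real add-in would use the host API's paragraph ranges instead, and the function name is made up for the example:

```javascript
// Expand a selection range [selStart, selEnd) to whole-paragraph boundaries.
// Hypothetical helper; assumes paragraphs are "\n"-separated plain text.
function expandToParagraphs(text, selStart, selEnd) {
  // Walk back from the selection start to just after the previous break.
  const start = text.lastIndexOf("\n", Math.max(0, selStart - 1)) + 1;
  // Walk forward from the selection end to the next break (or end of text).
  let end = text.indexOf("\n", selEnd);
  if (end === -1) end = text.length;
  return [start, end];
}
```

A collapsed cursor (selStart === selEnd) naturally expands to the single paragraph it sits in.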

2) Don't recheck checked paragraphs - cache them

If a paragraph has been checked and there are no changes to it, there is no reason to send it to the backend again.

Technically, every paragraph is stored alongside the backend result for that paragraph, keyed on a hash of the paragraph. Then, when preparing the payload of what to send to the backend, if the hash is in the cache, skip appending this paragraph to the payload.

When parsing the result from the backend, fill in the missing paragraphs from the cache. It's important that they are still parsed as if they came from the backend, because the user still wants to see the markings.

3) Asynchronous progress in chunks of 1 KiB

In order to get near-instantaneous results and let the user take action as soon as possible, break up the payload to the backend into chunks of at most 1 KiB (or larger if it's a fast backend). This also avoids many timeout issues.

Technically, it works just as described: the cycle goes sendTexts -> parseResult -> sendTexts -> parseResult -> ... until there are no more paragraphs to be sent. Notice this plays well with the caching from point 2.

We show a progress bar underneath the current markings, so people can see more is coming. We use 4 KiB chunks for GrammarSoft backends, but I've found 1 KiB is max for the Greenlandic backend to feel responsive, and I bet that holds for the other FST-based backends.
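The chunking and send cycle described above can be sketched as follows. Assumptions: checkParagraphs, sendTexts, parseResult, and onProgress are illustrative stand-ins, not the actual add-in functions, and Buffer.byteLength implies a Node-style environment:

```javascript
// Group paragraphs into chunks of at most maxBytes (1 KiB by default).
// A single paragraph larger than maxBytes still goes out as its own chunk.
function chunkParagraphs(paragraphs, maxBytes = 1024) {
  const chunks = [];
  let current = [];
  let size = 0;
  for (const p of paragraphs) {
    const bytes = Buffer.byteLength(p, "utf8");
    if (current.length > 0 && size + bytes > maxBytes) {
      chunks.push(current);
      current = [];
      size = 0;
    }
    current.push(p);
    size += bytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// sendTexts -> parseResult -> sendTexts -> ... until nothing is left,
// reporting progress after each chunk so the progress bar can advance.
async function checkParagraphs(paragraphs, sendTexts, parseResult, onProgress) {
  const chunks = chunkParagraphs(paragraphs);
  for (let i = 0; i < chunks.length; i++) {
    parseResult(await sendTexts(chunks[i]));
    onProgress((i + 1) / chunks.length);
  }
}
```

Because each chunk's markings are parsed before the next chunk is sent, the first results appear almost immediately even in a long document.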

Other notes while I remember

  • Showing all markings in the sidebar at once is maybe bad. I can imagine that doesn't scale when working with large documents.
  • No way to add a word to a user dictionary. Though, you don't currently have a user system at all, so fair 'nuff. But that is something Greenlandic users have asked for - even for company-wide dictionaries.
  • If you don't have a warning for Unicode combining characters, I recommend you add one.
  • MS Outlook implementation: https://github.com/GrammarSoft/proofing-gasmso/blob/master/shared/js/impl-outlook.js
  • Adobe InDesign implementation coming soon.
  • The vast majority of UI and code can be shared across add-ins, and even loaded from the same HTTPS source, so that most changes can be done without needing to go via Google or Microsoft's approval.