
Optimizations wrt. long documents #92

Open
TinoDidriksen opened this issue Aug 28, 2024 · 0 comments

TinoDidriksen commented Aug 28, 2024

As discussed with @snomos and @Trondtr today, here are some optimizations that my version has, and that are needed:

1) Ability to only check the selected paragraphs


This is important if you're working on a particular part of a long document, and don't want to see markings for other parts.

Technically, it looks at the current selection (or the cursor position if there is no selection) and expands that range to encompass all paragraphs touched by it. It works on paragraphs because corrections need context, and since there is no way to determine sentence boundaries without a full analysis anyway, the paragraph is the smallest usable chunk.
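The expansion step above can be sketched roughly like this. This is a minimal illustration assuming a plain-text model where paragraphs are separated by "\n"; a real add-in would use the host API's paragraph ranges instead, and the function name is made up for the example:

```javascript
// Expand a selection range [selStart, selEnd) to whole-paragraph boundaries.
// Hypothetical helper; assumes paragraphs are "\n"-separated plain text.
function expandToParagraphs(text, selStart, selEnd) {
  // Walk back from the selection start to just after the previous break.
  const start = text.lastIndexOf("\n", Math.max(0, selStart - 1)) + 1;
  // Walk forward from the selection end to the next break (or end of text).
  let end = text.indexOf("\n", selEnd);
  if (end === -1) end = text.length;
  return [start, end];
}
```

A collapsed cursor (selStart === selEnd) naturally expands to the single paragraph it sits in.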

2) Don't recheck checked paragraphs - cache them

If a paragraph has been checked and there are no changes to it, there is no reason to send it to the backend again.

Technically, every paragraph is stored alongside the backend result for that paragraph, keyed on a hash of the paragraph. Then, when preparing the payload of what to send to the backend, if the hash is in the cache, skip appending this paragraph to the payload.

When parsing the result from the backend, fill in the missing paragraphs from the cache. It's important that they are still parsed as if they came from the backend, because the user still wants to see the markings.

3) Asynchronous progress in chunks of 1 KiB

In order to get near-instantaneous results and let the user take action as soon as possible, break up the payload to the backend into chunks of at most 1 KiB (or larger if it's a fast backend). This also avoids many timeout issues.

Technically, it works just as described: the cycle goes sendTexts -> parseResult -> sendTexts -> parseResult -> ... until there are no more paragraphs to be sent. Notice this plays well with the caching from point 2.

We show a progress bar underneath the current markings, so people can see more is coming. We use 4 KiB chunks for GrammarSoft backends, but I've found 1 KiB is max for the Greenlandic backend to feel responsive, and I bet that holds for the other FST-based backends.
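The chunking and send cycle described above can be sketched as follows. Assumptions: checkParagraphs, sendTexts, parseResult, and onProgress are illustrative stand-ins, not the actual add-in functions, and Buffer.byteLength implies a Node-style environment:

```javascript
// Group paragraphs into chunks of at most maxBytes (1 KiB by default).
// A single paragraph larger than maxBytes still goes out as its own chunk.
function chunkParagraphs(paragraphs, maxBytes = 1024) {
  const chunks = [];
  let current = [];
  let size = 0;
  for (const p of paragraphs) {
    const bytes = Buffer.byteLength(p, "utf8");
    if (current.length > 0 && size + bytes > maxBytes) {
      chunks.push(current);
      current = [];
      size = 0;
    }
    current.push(p);
    size += bytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// sendTexts -> parseResult -> sendTexts -> ... until nothing is left,
// reporting progress after each chunk so the progress bar can advance.
async function checkParagraphs(paragraphs, sendTexts, parseResult, onProgress) {
  const chunks = chunkParagraphs(paragraphs);
  for (let i = 0; i < chunks.length; i++) {
    parseResult(await sendTexts(chunks[i]));
    onProgress((i + 1) / chunks.length);
  }
}
```

Because each chunk's markings are parsed before the next chunk is sent, the first results appear almost immediately even in a long document.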

Other notes while I remember

  • Showing all markings in the sidebar at once is maybe bad. I can imagine that doesn't scale when working with large documents.
  • No way to add a word to a user dictionary. Though, you don't currently have a user system at all, so fair 'nuff. But that is something Greenlandic users have asked for - even for company-wide dictionaries.
  • If you don't have a warning for Unicode combining characters, I recommend you add one.
  • MS Outlook implementation: https://github.com/GrammarSoft/proofing-gasmso/blob/master/shared/js/impl-outlook.js
  • Adobe InDesign implementation coming soon.
  • The vast majority of UI and code can be shared across add-ins, and even loaded from the same HTTPS source, so that most changes can be done without needing to go via Google or Microsoft's approval.