Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RichText: internal value refactoring idea #29870

Open
ellatrix opened this issue Mar 15, 2021 · 11 comments
Open

RichText: internal value refactoring idea #29870

ellatrix opened this issue Mar 15, 2021 · 11 comments
Assignees
Labels
[Feature] Rich Text Related to the Rich Text component that allows developers to render a contenteditable [Package] Rich text /packages/rich-text [Type] Discussion For issues that are high-level and not yet ready to implement.

Comments

@ellatrix
Copy link
Member

Currently, RichText's internal value has separated text from formatting, which is great for searching the text and easily applying formatting. The formatting data is a sparse array that is filled at the positions where the text is respectively formatted. For example, if text is formatted from position 2-5 in bold and from position 2-10 with a link, then you'll find arrays in the sparse array from position 2-10 that includes the bold format from 2-5 and the link format from 2-10.

This format is great for manipulating the data, but not so great for serialising it back to a tree format (DOM or HTML). In order to do so, you need to go over each position one by one and build up a tree based on what formats you have.

How can we make that better?

One way could be to store formats with ranges in a Map, for example { type: 'bold' }: [ 2, 5 ]. We can then make a custom iterator that gives us the formats in the right order for a tree: ordered by range start and then the length of the range.

Now we no longer have to go over each position to build up a tree. We can add slices of text in one go as we go over formats in order.

Format APIs should also be straightforward to implement. applyFormat( value, format, start, end ) is just a matter of adding the format to the Map. For slice it would mean splitting the intersecting formats at a certain position. Some operations become easier, some a bit harder, but that's ok. The area where we should optimise for speed is converting to/from a tree, not the value operations, which run significantly less often.

Thoughts?

@ellatrix ellatrix added [Feature] Rich Text Related to the Rich Text component that allows developers to render a contenteditable [Package] Rich text /packages/rich-text [Type] Discussion For issues that are high-level and not yet ready to implement. labels Mar 15, 2021
@dmsnell
Copy link
Member

dmsnell commented Mar 21, 2021

@ellatrix its exciting to see you bring your attention to this. RichText is awesome.

Do we have any hard data making clear the performance implications for this? Is it a nice to have? A should have always had it? or a necessary escape from a current bottleneck?

We’ve used the range tree on WordPress.com notifications without hassle but those are also very small documents that are essentially read-once.

My experience is that range trees are convenient for building DOM and inconvenient for editing, as the burden lies on transforming an edited tree to prevent overlapping ranges and improperly nested elements. The current strategy is trivial in this regard because, by construction, we cannot have overlapping regions.

Have you built any prototype to compare against?

@ellatrix
Copy link
Member Author

ellatrix commented Mar 22, 2021

My experience is that range trees are convenient for building DOM and inconvenient for editing, as the burden lies on transforming an edited tree to prevent overlapping ranges and improperly nested elements. The current strategy is trivial in this regard because, by construction, we cannot have overlapping regions.

That's right. The current data structure is designed for making it easy to insert, split, remove etc. Actions where we need to reset the start and end values of the formats and objects.

There's only two things that need to be fast in RichText though: inserting a character on input (typing) and rebuilding a tree from that.

  1. We're not even directly inserting a character on input right now, and I don't see this ever changing. Right now, we look at the DOM the browser created, and then build a new value from the DOM, which is fast now and which would stay fast with the proposed structure.
    • Let's say we do want to insert a character though. It should just be a matter of incrementing all ranges after a certain index?
  2. Building a tree from the value to re-apply it to the DOM is currently the slowest thing in RichText because we looping over each character in the value and slowly building a tree from that. If formats and objects are stored as ranges, we just have to loop over these in the right order and append text in batches.

We need to rebuild a tree on every keystroke, so this aspect is quite important for RichText performance, especially for larger amounts of texts like a long list (which is currently entirely handled by RichText).

Do we have any hard data making clear the performance implications for this?

No, it's just based on the above.

Have you built any prototype to compare against?

I haven't build anything so far, this is just me dumping my ideas for the record.

@ellatrix
Copy link
Member Author

We’ve used the range tree on WordPress.com notifications without hassle but those are also very small documents that are essentially read-once.

What's a range tree? My proposal isn't a tree at all. We'd keep the flat text and have a Map that stores formats-range pairs.

@dmsnell
Copy link
Member

dmsnell commented Mar 23, 2021

Let's say we do want to insert a character though. It should just be a matter of incrementing all ranges after a certain index?

I'm not sure this is a great assumption. If the character falls at the boundary we might have to compute whether or not to extend or shift ranges. This might also be something we currently have to deal with anyway.

Building a tree from the value to re-apply it to the DOM is currently the slowest thing in RichText because we looping over each character in the value and slowly building a tree from that. If formats and objects are stored as ranges, we just have to loop over these in the right order and append text in batches.

We might be doing something very inefficient with this. The current model works best at maintaining text models, but you're reminding me that we do everything on every change, otherwise we wouldn't be able to update the editor. Even if we make the change wouldn't we still be going full circle on every input event?

  • parse from the DOM to update the model
  • print the model to generate the new VDOM/DOM

What's a range tree? My proposal isn't a tree at all. We'd keep the flat text and have a Map that stores formats-range pairs.

a Map that stores formats and range pair might end up identical to a de-optimized tree. if we're looking at formats with ranges…

{
  'em': [{49, 51}],
  'strong': [{1, 3}, {48, 53}]
}

…then we might find quickly-staggering complexity to sort them all out. I had assumed you were discussing an ordered set of ranges for in-editor updates since those can be somewhat optimized, especially for lookup.

{'ranges': [
  {1, 3, [], 'strong'},
  {48, 53, [{49, 51, [], 'em'}], 'strong'}
]}

are you saying we read from the DOM itself inside the RichText editor whenever running onInput (or similar callbacks)? there could be a much more significant overhead in jumping between DOM contexts than on handling formats. I'd be interesting to find some good profiling on this and see where the costs spread out.

@ellatrix
Copy link
Member Author

The structure I'm proposing stores ranges by format instance, not by format type.

{
  { type: 'core/italic' }: [ 49, 51 ],
  { type: 'core/bold' }: [ 1, 3 ],
  { type: 'core/bold' }: [ 48, 53 ],
}

If you enter a character at index 2, then all numbers after index 2 must be increased by 1. That seems simple enough. The same thing needs to happen at any index? Increase all numbers after the index, leave the rest alone?

I'd have to make a quick prototype to see if there's any problems with this.

I've checked the performance of current rich text, and it's fairly good, but that's all for small pieces of rich text which is what we have most of the time in Gutenberg. Actually, the performance of the block editor is many times worse than rich text. If you type in a random paragraph, the time it takes to process rich text is way smaller than the time it takes to process the update for the block editor. If we want to do any significant performance improvement, we should probably be looking at the data stores (cc @youknowriad).

I suspect the proposed structure here will only really affect larger rich text values, such as a big list, and I purely made this issue from my desire to make rich text as good as it can be. :) So the priority here is not so high, but I think it would be fun to figure out what the best structure would be.

@dmsnell
Copy link
Member

dmsnell commented Mar 24, 2021

One thing to be careful with this structure is overlapping regions. That's a big problem for HTML representations too. What does this mean?

{
  { type: 'core/italic' }: [1, 8],
  { type: 'core/bold' }: [5, 13]
}

At least at its face that this cannot map back to HTML without transforming back into a proper tree.

@ellatrix
Copy link
Member Author

It's possible to do that, but it needs to be split: 0<em>1...4<strong>5...8</strong></em><strong>9...13</strong>.

We have the same issue with the current data structure, though it's handled in applyFormat.

For the new structure, em will be added first, then strong, then all nested tags (strong) need to be closed before we can close em, then reopen any tags that are still open according to the ranges.

Sure, it's something extra to handle, but it doesn't seem too crazy.

@dmsnell
Copy link
Member

dmsnell commented Mar 25, 2021

Sure, it's something extra to handle, but it doesn't seem too crazy.

total agreement. though the question I have is whether after doing this we will still have the performance advantage over the current structure, since converting back to the tree is not a trivial operation.

would love to see benchmarks

@dmsnell
Copy link
Member

dmsnell commented Aug 13, 2022

We need to make sure we are very explicit in defining what string indexes count.

Possibilities:

  • UTF-16 code units
  • UTF-8 code units
  • Unicode code points
  • Extended grapheme clusters

Let's apply a format to the female pilot; how do we get the range?

JavaScript

const s = ' 👩‍✈️ ';

// utf16 code units is how JS counts characters/strings
start = s.indexOf( '👩‍✈️' )
end = start + '👩‍✈️'.length
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 6 ] ] } };

const toUTF8 = ( s, index ) => {
	s = s.slice( 0, index )
	s = encodeURIComponent( s )
        s = s.replace( /%../g, '.' )
	return s.length;
}
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 14 ] ] } };

const toCodePointCounts = ( s, index ) => {
	return [ ...( s.slice( 0, index ) ) ].length;
}
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 5 ] ] } };

const toGraphemeClusterCounts = ( s, index ) => {
	s = s.slice( 0, index )
	return someLibrary.clusterCount( s )
}
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 2 ] ] } };

@youknowriad
Copy link
Contributor

I think you've managed to solve this right @ellatrix what do you think is left here?

@ellatrix
Copy link
Member Author

@youknowriad Actually no, this is still open. I simply haven't had the time to finish the PR, it's not been a priority. I do want to get to it when I find some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Rich Text Related to the Rich Text component that allows developers to render a contenteditable [Package] Rich text /packages/rich-text [Type] Discussion For issues that are high-level and not yet ready to implement.
Projects
None yet
Development

No branches or pull requests

3 participants