RichText: internal value refactoring idea #29870

ellatrix · 2021-03-15T12:40:57Z

Currently, RichText's internal value has separated text from formatting, which is great for searching the text and easily applying formatting. The formatting data is a sparse array that is filled at the positions where the text is respectively formatted. For example, if text is formatted from position 2-5 in bold and from position 2-10 with a link, then you'll find arrays in the sparse array from position 2-10 that includes the bold format from 2-5 and the link format from 2-10.

This format is great for manipulating the data, but not so great for serialising it back to a tree format (DOM or HTML). In order to do so, you need to go over each position one by one and build up a tree based on what formats you have.

How can we make that better?

One way could be to store formats with ranges in a Map, for example { type: 'bold' }: [ 2, 5 ]. We can then make a custom iterator that gives us the formats in the right order for a tree: ordered by range start and then the length of the range.

Now we no longer have to go over each position to build up a tree. We can add slices of text in one go as we go over formats in order.

Format APIs should also be straightforward to implement. applyFormat( value, format, start, end ) is just a matter of adding the format to the Map. For slice it would mean splitting the intersecting formats at a certain position. Some operations become easier, some a bit harder, but that's ok. The area where we should optimise for speed is converting to/from a tree, not the value operations, which run significantly less often.

Thoughts?

The text was updated successfully, but these errors were encountered:

dmsnell · 2021-03-21T21:32:55Z

@ellatrix its exciting to see you bring your attention to this. RichText is awesome.

Do we have any hard data making clear the performance implications for this? Is it a nice to have? A should have always had it? or a necessary escape from a current bottleneck?

We’ve used the range tree on WordPress.com notifications without hassle but those are also very small documents that are essentially read-once.

My experience is that range trees are convenient for building DOM and inconvenient for editing, as the burden lies on transforming an edited tree to prevent overlapping ranges and improperly nested elements. The current strategy is trivial in this regard because, by construction, we cannot have overlapping regions.

Have you built any prototype to compare against?

ellatrix · 2021-03-22T14:48:58Z

My experience is that range trees are convenient for building DOM and inconvenient for editing, as the burden lies on transforming an edited tree to prevent overlapping ranges and improperly nested elements. The current strategy is trivial in this regard because, by construction, we cannot have overlapping regions.

That's right. The current data structure is designed for making it easy to insert, split, remove etc. Actions where we need to reset the start and end values of the formats and objects.

There's only two things that need to be fast in RichText though: inserting a character on input (typing) and rebuilding a tree from that.

We're not even directly inserting a character on input right now, and I don't see this ever changing. Right now, we look at the DOM the browser created, and then build a new value from the DOM, which is fast now and which would stay fast with the proposed structure.
- Let's say we do want to insert a character though. It should just be a matter of incrementing all ranges after a certain index?
Building a tree from the value to re-apply it to the DOM is currently the slowest thing in RichText because we looping over each character in the value and slowly building a tree from that. If formats and objects are stored as ranges, we just have to loop over these in the right order and append text in batches.

We need to rebuild a tree on every keystroke, so this aspect is quite important for RichText performance, especially for larger amounts of texts like a long list (which is currently entirely handled by RichText).

Do we have any hard data making clear the performance implications for this?

No, it's just based on the above.

Have you built any prototype to compare against?

I haven't build anything so far, this is just me dumping my ideas for the record.

ellatrix · 2021-03-22T14:52:30Z

We’ve used the range tree on WordPress.com notifications without hassle but those are also very small documents that are essentially read-once.

What's a range tree? My proposal isn't a tree at all. We'd keep the flat text and have a Map that stores formats-range pairs.

dmsnell · 2021-03-23T01:14:53Z

Let's say we do want to insert a character though. It should just be a matter of incrementing all ranges after a certain index?

I'm not sure this is a great assumption. If the character falls at the boundary we might have to compute whether or not to extend or shift ranges. This might also be something we currently have to deal with anyway.

Building a tree from the value to re-apply it to the DOM is currently the slowest thing in RichText because we looping over each character in the value and slowly building a tree from that. If formats and objects are stored as ranges, we just have to loop over these in the right order and append text in batches.

We might be doing something very inefficient with this. The current model works best at maintaining text models, but you're reminding me that we do everything on every change, otherwise we wouldn't be able to update the editor. Even if we make the change wouldn't we still be going full circle on every input event?

parse from the DOM to update the model
print the model to generate the new VDOM/DOM

What's a range tree? My proposal isn't a tree at all. We'd keep the flat text and have a Map that stores formats-range pairs.

a Map that stores formats and range pair might end up identical to a de-optimized tree. if we're looking at formats with ranges…

{
  'em': [{49, 51}],
  'strong': [{1, 3}, {48, 53}]
}

…then we might find quickly-staggering complexity to sort them all out. I had assumed you were discussing an ordered set of ranges for in-editor updates since those can be somewhat optimized, especially for lookup.

{'ranges': [
  {1, 3, [], 'strong'},
  {48, 53, [{49, 51, [], 'em'}], 'strong'}
]}

are you saying we read from the DOM itself inside the RichText editor whenever running onInput (or similar callbacks)? there could be a much more significant overhead in jumping between DOM contexts than on handling formats. I'd be interesting to find some good profiling on this and see where the costs spread out.

ellatrix · 2021-03-24T09:04:55Z

The structure I'm proposing stores ranges by format instance, not by format type.

{
  { type: 'core/italic' }: [ 49, 51 ],
  { type: 'core/bold' }: [ 1, 3 ],
  { type: 'core/bold' }: [ 48, 53 ],
}

If you enter a character at index 2, then all numbers after index 2 must be increased by 1. That seems simple enough. The same thing needs to happen at any index? Increase all numbers after the index, leave the rest alone?

I'd have to make a quick prototype to see if there's any problems with this.

I've checked the performance of current rich text, and it's fairly good, but that's all for small pieces of rich text which is what we have most of the time in Gutenberg. Actually, the performance of the block editor is many times worse than rich text. If you type in a random paragraph, the time it takes to process rich text is way smaller than the time it takes to process the update for the block editor. If we want to do any significant performance improvement, we should probably be looking at the data stores (cc @youknowriad).

I suspect the proposed structure here will only really affect larger rich text values, such as a big list, and I purely made this issue from my desire to make rich text as good as it can be. :) So the priority here is not so high, but I think it would be fun to figure out what the best structure would be.

dmsnell · 2021-03-24T21:58:16Z

One thing to be careful with this structure is overlapping regions. That's a big problem for HTML representations too. What does this mean?

{
  { type: 'core/italic' }: [1, 8],
  { type: 'core/bold' }: [5, 13]
}

At least at its face that this cannot map back to HTML without transforming back into a proper tree.

ellatrix · 2021-03-25T07:28:57Z

It's possible to do that, but it needs to be split: 01...45...89...13.

We have the same issue with the current data structure, though it's handled in applyFormat.

For the new structure, em will be added first, then strong, then all nested tags (strong) need to be closed before we can close em, then reopen any tags that are still open according to the ranges.

Sure, it's something extra to handle, but it doesn't seem too crazy.

dmsnell · 2021-03-25T17:59:39Z

Sure, it's something extra to handle, but it doesn't seem too crazy.

total agreement. though the question I have is whether after doing this we will still have the performance advantage over the current structure, since converting back to the tree is not a trivial operation.

would love to see benchmarks

dmsnell · 2022-08-13T17:01:11Z

We need to make sure we are very explicit in defining what string indexes count.

Possibilities:

UTF-16 code units
UTF-8 code units
Unicode code points
Extended grapheme clusters

Let's apply a format to the female pilot; how do we get the range?

JavaScript

const s = ' 👩‍✈️ ';

// utf16 code units is how JS counts characters/strings
start = s.indexOf( '👩‍✈️' )
end = start + '👩‍✈️'.length
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 6 ] ] } };

const toUTF8 = ( s, index ) => {
	s = s.slice( 0, index )
	s = encodeURIComponent( s )
        s = s.replace( /%../g, '.' )
	return s.length;
}
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 14 ] ] } };

const toCodePointCounts = ( s, index ) => {
	return [ ...( s.slice( 0, index ) ) ].length;
}
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 5 ] ] } };

const toGraphemeClusterCounts = ( s, index ) => {
	s = s.slice( 0, index )
	return someLibrary.clusterCount( s )
}
JSON = { text: ' 👩‍✈️ ', formats: { 'bold': [ [ 1, 2 ] ] } };

youknowriad · 2024-10-24T16:44:46Z

I think you've managed to solve this right @ellatrix what do you think is left here?

ellatrix · 2024-10-25T16:38:35Z

@youknowriad Actually no, this is still open. I simply haven't had the time to finish the PR, it's not been a priority. I do want to get to it when I find some time.

ellatrix added [Feature] Rich Text Related to the Rich Text component that allows developers to render a contenteditable [Package] Rich text /packages/rich-text [Type] Discussion For issues that are high-level and not yet ready to implement. labels Mar 15, 2021

ellatrix self-assigned this May 21, 2021

ellatrix mentioned this issue Sep 6, 2021

RichText: internal value refactor #34359

Draft

7 tasks

ellatrix mentioned this issue Oct 22, 2021

Split control for URL and Text within Link UI #33849

Merged

7 tasks

ellatrix mentioned this issue Aug 25, 2022

Rich text: log deprecation message for multiline prop #43613

Merged

ellatrix added this to @ellatrix's backlog Oct 17, 2022

ellatrix mentioned this issue Sep 15, 2023

Rich text: refactor toTree #54512

Draft

This was referenced Nov 13, 2023

RichText: pass value to store #43204

Merged

Rich text: remove OBJECT_REPLACEMENT_CHARACTER and deprecated getTextContent #56111

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RichText: internal value refactoring idea #29870

RichText: internal value refactoring idea #29870

ellatrix commented Mar 15, 2021

dmsnell commented Mar 21, 2021

ellatrix commented Mar 22, 2021 •

edited

Loading

ellatrix commented Mar 22, 2021

dmsnell commented Mar 23, 2021

ellatrix commented Mar 24, 2021

dmsnell commented Mar 24, 2021

ellatrix commented Mar 25, 2021

dmsnell commented Mar 25, 2021

dmsnell commented Aug 13, 2022

youknowriad commented Oct 24, 2024

ellatrix commented Oct 25, 2024

RichText: internal value refactoring idea #29870

RichText: internal value refactoring idea #29870

Comments

ellatrix commented Mar 15, 2021

dmsnell commented Mar 21, 2021

ellatrix commented Mar 22, 2021 • edited Loading

ellatrix commented Mar 22, 2021

dmsnell commented Mar 23, 2021

ellatrix commented Mar 24, 2021

dmsnell commented Mar 24, 2021

ellatrix commented Mar 25, 2021

dmsnell commented Mar 25, 2021

dmsnell commented Aug 13, 2022

youknowriad commented Oct 24, 2024

ellatrix commented Oct 25, 2024

ellatrix commented Mar 22, 2021 •

edited

Loading