Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better BR elements handling #12033

Merged
merged 24 commits into from
Jul 20, 2022
Merged

Better BR elements handling #12033

merged 24 commits into from
Jul 20, 2022

Conversation

niegowski
Copy link
Contributor

@niegowski niegowski commented Jul 7, 2022

Suggested merge commit message (convention)

Fix (enter): A <br> element between blocks should be converted to an empty paragraph. Closes #10217.

Test (clipboard): Added test for pasting content with <br> element between blocks. See #10217.

Internal (engine): Handling of <br> elements in the data pipeline moved from DomConverter to the softBreak conversion.


Additional information

The original issue is only about pasting such content (trimmed for readability):

<p>A</p>
<br />
<p>B</p>

This content expects 3 paragraphs and the middle one is empty but got converted (because of auto-paragraphing, softBreak is not allowed in root context) as:

<paragraph>A</paragraph>
<paragraph><softBreak/></paragraph>
<paragraph>B</paragraph>

But since this is very generic HTML and the behavior of <br> elements is very specific in web browsers I did some research to find out how different <br> scenarios are handled by a web browser and CKE5. 

Here is a result (from a manual test: http://fake.ckeditor.com:8125/ckeditor5-enter/tests/manual/softbreak-paragraphing.html):

The broken cases

InputActualExpected
<br>
<p>foo</p>
<p><br>&nbsp;</p>
<p>foo</p>
<p>&nbsp;</p>
<p>foo</p>
<p>before</p>
<br>
<p>after</p>
<p>before</p>
<p><br>&nbsp;</p>
<p>after</p>
<p>before</p>
<p>&nbsp;</p>
<p>after</p>
<p>before</p>
foo<br>
<p>after</p>
<p>before</p>
<p>foo<br>&nbsp;</p>
<p>after</p>
<p>before</p>
<p>foo</p>
<p>after</p>
<p>before</p>
foo<br> {whitespaces}
<p>after</p>
<p>before</p>
<p>foo<br>&nbsp;</p>
<p>after</p>
<p>before</p>
<p>foo</p>
<p>after</p>
<p>before</p>
<p>foo<br> {whitespaces} </p>
<p>after</p>
<p>before</p>
<p>foo<br>&nbsp;</p>
<p>after</p>
<p>before</p>
<p>foo</p>
<p>after</p>
<p>foo</p>
<br>
<p>foo</p>
<p><br>&nbsp;</p>
<p>foo</p>
<p>&nbsp;</p>

Above cases were failing also while <br> was wrapped with some inline elements (like a link or strong).

The only case that was explicitly handled by the editor was <p><br></p> and it was fragile to any whitespace around BR element.

First solution attempt

I was trying to solve the issue by extending the <p><br></p> case that was handled in DomConverter:

// Special case for <p><br></p> in which <br> should be treated as filler even when we are not in the 'br' mode. See ckeditor5#5564.
if ( domNode.tagName === 'BR' && hasBlockParent( domNode, this.blockElements ) && domNode.parentNode.childNodes.length === 1 ) {
return true;
}

But it had problems:

  • it required using a lot of text trimming logic (that logic is already implemented in DomConverter but it would require triggering it multiple times to handle look-ahead checks (for checking block fillers and to trim text itself)
  • it was a "hack" because in the data pipeline only NBSP block fillers are legit, BRs were an additional workaround because they would not land normally in the data pipeline
  • while replacing <br/> by a <p></p> in some cases, it generated paragraphs nested in other paragraphs or other block elements that could contain text so the generated view structure would be messed up (but upcast conversion to the model would handle it properly)

Final solution

I decided to move the main solution to softBreak upcast converter because:

  • the proper NBSP block fillers would be already removed
  • all text nodes would be properly trimmed (especially useful for lookahead checks)
  • scenarios of handling BR elements in a specific contexts are much clearer there

You could check how browsers and the editor are handling <br>s in some cases in the new manual test.

@niegowski niegowski changed the title [WiP] Better block filler handling [WiP] Better BR elements handling Jul 8, 2022
@niegowski niegowski marked this pull request as ready for review July 11, 2022 13:52
@niegowski niegowski changed the title [WiP] Better BR elements handling Better BR elements handling Jul 11, 2022
oleq
oleq previously requested changes Jul 12, 2022
Copy link
Member

@oleq oleq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality looks OK.

  • There's a problem with GHS.
    2022-07-12 12 29 58
  • I'd add a regular test for paste from office just for the reference and future developers who may come across this issue.

@niegowski
Copy link
Contributor Author

I just checked how it works while BR is wrapped with some inline element and it differs from the browser rendering. Checking it.

@niegowski niegowski marked this pull request as draft July 13, 2022 15:58
@niegowski niegowski marked this pull request as draft July 13, 2022 15:58
@niegowski niegowski marked this pull request as ready for review July 14, 2022 15:53
@Reinmar
Copy link
Member

Reinmar commented Jul 18, 2022

Internal (engine): Handling of <br> elements in the data pipeline moved from DomConverter to the softBreak conversion.

I don't know whether we can call this an internal change. The feature was moved from the engine to a plugin. Did you think on how the behavior of the editor without SoftBreak changed? A thing to consider is that it also touches pasting. Editors without this plugin will be quite rare so we can be rather liberal here, but let's at least have this thought through.

@Reinmar
Copy link
Member

Reinmar commented Jul 18, 2022

Thanks for writing down these thoughts. It's an interesting idea to move the code to the plugin. It's definitely more elegant if <br> handling is done on a conversion layer – fewer things to worry about this at that stage.

OTOH, we're now moving logic that was resolving issues related to how semantic data has to be enhanced to show well in a browser (or strip from these enhancements) to the conversion layer. But on the third hand, that's where we clean up the content and fix where it was misformatted (e.g. autoparagraphs, wrong nesting, disallowed children, etc.).

So, an important thing to consider here is that those BRs are not part of the enhancement that we did. We use <br>s only in the BR mode and those are still "deenhanced" correctly by DomConverter: which guarantees its symmetricity. Other <br>s can be part of the cleanup done on the conversion layer.

So after giving it a deeper thought, I think the change makes sense.

@Reinmar
Copy link
Member

Reinmar commented Jul 18, 2022

One more thing to consider: Performance.

We had poor implementation, but probably negligible from the performance pov. Now, it's more complete, but at what cost?

@Reinmar Reinmar requested a review from oleq July 18, 2022 09:48
@Reinmar Reinmar dismissed oleq’s stale review July 18, 2022 09:48

Outdated.

@@ -514,6 +514,43 @@ describe( 'Input feature', () => {
expect( getViewData( view ) ).to.equal( '<p><strong>Foo</strong> {}</p>' );
} );

it( 'should handle bogus br correctly (with custom br to softBreak converter)', () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you come up with these tests? Did you encounter some regressions after making this change? Does it anyhow affect beforeinput-related work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case is that those bogus BRs are stripped while converting to the model (input compares the current model and the model generated from the modified DOM). This case won't affect the beforeinput because it's not used there anymore.

@Reinmar
Copy link
Member

Reinmar commented Jul 18, 2022

The first test I done was ctrl+a, ctrl+c in this page (this PR) and paste it to http://localhost:8125/ckeditor5/tests/manual/all-features.html

Unfortunately, on this branch crashes with:

converters.js:849 Uncaught TypeError: Cannot read properties of null (reading 'start')
    at viewToModelListItemChildrenConverter (converters.js:849:1)
    at UpcastDispatcher.viewModelConverter (converters.js:388:1)
    at UpcastDispatcher.fire (emittermixin.js:200:1)
    at UpcastDispatcher._convertItem (upcastdispatcher.js:258:1)
    at UpcastDispatcher._convertChildren (upcastdispatcher.js:291:1)
    at UpcastDispatcher.<anonymous> (upcasthelpers.js:441:1)
    at UpcastDispatcher.fire (emittermixin.js:200:1)
    at UpcastDispatcher._convertItem (upcastdispatcher.js:258:1)
    at UpcastDispatcher._convertChildren (upcastdispatcher.js:291:1)
    at UpcastDispatcher.<anonymous> (upcasthelpers.js:790:1)

The best way to test conversion and anything related to that is to take random pieces of HTML from various websites and just paste :D

@Reinmar
Copy link
Member

Reinmar commented Jul 18, 2022

Scenarios to check:

  • Using shft+enter in "creative" places (including creating multiple breaks) and doing editor.setData( editor.getData() ) to check that nothing was lost.
  • Pasting content from various websites to  check the stability.
  • Smoketesting pasting from Google Docs and Word.
  • Performance comparison before vs after.

@Reinmar
Copy link
Member

Reinmar commented Jul 18, 2022

I was curious whether the PR indeed fixes the original issue. So, how paste from this document works: https://docs.google.com/document/d/1abo1Ca_RHEmuXnAX6k_z0itHihQb3g9rJRM3x4vQ25w/edit

NOTE: This document is really oddly formatted. When you test a normal document, created with normal paragraphs in Google Docs (I tested some other docs I had), the result is better. So, this entire PR is about fixing a rather specific case, although, apparently, relatively frequent as well.

Before

<p>
    <span style="background-color:transparent;color:#000000;">Sometimes, there’s nothing scarier than a blank page waiting to be filled - especially when it comes to figuring out how to start a resume.&nbsp;</span>
</p>
<p>
    <br>
    &nbsp;
</p>
<p>
    <span style="background-color:transparent;color:#000000;">For every position that you apply for, you will need to stand out in a pile of applications through your professional experience, achievements, and education, to impress recruiters.&nbsp;</span>
</p>
<p>
    <br>
    &nbsp;
</p>
<p>
    <span style="background-color:transparent;color:#000000;">Just thinking about all this may make starting your resume seem like a Herculean task.&nbsp;</span>
</p>
<p>
    <br>
    &nbsp;
</p>

On this PR

<p>
    <span style="background-color:transparent;color:#000000;">Sometimes, there’s nothing scarier than a blank page waiting to be filled - especially when it comes to figuring out how to start a resume.&nbsp;</span>
</p>
<p>
    &nbsp;
</p>
<p>
    <span style="background-color:transparent;color:#000000;">For every position that you apply for, you will need to stand out in a pile of applications through your professional experience, achievements, and education, to impress recruiters.&nbsp;</span>
</p>
<p>
    &nbsp;
</p>
<p>
    <span style="background-color:transparent;color:#000000;">Just thinking about all this may make starting your resume seem like a Herculean task.&nbsp;</span>
</p>
<p>
    &nbsp;
</p>

Hmm... there are still empty paragraphs that aren't visible in the original document. If paragraphs are styled to have margins (as in our docs) it looks like this:

So, differently than in Google Docs. However, as I mentioned,  the problem is with the styling, not data.

I decided to check CKE4 and TinyMCE's powerpaste out of curiosity too as they worked on their filtering systems for a couple more years. Yet, the results are identical.

CKEditor 4

TinyMCE

To sum up

The PR resolves the reported issue but it's probably even impossible to ensure that the content look very much alike between Google Docs and other systems.

@Reinmar
Copy link
Member

Reinmar commented Jul 18, 2022

I did some quick performance tests and for a 12kb of content that's relatively heavy with <br>s, the difference was negligible: 190ms -> 210ms.

Co-authored-by: Piotrek Koszuliński <[email protected]>
@niegowski
Copy link
Contributor Author

Unfortunately, on this branch crashes with:

converters.js:849 Uncaught TypeError: Cannot read properties of null (reading 'start')
    at viewToModelListItemChildrenConverter (converters.js:849:1)
    at UpcastDispatcher.viewModelConverter (converters.js:388:1)
    at UpcastDispatcher.fire (emittermixin.js:200:1)
    at UpcastDispatcher._convertItem (upcastdispatcher.js:258:1)
    at UpcastDispatcher._convertChildren (upcastdispatcher.js:291:1)
    at UpcastDispatcher.<anonymous> (upcasthelpers.js:441:1)
    at UpcastDispatcher.fire (emittermixin.js:200:1)
    at UpcastDispatcher._convertItem (upcastdispatcher.js:258:1)
    at UpcastDispatcher._convertChildren (upcastdispatcher.js:291:1)
    at UpcastDispatcher.<anonymous> (upcasthelpers.js:790:1)

I narrowed the crashing content to:

<ul>
	<li>foo<br><custom-element></custom-element></li>
</ul>

@niegowski
Copy link
Contributor Author

I narrowed the crashing content to:

<ul>
	<li>foo<br><custom-element></custom-element></li>
</ul>

So the problem is that <custom-element> and any inline object (like <img>) were ignored while looking for the BR's sibling so the BR converter consumed it but without providing any model element. This crashed on this line:

const convertedChild = result.modelRange.start.nodeAfter;

because there is no result.modelRange for not converted element.

The case of known inline object elements (like <img>) is now solved but the case where there is an unknown inline object is not solved. These cases are fixed:

foo<br><img/>
foo<br><custom-element><img/></custom-element>
<p>foo<br><img/></p>
<p>foo<br><custom-element><img/></custom-element></p>

But if the custom element does not contain any content then it crashes:

foo<br><custom-element></custom-element>

But there might be more cases where this would fail because of the line mentioned above.

@Reinmar
Copy link
Member

Reinmar commented Jul 19, 2022

I'm worried that in ab3e3ad you added just a single test. I'd generate a couple more for weird variations: elements before, multiple elements one by one, other inline elements, etc. I found that the tree-walking logic is often incomplete.

@Reinmar
Copy link
Member

Reinmar commented Jul 19, 2022

I did some more smoke tests and also checked whether we didn't start losing more <br>s. The results are good:

(left is on the PR, right is master)

If there are differences, they look like this one.

@niegowski
Copy link
Contributor Author

I found another case that I missed before:

<p><br><br><br></p>

should remove only the last BR element, others should be converted as softBreaks (it's fixed in the last commit).

@niegowski
Copy link
Contributor Author

More details on the crashing case (fixed in one of the previous commits):

The input data:

<ul>
	<li>foo<br></li>
</ul>

The crashing line was:

const convertedChild = result.modelRange.start.nodeAfter;

because of the assumption that conversionApi.convertItem() would always convert a given view item and the result.modelRange would be there. This is mostly true because even if some element does not get converted by any dedicated feature then the fallback converter kicks in and converts the element content so there almost always is some modelRange:

if ( !data.modelRange && conversionApi.consumable.consume( data.viewItem, { name: true } ) ) {
const { modelRange, modelCursor } = conversionApi.convertChildren( data.viewItem, data.modelCursor );
data.modelRange = modelRange;
data.modelCursor = modelCursor;
}

there is one exception, the fallback converter expects that the element was not consumed and modelRange was not set by any converter. In the BR case, the element itself was consumed so the fallback converter did not kick in and modelRange was left unset. BR converter must consume the view element because otherwise the GHS conversion would try to convert it and we don't want that.

So if any feature is using conversionApi.convertItem() and is not verifying whether the given view node got converted then it could crash (in the feature code). Of course, this applies to generic cases that do not use it to convert a specific view element (we use it in the table, and image features but we pass a very specific element that we know will be converted - img or table). The conversionApi.convertChildren() is safe because it verifies whether the modelRange was returned from the single child conversion and also always returns the range (sometimes collapsed at the modelCursor position).

@Reinmar
Copy link
Member

Reinmar commented Jul 19, 2022

So if any feature is using conversionApi.convertItem() and is not verifying whether the given view node got converted then it could crash (in the feature code). Of course, this applies to generic cases that do not use it to convert a specific view element (we use it in the table, and image features but we pass a very specific element that we know will be converted - img or table). The conversionApi.convertChildren() is safe because it verifies whether the modelRange was returned from the single child conversion and also always returns the range (sometimes collapsed at the modelCursor position).

In other words, the frequency will be really low. There has to be a combination of a custom converter, using convertItem(), not checking the range, and having <br> in the content inside that element.

@niegowski
Copy link
Contributor Author

In other words, the frequency will be really low. There has to be a combination of a custom converter, using convertItem(), not checking the range, and having <br> in the content inside that element.

Yes.

@dufipl
Copy link
Contributor

dufipl commented Jul 19, 2022

At this moment after some first testing and going through those points mainly:

Scenarios to check:

  • Using shft+enter in "creative" places (including creating multiple breaks) and doing editor.setData( editor.getData() ) to check that nothing was lost.
  • Pasting content from various websites to  check the stability.
  • Smoketesting pasting from Google Docs and Word.

I was not able to crash the editor or see other issues.

I can see improvement regarding the original ticket and only the styling issue (line spacing) remained, but it was discussed above.

So to sum up, I haven't spotted new issues, but I think we should test this heavily during the Testing Phase.

@Reinmar
Copy link
Member

Reinmar commented Jul 20, 2022

CI was clean apart from a usual 404 in manual tests.

@Reinmar Reinmar merged commit 394a19b into master Jul 20, 2022
@Reinmar Reinmar deleted the ck/10217 branch July 20, 2022 05:03
@Reinmar
Copy link
Member

Reinmar commented Jul 29, 2022

This change will be reverted in #12183.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extra paragraph and break elements when copy/pasting from Google Docs
4 participants