Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure html attachments without <tbody> elements duplicate successfully #250

Conversation

davidgisbey
Copy link
Contributor

@davidgisbey davidgisbey commented Aug 12, 2022

Description

At the moment, there's a bug where some attachments are not being successfully duped when a new edition is created by the Edition#create_draft method. We'e received a couple (or more) Zendesk tickets regarding the issue.

After some digging it's become clear that this occurs when a user inputs questionable HMTL for tables. Specifically when they don't use a <tbody> for the body of the table.

When a user saves a HTML attachment on Whitehall it's validates it via the Govspeak::HtmlValidator.

This calls the Document#to_html method. When we pass sanitize: true into the options we then call the Sanitize#sanitize method from the Sanitize gem which uses the fragment method. Here's the code:

def fragment(html)
  return '' unless html

  frag = Nokogiri::HTML5.fragment(preprocess(html), **@config[:parser_options])
  node!(frag)
  to_html(frag)
end

https://github.com/rgrove/sanitize/blob/main/lib/sanitize.rb#L66-L68

The Nokogiri::HTML5.fragment has some cool, but slightly unexpected behaviour that is outlined here
https://makandracards.com/makandra/481802-how-to-prevent-nokogiri-from-fixing-invalid-html

Essentially though it fixes up invalid HTML. One side effect is that it can add extra linebreaks that were not there before, which is happening in this case.

By the time the HMTL is compared with and without sanitisation in the Validator, subtle whitespace changes have snuck in which causes the == operator to fail.

While we could definitely implement a bespoke fix for this issue, we've decided to go with removing linebreaks from the html before comparing it with and without sanitisation. It's largely just noise and could definitely cause issues down the line again when further releases occur.

Trello card

https://trello.com/c/HSlot63V/623-investigate-bug-html-attachments-sometimes-dont-copy-over-to-new-drafts

At the moment, there's a bug where some attachments are not being
successfully duped when a new edition is created by the
`Edition#create_draft` method.

After some digging it's become clear that this occurs when a user inputs
questionable HMTL for tables. Specifically when they don't use a
<tbody> for the body of the table.

When a user saves a HTML attachment on Whitehall it's validates it via
the Govspeak::HtmlValidator.

This calls the Document#to_html method. When we pass `sanitize: true`
into the options we then call the `Sanitize#sanitize` method from the
`Sanitize` gem which uses the `fragment` method. Here's the code:

```
def fragment(html)
  return '' unless html

  frag = Nokogiri::HTML5.fragment(preprocess(html), **@config[:parser_options])
  node!(frag)
  to_html(frag)
end
```
https://github.com/rgrove/sanitize/blob/main/lib/sanitize.rb#L66-L68

This `Nokogiri::HTML5.fragment` has some cool, but slightly unexpected
behaviour that is outlined here
https://makandracards.com/makandra/481802-how-to-prevent-nokogiri-from-fixing-invalid-html

Essentially though it fixes up invalid HTML. One side effect is that
it can add extra linebreaks that were not there before, which is
happening in this case.

By the time the HMTL is compared with and without sanitisation in the
Validator, subtle whitespace changes have snuck which causes the `==`
operator to fail.

While we could definitely implement a bespoke fix for this issue, we've
decided to go with removing linebreaks from the html before comparing
it with and without sanitisation. It's largely just noise and could
definitely cause issues down the line again when further releases occur.
Copy link

@beccapearce beccapearce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great explanation of the issue!!! ⭐

@davidgisbey davidgisbey merged commit 40b8d75 into main Aug 12, 2022
@davidgisbey davidgisbey deleted the remove-linebreaks-when-comparing-sanitised-and-unsanitised-html branch August 12, 2022 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants