Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LS-81] fix(markdown-utils): change sanitization process + add unescape #718

Merged
merged 7 commits into from
Apr 20, 2023

Conversation

seaerchin
Copy link
Contributor

@seaerchin seaerchin commented Apr 19, 2023

Problem

Previously, there was inconsistent behaviour caused by sanitization on the backend. This is because of dompurify's sanitization config, where it will automatically html encode certain special characters if it detects that there is a html tag.

This process affects not just content within the tag, but the string as a whole. For example,

const x = "& <b>&something</b>"

will have both ampersands encoded even though the first one is outside of the b-tag.

Closes LS-81

Solution

In order to make sure that sanitization takes place properly, this PR establishes an invariant, as follows:
frontmatter content is never html encoded.

This is chosen over html encoding all content in our frontmatter due to the existence of the 3 special pages (homepage/nav/contact-us), where users are able to input html (:sadge:)

In order to preserve this property, a few rules have to be followed (done alr at present)

  1. devs never call sanitize directly for markdown files but through a given interface (<convert|retrieve>DataFromMarkdown).
  2. the respective conversion/retrieval functions separately sanitize frontmatter/body
  3. we run unescape after sanitizing frontmatter.

This allows us

Testing

  • navigate to the CMS and choose a testing site
  • make the page have special character in frontmatter (can be done through editing the page directly on github if lazy)
  • rename a page to have a special character in the title
  • check that clicking settings on the page does not show a html encoded title

Notes
This change might be destructive - existing pages w html encoded properties in their frontmatter will have them unescaped. I'm not entirely sure how this impacts things like permalink etc but it's also possible to guarantee only for the title property

@seaerchin seaerchin requested a review from a team April 19, 2023 10:28
@seaerchin
Copy link
Contributor Author

@alexanderleegs tagging you separately in case there is anything that isn't back-compat

Copy link
Contributor

@kishore03109 kishore03109 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change might be destructive - existing pages w html encoded properties in their frontmatter will have them unescaped. I'm not entirely sure how this impacts things like permalink etc but it's also possible to guarantee only for the title property

hmm, if we are not sure the possible implications of this, is there a reason why we don't make this change only for the title property then to reduce surface area of bugs?

// so this does not do anything destructive.
// Do note that frontmatter containing pre-existing html encoded characters (&amp;)
// will get transformed regardless.
(val) => _.unescape(val)
Copy link
Contributor

@kishore03109 kishore03109 Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This solution escapes all html characters, wdyt about only escaping &, since that is the most common case that we are encountering at the moment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the common case isn't the only case - this means that the same bug can appear, just with a different character that's escaped. in the event that it happens, we'd have to expend eng resources to dig through + fix so i'd rather just escape all

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think there are going to be additional security concerns that we might have with allowing all html for all frontmatter?
Considering we have CSP headers + this is already the case for some pages, it seems ok, but just checking in

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR doesn't allow/disallow html, it just escapes encoded html present in frontmatter. the sanitization invariant is still preserved (front matter still sanitised)

Copy link
Contributor

@alexanderleegs alexanderleegs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2023-04-20 at 11 29 45 AM

creating new pages creates this commented out thing at the moment!

@alexanderleegs
Copy link
Contributor

@alexanderleegs tagging you separately in case there is anything that isn't back-compat

I think this is fine, we previously didn't do any html encoding for the special pages either as far as i know?

@seaerchin
Copy link
Contributor Author

Screenshot 2023-04-20 at 11 29 45 AM

creating new pages creates this commented out thing at the moment!

does this actually impact editing experience? this is injected due to sanitize encountering an empty body and injecting it into a document. if it's concerning, we could always avoid sanitization if it's an empty string

@seaerchin seaerchin requested review from a team, kishore03109 and alexanderleegs April 20, 2023 03:46
Copy link
Contributor

@kishore03109 kishore03109 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

does this actually impact editing experience? this is injected due to sanitize encountering an empty body and injecting it into a document. if it's concerning, we could always avoid sanitization if it's an empty string

this does seem confusing at the first glance... could we add a test case to show this behaviour is expected?

@alexanderleegs
Copy link
Contributor

Screenshot 2023-04-20 at 11 29 45 AM creating new pages creates this commented out thing at the moment!

does this actually impact editing experience? this is injected due to sanitize encountering an empty body and injecting it into a document. if it's concerning, we could always avoid sanitization if it's an empty string

Could we put in the check for empty string then? As a user creating a new page, having something you didn't input immediately show up in your editing view probably isn't ideal

@seaerchin seaerchin merged commit 0f10b20 into develop Apr 20, 2023
@seaerchin seaerchin deleted the IS-81/fix/third-nav-title branch April 20, 2023 06:36
@alexanderleegs alexanderleegs mentioned this pull request Apr 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants