[LS-81] fix(markdown-utils): change sanitization process + add unescape #718
Conversation
@alexanderleegs tagging you separately in case there is anything that isn't back-compat
This change might be destructive - existing pages w html encoded properties in their frontmatter will have them unescaped. I'm not entirely sure how this impacts things like permalink etc but it's also possible to guarantee only for the title property
hmm, if we are not sure of the possible implications of this, is there a reason why we don't make this change only for the title property, to reduce the surface area for bugs?
// so this does not do anything destructive.
// Do note that frontmatter containing pre-existing html encoded characters (&amp;)
// will get transformed regardless.
(val) => _.unescape(val)
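To make concrete what the callback above does to a single value, here is a minimal sketch of the five entities that lodash's `_.unescape` is documented to decode. `unescapeHtml` is a hypothetical stand-in written for illustration so the snippet runs without lodash; it is not the code from this PR.

```typescript
// Stand-in for lodash's _.unescape (assumption: same entity set, single pass).
const HTML_UNESCAPES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
};

function unescapeHtml(value: string): string {
  // Each entity is decoded exactly once; the result is not re-scanned.
  return value.replace(
    /&(?:amp|lt|gt|quot|#39);/g,
    (entity) => HTML_UNESCAPES[entity] ?? entity
  );
}

// unescapeHtml("Tom &amp; Jerry")  → "Tom & Jerry"
// unescapeHtml("&amp;lt;")         → "&lt;"   (single pass, not "<")
// unescapeHtml("a&nbsp;b")         → "a&nbsp;b" (other entities are left alone)
```

The single-pass behaviour is why a title that was already stored as `&amp;` before this change comes back as a bare `&`, which is the transformation the code comment warns about.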
Nit: This solution unescapes all html characters, wdyt about only unescaping &, since that is the most common case that we are encountering at the moment?
the common case isn't the only case - this means that the same bug can appear, just with a different character that's escaped. in the event that it happens, we'd have to expend eng resources to dig through + fix, so i'd rather just unescape all
Do you think there are going to be additional security concerns that we might have with allowing all html for all frontmatter?
Considering we have CSP headers + this is already the case for some pages, it seems ok, but just checking in
this PR doesn't allow/disallow html, it just unescapes encoded html present in frontmatter. the sanitization invariant is still preserved (frontmatter is still sanitised)
I think this is fine, we previously didn't do any html encoding for the special pages either as far as i know?
nit:
does this actually impact editing experience? this is injected due to sanitize encountering an empty body and injecting it into a document. if it's concerning, we could always avoid sanitization if it's an empty string
this does seem confusing at the first glance... could we add a test case to show this behaviour is expected?
creating new pages creates this commented out thing at the moment! Could we put in the check for empty string then? As a user creating a new page, having something you didn't input immediately show up in your editing view probably isn't ideal
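The empty-string check discussed in this thread could look roughly like the sketch below. `sanitizeBody` and the `Sanitizer` type are hypothetical names, and the sanitizer is passed in as a parameter so the snippet does not assume anything about how the real one is configured.

```typescript
type Sanitizer = (input: string) => string;

// Hypothetical guard: an empty body has nothing to sanitize, so return it
// unchanged instead of letting the sanitizer wrap it in a document.
function sanitizeBody(body: string, sanitize: Sanitizer): string {
  if (body === "") {
    return body;
  }
  return sanitize(body);
}
```

With a guard like this, a newly created page keeps an empty body, while any non-empty body still goes through sanitization as before.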
Problem
Previously, sanitization on the backend behaved inconsistently. This is because of dompurify's sanitization config: it automatically html encodes certain special characters if it detects an html tag in the string.
This process affects not just content within the tag, but the string as a whole. For example, a string containing a b-tag and two ampersands will have both ampersands encoded, even though the first one is outside of the b-tag.
Closes LS-81
Solution
In order to make sure that sanitization takes place properly, this PR establishes an invariant, as follows:
frontmatter content is never html encoded.
This is chosen over html encoding all content in our frontmatter due to the existence of the 3 special pages (homepage/nav/contact-us), where users are able to input html (:sadge:)
In order to preserve this property, a few rules have to be followed (already done at present):
- never call sanitize directly for markdown files, but only through a given interface (<convert|retrieve>DataFromMarkdown).
- always call unescape after sanitizing frontmatter. This allows us
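The two rules above can be sketched as a single entry point that always sanitizes first and unescapes second. `makeFrontmatterCleaner` is a hypothetical helper, not the PR's `<convert|retrieve>DataFromMarkdown` implementation; both transforms are injected, and only top-level string values are handled, which is an assumption of this sketch.

```typescript
type StringTransform = (input: string) => string;

// Hypothetical helper: builds the one function through which frontmatter is
// cleaned, so callers cannot sanitize without also unescaping.
function makeFrontmatterCleaner(
  sanitize: StringTransform,
  unescapeHtml: StringTransform
): (frontmatter: Record<string, unknown>) => Record<string, unknown> {
  return (frontmatter) => {
    const cleaned: Record<string, unknown> = {};
    Object.keys(frontmatter).forEach((key) => {
      const value = frontmatter[key];
      // Sanitize first, then unescape, so stored values are never html encoded.
      cleaned[key] =
        typeof value === "string" ? unescapeHtml(sanitize(value)) : value;
    });
    return cleaned;
  };
}
```

Routing every caller through one function like this is what keeps the invariant checkable: there is a single place where the ordering of the two steps is decided.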
Testing
Notes
This change might be destructive - existing pages w html encoded properties in their frontmatter will have them unescaped. I'm not entirely sure how this impacts things like permalink etc but it's also possible to guarantee only for the title property
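If the narrower guarantee were preferred, restricting the transformation to the title property might look like the following sketch. `transformOnlyKeys` is a hypothetical helper, and the transform is injected rather than tied to any particular unescape implementation.

```typescript
type TextTransform = (input: string) => string;

// Hypothetical helper: applies `transform` only to the listed string-valued
// keys and copies every other property through untouched.
function transformOnlyKeys(
  frontmatter: Record<string, unknown>,
  keys: string[],
  transform: TextTransform
): Record<string, unknown> {
  const result: Record<string, unknown> = { ...frontmatter };
  keys.forEach((key) => {
    const value = result[key];
    if (typeof value === "string") {
      result[key] = transform(value);
    }
  });
  return result;
}
```

Called with `["title"]`, this leaves properties such as permalink exactly as stored, at the cost of the same encoding bug remaining possible in every other property.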