-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KSES: Allow Custom Data Attributes having dashes in their dataset name. #6429
base: trunk
Are you sure you want to change the base?
Conversation
Allow spec-compliant data-attributes in `wp_kses_attr_check()`
52a9492
to
5331865
Compare
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
841746d
to
0c7ad50
Compare
I dug into the relevant section of the spec, but I think the interesting point here is that data attributes in the browser are handled much more loosely (as already mentioned in the description). The spec seems to explicitly disallow ASCII uppercase and <div data-XYZ:ABC="works">ok</div> document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeType === document.ATTRIBUTE_NODE
// true
document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeName
// "data-xyz:abc"
document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeValue
// "works"
document.querySelector('div').dataset['xyz:abc'];
// "works" specs and references
|
src/wp-includes/kses.php
Outdated
*/ | ||
if ( str_starts_with( $name_low, 'data-' ) && ! empty( $allowed_attr['data-*'] ) | ||
&& preg_match( '/^data(?:-[a-z0-9_]+)+$/', $name_low, $match ) | ||
&& preg_match( '~^data-[^=/> \\t\\f\\r\\n]+$~', $name_low, $match ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning for matching the inverse of this set of characters specifically? Is this based on the specification for attributes names in HTML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes exactly, because data attributes themselves are just attributes. there's a large gap of attribute names based on matching ASCII a-z. emoji can form attributes names, for example.
now it might be reasonable to ask, why add this if we don't want to support all those? but the purpose of this detector is to validate data attributes. if we add additional rules above the spec for these, we will miss validating certain data attributes we didn't anticipate.
part of me also didn't want to do this, so maybe what I wrote is wrong, because it ultimately matters most which of these will be turned into the dataset
in the browser so I'm happy to change this if we think we need to, but it seemed like they all were collected in my testing, and I didn't want to invite escape hatches based on ad-hoc pattern-matching.
tests/phpunit/tests/kses.php
Outdated
$test = '<div data-foo="foo" data-bar="bar" datainvalid="gone" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>'; | ||
$expected = '<div data-foo="foo" data-bar="bar" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see cases covering data--leading
data-trailing-
and data-middle--double
cases. The --
together in the middle is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change it would be good to convert this to use a data provider and add a bunch more valid and invalid use cases.
It probably should have been a data provider all along but please do not look in to the history of this feature to figure out who didn't do that in the first place. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converted in 5294a86
this is going to be another case where it's evident that the HTML5 specification is a family of interrelated specifications for different purposes. there are different rules for what we allow producing than there are for what we demand for reading. for example, attributes aren't allowed to be created with brackets in their name, like on data attributes what I was most concerned about is matching browser behavior. we should have a direct correspondence between the attributes on the server that appear in a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some early notes, will try to get to a deeper dive later this week.
tests/phpunit/tests/kses.php
Outdated
|
||
// Unexpected but recognized custom data attributes. | ||
'Only comprising a prefix' => array( 'data-', '' ), | ||
'With upper case ASCII' => array( 'data-Post-ID', 'postId' ), | ||
'With Unicode whitespace' => array( "data-\u{2003}", "\u{2003}" ), | ||
'With Emoji' => array( 'data-🐄-pasture', '🐄Pasture' ), | ||
'Brackets and colon' => array( 'data-[wish:granted]', '[wish:granted]' ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WordPress can be stricter than the specs and browser implementations (especially that latter if they differ from the former), to an extent that's the entire purpose of KSES.
I don't think accounting for special characters is overly wise as it's too easy to miss situations in which a contributor or author role can break out of the intended behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @peterwilsoncc! in the code for this PR there are two stages: the first is determining if there's a data attribute, and then the second is locking down a smaller allowable subset. these tests, and this function, is to ensure that we can all agree on what is a data attribute before we then apply further constraints.
it doesn't have to be this way, of course, but I find that it's really easy to let things slide through when we conflate the two ideas. in this particular case it looks like everything else should be eliminated in wp_kses_attr_check()
other than the specific data attributes, but I also had to double and triple check the logic to make sure we weren't letting through some data attributes only because they didn't fit the pattern we expected.
so this was done for the reason of making it less likely to let something slip through. it doesn't have to be done this way, but it was the purpose for the structuring I made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this was done for the reason of making it less likely to let something slip through. it doesn't have to be done this way, but it was the purpose for the structuring I made.
Thanks @dmsnell that makes a lot of sense. 👯
tests/phpunit/tests/kses.php
Outdated
$test = '<div data-foo="foo" data-bar="bar" datainvalid="gone" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>'; | ||
$expected = '<div data-foo="foo" data-bar="bar" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change it would be good to convert this to use a data provider and add a bunch more valid and invalid use cases.
It probably should have been a data provider all along but please do not look in to the history of this feature to figure out who didn't do that in the first place. ;)
Interestingly, before this change |
@peterwilsoncc I've refactored the test in 5294a86 Also I've updated the PR description to reflect how this PR is no longer attempting to allow all legitimate custom data attributes, but still only those WordPress special-cases, plus the ones with a dash in their Of note, this is slightly different than the approach in #6598. Instead of deciding to accept or reject based on the attribute name, in this patch WordPress is basing the decision off of the translated name. This subtlety means that the test is more directly related to the name of the custom data attribute. For example, if we only test for the presence of the double dash Now, the rule to allow a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a few more notes after going through this more thoroughly.
I'm a little unclear as to the need for the new function as it's the attribute name that KSES is concerned about, the validation regex within the existing function should be for convertible names.
A function converting an attribute name to it's JavaScript equivalent may be useful as an escaping function for inline scripts but isn't directly related to KSES.
* @param bool $is_allowed Whether the given attribute should be allowed. | ||
*/ | ||
public function test_wp_kses_attr_data_attribute_is_allowed( $attribute_name, $is_allowed ) { | ||
$element = "<div {$attribute_name}>Pens and pencils.</div>"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also test "<div {$attribute_name}="populated">Pens and pencils.</div>"
-- either in this test or another one using the same data provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do! Out of curiosity, why do this, since kses
doesn't make any distinction for whether to allow a data attribute or not based on its value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why do this, since
kses
doesn't make any distinction for whether to allow a data attribute or not based on its value?
Mainly as a defensive strategy in case a future maintainer (quite possibly us) makes a mistake. The execution path of KSES is a little very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a second test for value-containing attributes in 0498bad
src/wp-includes/kses.php
Outdated
@@ -1311,6 +1319,54 @@ function wp_kses_attr_check( &$name, &$value, &$whole, $vless, $element, $allowe | |||
return true; | |||
} | |||
|
|||
/** | |||
* If an attribute name represents a custom data attribute, return the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds a summary so the details are all included in the description section of the developer docs (see wp_kses()
as an example).
* If an attribute name represents a custom data attribute, return the | |
* Convert attribute to JavaScript `dataset` form if allowed. | |
* | |
* If an attribute name represents a custom data attribute, return the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peterwilsoncc are you recommending a different wording of the summary, or worried that the existing summary won't appear in the docs? The existing summary is up to two lines at the start of the docblock, so this will appear as-written in the patch.
For example, class_name_updates_to_attributes_updates()
also has a two-line summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mainly suggesting different, shorter, phrasing.
I thought the summaries were intended to be a one-liner but just re-read the doc standards and it turns out I was mistaken.
I'll leave this up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded in dc859f1. It may take some more rewording 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a very good fix without sacrifices. Thank you!
public static function data_possible_custom_data_attributes_and_transformed_names() { | ||
return array( | ||
// Non-custom-data attributes. | ||
'Normal attribute' => array( 'post-id', null ), | ||
'Single word' => array( 'id', null ), | ||
|
||
// Normative custom data attributes. | ||
'Normal custom data attribute' => array( 'data-post-id', 'postId' ), | ||
'Leading dash' => array( 'data--before', 'Before' ), | ||
'Trailing dash' => array( 'data-after-', 'after-' ), | ||
'Double-dashes' => array( 'data-wp-bind--enabled', 'wpBind-Enabled' ), | ||
'Double-dashes everywhere' => array( 'data--one--two--', 'One-Two--' ), | ||
'Triple-dashes' => array( 'data---one---two---', '-One--Two---' ), | ||
|
||
// Unexpected but recognized custom data attributes. | ||
'Only comprising a prefix' => array( 'data-', '' ), | ||
'With upper case ASCII' => array( 'data-Post-ID', 'postId' ), | ||
'With medial upper casing' => array( 'data-uPPer-cAsE', 'upperCase' ), | ||
'With Unicode whitespace' => array( "data-\u{2003}", "\u{2003}" ), | ||
'With Emoji' => array( 'data-🐄-pasture', '🐄Pasture' ), | ||
'Brackets and colon' => array( 'data-[wish:granted]', '[wish:granted]' ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that (surprisingly) all these cases do seem to work with way in the browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I didn't know how to automate them in WordPress' tests, but I don't know if you saw that I included a test script in the PR description which I used to confirm each case. In PHP I use the old detector in WordPress, the one in this patch, and then in JavaScript I set the innerText
to be the property name(s) of the available properties in an element's own dataset
, which shows how the browser transforms them.
The included screenshot is from the output of this script, and I compared all three big browsers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed the test script! Thanks for providing it 👍
I did something very similar myself. On my first attempt, I tried using DOM APIs to create an element and set the attribute. That (setAttribute
) rejects some attribute names that worked fine when setting the data attributes in HTML.
@sirreal what are your thoughts on this vs. #6598? and also about @peterwilsoncc's thoughts.
Personally I feel like I could be convinced both ways, but there's this part of me, especially after working with so many issues in WordPress' HTML handling and That is, for example, maybe today |
@dmsnell I can see it leading to confusing if the new function doesn't return
As it's a kses function it suggests support that isn't available. Generally KSES sub-functions aren't intended to be called directly but that can change over time, as we've seen with |
@peterwilsoncc 🤔 many of the would it make more sense to add this into the HTML API and then have as a reminder, my purpose is to fully vet and adapt this solution for the feedback, not to push to adopt it or choose it in the end. |
Yeah, I've got ya. We're working to the same end & it's a complex question :)
That works as the HTML API seems more about parsing without regard to kses. As the maintainer of the HTML API is that something you are happy with? If it's not something you had planned and don't think is a valid use of the feature then don't feel the need to wedge it in there on my account.
Are you able to clarify this a little more? I'm still trying to get clear on the purpose of the new function in my head and how it differs from the regex validation. |
thanks again @peterwilsoncc, and sorry for my late response, which is due to some transit that threw me off.
I'm happy to put this in there since it relates to the spec issues. Originally we had a The only challenge I see in moving into the HTML API right now is that it's not clear where to put it, and that would probably take some time to see how everyone feels. As an alternative, we could inline the code for now and move it over when we have a stronger idea where it belongs. Or I guess we could still leave it as its own function so that we can leave in the tests for it, but communicate that it's not to be used or called from the outside, and that it will probably move. As in, maybe we can keep it here for now for practical reasons and defer the bigger design question until we can answer it.
Thank you for asking. I don't mean to be unclear, but I don't always get things out very well. I could change the question here a bit. This entire patch and need to open up /**
* Allow `data-*` attributes.
*
* …
*
* Note: the attribute name should only contain `A-Za-z0-9_-` chars,
* double hyphens `--` are not accepted by WordPress.
*/ The existing code makes a claim for the purpose of the regex but then immediately contradicts that claim by specifying new rules. It's there to allow For example, should Core allow Getting back to your question, by separating the question into two steps I find it easier to disambiguate situations like these: is this a custom data attribute? is this one Core allows? We have a clear goalpost for proper parsing, so we don't have to ask ourselves if the regex was intentionally overlooking other custom data attributes vs. whether it was accidentally overlooking them. Since this entire patch is about changing what WordPress allows, we can focus on WordPress additional rules and not conflate that with how to properly detect legitimate ones. We can change our own sanitization without ever effecting the underlying parsing code. Anyway I hope I haven't made this more confusing. This is all about making it easier to talk about and verify that code we write is correct and about separating the act of detecting things from allowing some of those things we detect. It's similar to using the HTML API to find Edit: I thought of a more concrete example I came upon that might help. In the test suite there's a case of |
Thanks for clarifying. I've been considering this over the weekend and my strong inclination is to go with the source code changes in PR #6598 to allow for the lightest touch changes to land in core. Doing some quick testing, I can't see it leading to problems but it's worth coming up with additional edge cases. (Also doing quick tests, I can see that some of the browser interpretations are interesting https://jsbin.com/devefix/edit?html,js,console ) That said, if you can think of problematic cases it would lead to, I am happy to listen. It's best to know about these things pre- rather than post-commit.
If we're to validate the dataset, I'd be inclined to inline it for the time being if the intent is to deprecate in the shortish term. In the past Core has removed private delegations because it didn't reflect reality (List Tables being the poster-child for this) so we'd need to keep it around. Inlining also allows a more considered approach to inclusion in the HTML API while you figure out where best to place it. |
If you feel happy with #6598 then let's merge that in and leave this open for further consideration. The two aren't contradictory, but this one encompasses more choice. I'll approve that one, and if nobody has merged it by later today I'll try and do that. The most important thing to me is that we allow these where WordPress currently removes them for the Interactivity API's purposes. In aa63e2b I added your additional test cases. Cases like these are exactly why I approached the problem the way I did, because intuition of the mapping from HTML to JavaScript can be fuzzy, meaning the rules we apply in PHP are in high risk for being different than we think they are unless we write the rules to the output of that transformation. Thanks @peterwilsoncc! |
We should update the trac ticket for this to https://core.trac.wordpress.org/ticket/61501. This would be an enhancement targeting the 6.7 release. |
I shared some motivation for allowing more data attributes in #61501, namely Interactivity API class directives and the Tailwind CSS framework. This could result in data attributes like This PR would combine nicely with WordPress/gutenberg#65803. |
Trac ticket: Core-61501.
(was Core-61052)
Alternative in #6598
Allow for additional custom data attributes in
wp_kses_attr_check()
.In this patch the set of allowable custom data attribute names is expanded to permit those including dashes in the
dataset
name. The termdataset
name here means the name appearing to JavaScript in a browser. This is found by stripping away thedata-
prefix, lowercasing, and then turning every dash-letter combination into a capital letter.The Interactivity API relies on data attributes of the form
data-wp-bind--enabled
. The corresponding dataset name here iswpBind-Enabled
, and is currently blocked bywp_kses_attr_check()
. This patch allows that and any other data attribute through which would include those dashes in the translated name.This patch is structured in two parts:
The effect of this change is allowing a double-dash when previously only a single dash after the initial
data-
prefix was allowed. It's more complicated than this, because, as evidenced in the test suite, double dashes translate into different names based on where they appear and what follows them in the attribute name.Code used to produce this table
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/data-*