-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent schema.org duplicates #992
Prevent schema.org duplicates #992
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
@@ -14,7 +14,6 @@ function amp_post_template_init_hooks() { | |||
add_action( 'amp_post_template_head', 'amp_post_template_add_scripts' ); | |||
add_action( 'amp_post_template_head', 'amp_post_template_add_fonts' ); | |||
add_action( 'amp_post_template_head', 'amp_post_template_add_boilerplate_css' ); | |||
add_action( 'amp_post_template_head', 'amp_print_schemaorg_metadata' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be removed. The legacy post templates should remain as-is, I believe. The ensure_required_markup
method below does not apply for the legacy post template actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @westonruter
Thanks for your feedback. I corrected that mistake and restored the line.
* @param boolean $expected The expected result. | ||
*/ | ||
public function test_schema_org_present( $script, $expected ) { | ||
$page = '<html><head><script>%s</script></head><body>Test</body></html>'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The <script>
here is missing the type="application/ld+json"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good observation, I added the type attribute on the new tests for ensure_required_markup()
includes/class-amp-theme-support.php
Outdated
public static function schema_org_present( DOMDocument $dom ) { | ||
$head = $dom->getElementsByTagName( 'head' )->item( 0 ); | ||
foreach ( $head->getElementsByTagName( 'script' ) as $script ) { | ||
if ( 1 === preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before looking at the content of the element ($script->nodeValue
) I think it should first check 'application/ld+json' === $script->getAttribute( 'type' )
.
Also, I think the 1 ===
should be removed. As long as preg_match()
returns a truthy value that's all we really care about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the integer comparison and added your suggestion, indeed it makes it more precise.
includes/class-amp-theme-support.php
Outdated
* @param DOMDocument $dom Representation of the document. | ||
* @return bool | ||
*/ | ||
public static function schema_org_present( DOMDocument $dom ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this to is_schema_org_metadata_present
. But actually, I think maybe the logic should just be inlined in ensure_required_markup
like the other checks are done. True that it may be better design to break up the logic into separate methods, but since the existing logic is not in methods then I don't know if we should split out this one separately.
If you changed ensure_required_markup
from protected to be public and then added the logic from schema_org_present
to it, along with unit tests for ensure_required_markup
, then that would be great. Later we may very well want to break up ensure_required_markup
into smaller methods, but I think that would probably be done along with breaking up the AMP_Theme_Support
class into separate classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added your suggestions.
I also did the test for ensure_required_markup, however, I only made tests for the schema.org script part. Could you use tests for the entire function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this method needs more tests, so if that's something you want to add that would be great.
includes/class-amp-theme-support.php
Outdated
@@ -829,6 +828,13 @@ protected static function ensure_required_markup( DOMDocument $dom ) { | |||
) ); | |||
$head->insertBefore( $meta_viewport, $meta_charset->nextSibling ); | |||
} | |||
if ( ! self::schema_org_present( $dom ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted below, I think this logic should be inlined (for now) as it is done above for $meta_charset
and $meta_viewport
and $rel_canonical
below.
Check if schema.org script is present. Add schema.org only if it doesn't exist.
includes/class-amp-theme-support.php
Outdated
// Prevent schema.org duplicates. | ||
$schema_org_script = null; | ||
foreach ( $head->getElementsByTagName( 'script' ) as $script ) { | ||
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regular expression here could be brittle. There's no guarantee that the @context
will be the first key, and there's no guarantee there wouldn't be whitespace between the strings. I think doing a simple false !== strpos( $script->nodeValue, 'schema.org' )
could be sufficient here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may not even need to look at the value. Just checking if the type is application/ld+json
may be just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @westonruter
I opted to search for both the attribute and the 'schema.org' substring as you mentioned it should check for any JSON-LD script in the head which mentions 'schema.org' in any form.
includes/class-amp-theme-support.php
Outdated
$schema_org_script = null; | ||
foreach ( $head->getElementsByTagName( 'script' ) as $script ) { | ||
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) { | ||
$schema_org_script = $script->nodeValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the value isn't being used, it may be better to just store a flag like $has_schema_org_metadata = true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it is true, I made the corresponding correction.
includes/class-amp-theme-support.php
Outdated
} | ||
if ( ! $schema_org_script ) { | ||
$script = $dom->createElement( 'script', wp_json_encode( amp_get_schemaorg_metadata() ) ); | ||
AMP_DOM_Utils::add_attributes_to_node( $script, array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're using $dom->createElement()
here then I think it makes more sense to just do $script->setAttribute( 'type', 'application/ld+json' )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also true. Corrected it.
$dom = new DOMDocument(); | ||
$dom->loadHTML( $page ); | ||
AMP_Theme_Support::ensure_required_markup( $dom ); | ||
$this->assertEquals( substr_count( $dom->saveHTML(), $schema_test_value ), 1 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also should add a test for when JSON is output but the slashes are not escaped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And add a test for when @context
is not the first key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made tests for when @ anothercontext is the first key and has schema.org present, I hope I understood correctly.
One more higher-level question: when I first raised this issue I was thinking that there could only be one such metadata script in the document. However, apparently it's fine to have multiple: https://stackoverflow.com/questions/30723531/best-json-ld-practices-using-multiple-script-elements The main concern here for the AMP plugin is to allow for other plugins to take control of the generation of the Schema.org metadata instead of the defaults that the AMP plugin outputs, but then to have the AMP plugin output a good fallback when others don't do so. I'm feeling like if there is any JSON-LD |
includes/class-amp-theme-support.php
Outdated
$schema_org_script = null; | ||
foreach ( $head->getElementsByTagName( 'script' ) as $script ) { | ||
if ( 'application/ld+json' === $script->getAttribute( 'type' ) && preg_match( '/{"@context":"https?:[\\\]\/[\\\]\/schema\.org/', $script->nodeValue ) ) { | ||
$schema_org_script = $script->nodeValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't this do json_decode( $script->nodeValue, false )
and check if wp_parse_url( $script['@context'], PHP_URL_HOST ) === 'schema.org
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @westonruter,
Trying to use wp_parse_url
with json_decode( $script->nodeValue, false )
breaks the site, but it works fine when $script->nodeValue
is converted to an array. The logic however stops working.
I've found that if you comment out these lines in class-amp-dom-utils.php
, the logic works again:
$placeholders = self::get_mustache_tag_placeholders();
$document = str_replace(
array_keys( $placeholders ),
array_values( $placeholders ),
$document
);
I opted to keep your previous suggestions since i don't know how important are those lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think this makes sense. If I have some JSON that looks like:
{"foo": {"bar":}}
The last }}
there is going to get replaced with a placeholder (since this token is returned by get_mustache_tag_placeholders
) before parsing the DOM, to then get put back in AMP_DOM_Utils::get_content_from_dom_node()
. That means we cannot parse JSON from the parsed DOM currently, which is not ideal. To fix this we'd need to limit the replacements to only happen inside of href
and src
attributes only, using something like this:
diff --git a/includes/utils/class-amp-dom-utils.php b/includes/utils/class-amp-dom-utils.php
index 5288c04..f20fba1 100644
--- a/includes/utils/class-amp-dom-utils.php
+++ b/includes/utils/class-amp-dom-utils.php
@@ -77,9 +77,15 @@ class AMP_DOM_Utils {
* elements since it is faster and wouldn't change the outcome.
*/
$placeholders = self::get_mustache_tag_placeholders();
- $document = str_replace(
- array_keys( $placeholders ),
- array_values( $placeholders ),
+ $document = preg_replace_callback(
+ '#\s(src|href)=("[^"]+"|\'[^\']+\')#s',
+ function( $matches ) use ( $placeholders ) {
+ return str_replace(
+ array_keys( $placeholders ),
+ array_values( $placeholders ),
+ $matches[0]
+ );
+ },
$document
);
That would probably be something good to add here to ensure we can read JSON out of the DOM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8b96d77
to
c66bfc4
Compare
Request For Review
Hi @westonruter,
Could you use this PR, which addresses #962?
Originally, I used json_decode( $script->nodeValue ) to convert the schema.org json data to an array. However, that brought some issues with get_mustache_tag_placeholders(), as it adds some extra markup to the schema.org script tag.
I opted to use preg_match() as I saw the schema.org URL sometimes can either start with http or https.
Fixes #962.