Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor URL validation CLI script to make parts reusable #5306

Merged
merged 49 commits into from
Nov 23, 2020

Conversation

johnwatkins0
Copy link
Contributor

@johnwatkins0 johnwatkins0 commented Sep 1, 2020

Summary

This refactors the URL validation CLI script. The functions that generate URLs to validate are in the new class ScannableURLProvider, and the functions that actually validate URLs are in the new class URLValidationProvider.

This fixes #5307 and is a precursor to several open issues, including anything that involves validating URLs without the loopback request -- e.g. via cron tasks or in client-side requests via REST. In the closed PR #5228, where most of this code comes from, I attempted to refactor the CLI script and create REST wrappers and create a Cron service. In hindsight, that was doing too much, and I think it would be better to get this change reviewed and merged before doing any work that depends on it.

While most of the code in this PR is from #5228, it is in turn mostly from the CLI script. Even the PHPUnit tests were mostly copied into new classes. For that reason, the new classes are still architected with the CLI script in mind, so I believe they may evolve more as we work on the features that depend on them.

Testing

As mentioned above, most of the unit tests here are slightly modified versions of the tests covering the CLI scripts, so their passing is a big indicator that nothing has broken. For QA steps, see #5307.

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

@google-cla google-cla bot added the cla: yes Signed the Google CLA label Sep 1, 2020
@johnwatkins0 johnwatkins0 self-assigned this Sep 1, 2020
@johnwatkins0 johnwatkins0 added Validation WS:UX Work stream for UX/Front-end labels Sep 1, 2020
@johnwatkins0 johnwatkins0 marked this pull request as ready for review September 1, 2020 02:48
@johnwatkins0 johnwatkins0 added this to the v2.1 milestone Sep 1, 2020
@github-actions
Copy link
Contributor

github-actions bot commented Sep 1, 2020

Plugin builds for 9c627b3 are ready 🛎️!

Copy link
Collaborator

@schlessera schlessera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement. I think the naming should be revised between ValidationURLProvider and URLValidationProvider, as these are almost guaranteed to confuse people. I think you've even confused yourself in the tests, as you've used the constructor from one on the other... ;)

There are lots of suggestions in the review, but a large part of it is just nit-picking.

*
* @since 2.1
*/
final class ValidationURLProvider {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid confusion with the UrlValidationProvider, I suggest renaming this to ScannableUrlProvider.

src/Validation/ValidationURLProvider.php Outdated Show resolved Hide resolved
src/Validation/ValidationURLProvider.php Outdated Show resolved Hide resolved
src/Validation/ValidationURLProvider.php Outdated Show resolved Hide resolved
src/Validation/ValidationURLProvider.php Outdated Show resolved Hide resolved
tests/php/src/Validation/ValidationURLProviderTest.php Outdated Show resolved Hide resolved

/** @coversDefaultClass URLValidationProvider */
final class URLValidationProviderTest extends WP_UnitTestCase {
use PrivateAccess, AssertContainsCompatibility;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, both of these traits are not needed, as you're also not using assertStringContains().

Suggested change
use PrivateAccess, AssertContainsCompatibility;

tests/php/test-class-amp-cli-validation-command.php Outdated Show resolved Hide resolved
tests/php/src/Validation/URLValidationProviderTest.php Outdated Show resolved Hide resolved
@@ -8,11 +8,14 @@
use AmpProject\AmpWP\Option;
use AmpProject\AmpWP\Tests\Helpers\AssertContainsCompatibility;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use AmpProject\AmpWP\Tests\Helpers\AssertContainsCompatibility;

This is not needed.

You need to remove the use statement for this as well below here, I can't reach it with the PR review tools.

@google-cla
Copy link

google-cla bot commented Oct 7, 2020

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot removed the cla: yes Signed the Google CLA label Oct 7, 2020
@johnwatkins0
Copy link
Contributor Author

@pierlon @westonruter @schlessera I've addressed all feedback and gotten the new GHA workflow to pass. This is ready for another round of review.

My not-quite-finished PR adding the validation cron events (#5515) depends on this.

@johnwatkins0 johnwatkins0 requested a review from pierlon November 10, 2020 23:15
Copy link
Contributor

@pierlon pierlon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some misc. changes to be made and then all should be well.

*/
public $force_crawl_urls = false;
private $validation_url_provider;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if $validation_url_provider and $validation_provider are misnomers.

I would expect $validation_url_providerto be named something like $scannable_url_provider since it is an instance of ScannableURLProvider. Likewise, I would expect $validation_providerto be named something like $url_validation_provider since it is an instance of URLValidationProvider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. This was a product of shifting class names. Updated in 4c0238d

includes/cli/class-amp-cli-validation-command.php Outdated Show resolved Hide resolved
tests/php/src/Validation/ScannableURLProviderTest.php Outdated Show resolved Hide resolved
@johnwatkins0 johnwatkins0 requested a review from pierlon November 11, 2020 02:48
@johnwatkins0
Copy link
Contributor Author

Thanks, @pierlon. Updated and re-requesting review.

Copy link
Contributor

@pierlon pierlon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done!

…efactor-url-validation

* 'develop' of github.com:ampproject/amp-wp: (33 commits)
  Ensure global post is set for test_decorate_shortcode_and_filter_source
  Extract to plugins dir when unzipping PWA plugin
  Add tests for new markup
  Use fixed layout for amp-story-360 example
  Undo intrinsic layout for video since WP core uses responsive
  Use intrinsic layout of video
  Update amphtml spec to 2011140244000
  Update stable tag and tested up to
  Bump stable tag to 2.0.6; update docs
  Bump autoprefixer from 10.0.1 to 10.0.2
  Bump babel-loader from 8.1.0 to 8.2.1
  Setup composer via the shivammathur/setup-php action
  Bump core-js from 3.6.5 to 3.7.0
  Bump copy-webpack-plugin from 6.3.0 to 6.3.1
  Remove deprecated option "--no-suggest"
  Update composer.lock
  Bump eslint-plugin-jest from 24.1.0 to 24.1.3
  Bump eslint from 7.12.1 to 7.13.0
  Bump @wordpress/block-editor from 5.1.2 to 5.1.3
  Bump mini-css-extract-plugin from 1.2.1 to 1.3.1
  ...
Comment on lines 143 to 152
$result = $url_validation_provider->with_lock(
function () {
$this->validate_urls();
}
);

if ( is_wp_error( $result ) ) {
WP_CLI::error( 'The site cannot be crawled at this time because validation is running in another process.' );
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why is a lock necessary? What problems will occur if validation is happening in parallel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the validate_urls method ultimately leads to AMP_Validation_Manager::validate_url, which is an expensive function using wp_remote_get. In theory, without the lock, if someone runs the CLI command (or even multiple instances of the command) there could be multiple processes running a string of wp_remote_get calls at the same time. I suspect some hosting providers wouldn't like that. Do you think the lock is overcautious? Removing it would certainly simplify things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to note here, the lock is looking ahead to having the cron task in place, as well as any URL validation taking place within the plugins settings screen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm still not sure on the need for locking but we can revisit that later. It may be preferable for any new invocation of the validation provider to abort any existing validation tasks. Then you wouldn't have the experience of validation being performed in cron, and then being blocked when attempting to initiate validation from the admin screen.

Nevertheless, validation happening concurrently generally should not have negative consequences other than additional server load. Even if the same URL is being validated at the exact same time, what would go wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be following up shortly with the cron PR that uses the lock more. That will be a good time to revisit. My fundamental assumption is that hosting platforms like VIP might see multiple background processes running wp_remote_get and not liking that too much, but I might be overcomplicating things because I'm sure this is not the only plugin that does things like that.


$number_urls_to_crawl = $this->count_urls_to_validate();
$number_urls_to_crawl = count( $scannable_url_provider->get_urls() );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the obtained URLs be stored in a variable and check the count()? The URLs could then be passed down to the validate_urls() method. Otherwise, as it stands right now the get_urls() method is being called twice, with the same data (or not) being queried twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in de33111

*
* @param int $timeout Time in seconds. Default 300 seconds.
*/
return apply_filters( 'amp_validation_lock_timeout', 5 * MINUTE_IN_SECONDS );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case for the filter? Should this be @internal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I don't actually see an immediate use case for this filter, so I removed it in favor of a class constant in 13917b3

src/Validation/URLValidationProvider.php Outdated Show resolved Hide resolved
src/Validation/URLValidationProvider.php Outdated Show resolved Hide resolved
*
* @param string $url The URL to validate.
* @param string $type The type of template, post, or taxonomy.
* @param string $flag Flag determining whether the URL should be revalidated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A flag makes me think it is going to be a bitmask. Maybe $mode would be a better name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually, passing an arg for FLAG_FORCE_REVALIDATE vs FLAG_NO_REVALIDATE behavior may not be the most ergonomic. What if there were other methods for the other behaviors that wrap this one? If the flag is empty, then what is the expected behavior? Validate if it doesn't already exist? I think I see that now, but I had to look at it awhile to understand how it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised this in d87ad9d. My solution for now is to simply use a third argument for $force_revalidate. The reason for this is I found FLAG_NO_REVALIDATE is not being used on this branch at all. The no-revalidate use case came up in one of the subsequent features that depend on this branch, and if needed I will revisit when I get back to that feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great.

tests/php/src/Helpers/ValidationRequestMocking.php Outdated Show resolved Hide resolved
tests/php/test-class-amp-cli-validation-command.php Outdated Show resolved Hide resolved
@westonruter
Copy link
Member

(Sorry for the delay in reviewing.)

@westonruter westonruter merged commit 4e00fc5 into develop Nov 23, 2020
@westonruter westonruter deleted the feature/refactor-url-validation branch November 23, 2020 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Signed the Google CLA Validation WS:UX Work stream for UX/Front-end
Projects
None yet
Development

Successfully merging this pull request may close these issues.

URL validation CLI script contains logic that should be made available outside the CLI script
5 participants