
Task: Re-examine the event locks implementation #233

WPprodigy opened this issue Dec 17, 2021 · 2 comments
1) Prevent OOM requests from perpetually consuming a lock slot.

Problem example: There are 10 lock slots for concurrent event running. Event A OOMs, and the process is killed off higher up the chain than at the request level. The plugin never gets a chance to free the lock, so the site-level lock is now perpetually stuck with just 9 slots. This slowly gets worse if OOMing events keep running every once in a while. Eventually all 10 slots might be deadlocked, at which point the cache key would expire since it is no longer being updated. But that can take a while, and cron becomes very backed up in the process.

Possible solution: Mentioned here, instead of the lock being a single cache key managed with incr/decr, each lock slot could be its own key with its own timestamp. This way we can free up locks individually when problems happen.
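A minimal sketch of that per-slot idea, assuming slot keys that each carry their own timestamp. The class and key names here are hypothetical, and a plain array stands in for the object cache; in production the atomicity would come from the cache backend's add():

```php
<?php
// Hypothetical per-slot lock sketch. An in-memory array stands in for the
// object cache; in production, wp_cache_add()'s "fail if key exists"
// semantics would make each slot claim atomic.
class SlotLocks {
	private array $cache = [];

	public function __construct( private int $slots, private int $ttl ) {}

	// Try each slot key in turn; returns the claimed slot, or null if full.
	public function acquire( int $now ): ?int {
		for ( $slot = 0; $slot < $this->slots; $slot++ ) {
			$key = "event_lock_slot_{$slot}";
			// Each slot has its own timestamp, so a slot held by a dead
			// (e.g. OOM-killed) process frees itself once the TTL passes.
			if ( isset( $this->cache[ $key ] ) && $now - $this->cache[ $key ] < $this->ttl ) {
				continue;
			}
			$this->cache[ $key ] = $now;
			return $slot;
		}
		return null; // all slots busy
	}

	public function release( int $slot ): void {
		unset( $this->cache[ "event_lock_slot_{$slot}" ] );
	}
}
```

The point of the sketch is the expiry check inside the loop: a slot that was never released stops blocking the queue on its own, instead of deadlocking the shared counter.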

Another possible solution: Use a DB table for locks instead - #84

Another possible solution: Have the Go runner catch a kill signal and send a follow-up cron-control request to clean up the event lock in those cases. The downside is that it's an extra request and only solves the problem for the Go runner, but it's a way to at least be notified when this happens (vs. relying on a cache timeout that may or may not accurately represent how long an event can run successfully).

2) Increase lock defaults

  • A default of 10 for site-level concurrency feels a bit conservative - it should be up to the runner implementation to decide on scale; we don't need to enforce it at the application level as well, IMO.
  • Perhaps even default to unlimited event-level concurrency. Instead of a whitelist, events that cannot run concurrently would add themselves to a disallow filter.

3) Drop locks completely?

Following the chain of thought from the point above, maybe we just drop the site-level concurrency locks completely. And event-level concurrency would default to "unlimited", but keep a simplified implementation that is used for events where concurrency is specifically disallowed.

WPprodigy (issue author) commented:
An additional problem worth noting / solving:

4) The same exact event can be run multiple times concurrently.

At the moment we complete/reschedule an event right before running it. This is what core WP does, but the problem is that core WP doesn't do concurrent event running. So this causes situations where the exact same event is run concurrently multiple times. Two examples:

Example 1: Say an event has a 60-second recurring schedule but can run for up to 2 minutes. Cron Control will pick up the event, reschedule it for +60 seconds, and start processing it. In a minute the event is due again, but the first run is still in progress, so Cron Control will pick it up again and run it as well. Now we have the same event running twice.

Example 2: The application repeatedly schedules a single event to run if it's not already scheduled with something like this:

if ( ! wp_next_scheduled( self::CRON_JOB_NAME, [ $process_id ] ) ) {
  wp_schedule_single_event( $now, self::CRON_JOB_NAME, [ $process_id ] );
}

At the moment, wp_next_scheduled() will fail even if the event is currently processing.

Solution: Set events to a running status while they are actively being processed. This will both prevent list-due-batch from returning events that are still processing, and help prevent an event from being registered again while its twin is still running.


WPprodigy commented Jan 24, 2022

Did a lot of testing and thinking about all the above, and I think we can solve pretty much all of these problems by setting events to a running status during processing. Here is what I'm thinking:

Mark events as "running" during event processing

Pretty straightforward idea. When an event is picked up to be run, we can do a query like: `UPDATE table SET status = 'running' WHERE ID = 123 AND status = 'pending'`.

  • If two requests try to process the same event at the same time, only 1 request will have a successful result from that query, so we know who has the right to continue processing and who should abort.
  • Throughout the codebase we count pending & running events as "active statuses". Meaning the event is still "registered", so application-code won't try to register the same event again until it is done processing.
  • list-due-batch CLI will only return pending events, so long-running but short-schedule recurring events for example won't have duplicate concurrent runs.
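The claim step above can be sketched as a conditional UPDATE whose affected-row count tells the caller whether it won. PDO + SQLite stands in for $wpdb here, and the table/column names are illustrative:

```php
<?php
// Sketch of the atomic pending -> running claim (illustrative schema).
$db = new PDO( 'sqlite::memory:' );
$db->exec( 'CREATE TABLE events ( ID INTEGER PRIMARY KEY, status TEXT )' );
$db->exec( "INSERT INTO events ( ID, status ) VALUES ( 123, 'pending' )" );

function claim_event( PDO $db, int $id ): bool {
	// The status check in the WHERE clause makes the claim atomic: only
	// one concurrent caller can flip pending -> running; everyone else
	// sees zero affected rows and knows to abort.
	$stmt = $db->prepare(
		"UPDATE events SET status = 'running' WHERE ID = ? AND status = 'pending'"
	);
	$stmt->execute( [ $id ] );
	return $stmt->rowCount() === 1;
}
```

The key design point is that the database itself arbitrates the race, so no separate lock is needed for the "who runs this event" decision.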

Event/action-level concurrency locks could be simplified to either allowed or disallowed concurrency.

Instead of whitelisting actions with the number of concurrent runs each can have, just have a boolean to signify "allowed_concurrency" or "disallowed_concurrency". We can keep backwards compatibility by checking the current whitelist filter: if the number is greater than 1, the action has "allowed_concurrency". This keeps the promise that there is no same-action event concurrency by default, but events marked as concurrency-safe no longer need the lock.

  • This allows us to greatly simplify and just use wp_cache_add/wp_cache_delete with a single key per action, removing a few concurrency issues that currently exist.
  • Deadlocks will only happen with concurrency-unsafe events, and the deadlocks will clear themselves up since the cache can be added with a timeout on it.
  • We don't need to really worry about 1 action hogging up the queue, because we already split up events so that each action has a chance to be returned in list-due-batch before a duplicate action is added.
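A rough sketch of what that boolean, single-key-per-action lock could look like. The helper names, key format, and TTL are all illustrative, and tiny in-memory functions stand in for wp_cache_add()/wp_cache_delete() (whose real add() fails when the key already exists, which is what makes the lock work):

```php
<?php
// Illustrative per-action lock using add() semantics on a single key.
$GLOBALS['cache'] = [];

function cache_add( string $key, $value, int $ttl ): bool {
	if ( array_key_exists( $key, $GLOBALS['cache'] ) ) {
		return false; // another request already holds this lock
	}
	$GLOBALS['cache'][ $key ] = $value; // a real cache would honor $ttl
	return true;
}

function cache_delete( string $key ): void {
	unset( $GLOBALS['cache'][ $key ] );
}

// Concurrency-safe actions skip locking entirely; unsafe ones take the
// per-action key. The TTL bounds any deadlock: a stuck key just expires.
function acquire_action_lock( string $action, bool $allowed_concurrency ): bool {
	if ( $allowed_concurrency ) {
		return true;
	}
	return cache_add( "action_lock_{$action}", time(), 600 );
}

function release_action_lock( string $action ): void {
	cache_delete( "action_lock_{$action}" );
}
```

Because acquire and release are a single add/delete on one key, the incr/decr bookkeeping (and its race conditions) disappears entirely.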

Site-level concurrency can also be simplified (or removed)

With events marked as running while they process, we just need to query for events with that status to get the count of events currently processing. If count( $currently_running_events ) >= SITE_CONCURRENCY_LIMIT, we can simply abort.

  • This helps remove the problem where 1 recurring & OOMing event slowly chips away at available lock slots.
  • It won't deadlock as often, because we can clean up running events whose last_modified timestamp is beyond a realistic run time. Each "lock" essentially has its own timer on it, without the need for an extra database table.
  • Admittedly it does allow for some race conditions wherein we'll end up exceeding the limit when multiple events are starting up at the same time.
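The counting-plus-cleanup idea above could look roughly like this. The schema, limit, max runtime, and the status used for cleaned-up rows are all assumptions, and PDO + SQLite again stands in for $wpdb:

```php
<?php
// Sketch of site-level throttling via the running status itself, with the
// self-clearing "timer" derived from last_modified (illustrative values).
const SITE_CONCURRENCY_LIMIT = 10;
const MAX_RUN_SECONDS        = 900; // assumed realistic max event runtime

$db = new PDO( 'sqlite::memory:' );
$db->exec( 'CREATE TABLE events ( ID INTEGER PRIMARY KEY, status TEXT, last_modified INTEGER )' );

function can_start_event( PDO $db, int $now ): bool {
	// Reap runners that likely died (e.g. OOM-killed requests): each
	// running row carries its own timer, so no extra lock table is needed.
	$stale = $db->prepare(
		"UPDATE events SET status = 'failed' WHERE status = 'running' AND last_modified < ?"
	);
	$stale->execute( [ $now - MAX_RUN_SECONDS ] );

	$running = (int) $db->query(
		"SELECT COUNT(*) FROM events WHERE status = 'running'"
	)->fetchColumn();

	return $running < SITE_CONCURRENCY_LIMIT;
}
```

Note the count-then-start sequence is not atomic, which is exactly the race condition conceded in the last bullet: several events starting simultaneously can briefly exceed the limit.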

Ultimately though, I still think we should remove the site-level lock entirely. The runner is already responsible for how many runWorkers are available.
