
Task: Re-examine the event locks implementation #233

WPprodigy opened this issue Dec 17, 2021 · 2 comments
1) Prevent OOM requests from perpetually consuming a lock slot.

Problem example: There are 10 lock slots for concurrent event running. Event A OOMs, and the process is killed off higher up the chain than at the request level. The plugin never gets a chance to free the lock, so the site-level lock is now perpetually stuck with just 9 slots. This slowly gets worse if OOMing events keep running every once in a while. Eventually all 10 slots might be deadlocked, at which point the cache key would expire since it is no longer being updated. But that can take a while, and cron becomes very backed up in the process.

Possible solution: Mentioned here, instead of the lock being a single cache key managed with incr/decr, each lock slot could be its own key with its own timestamp. This way we can free up locks individually when problems happen.
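A minimal sketch of that per-slot idea, assuming slot keys that each carry their own timestamp. The class and key names here are hypothetical, and a plain array stands in for the object cache; in production the atomicity would come from the cache backend's add():

```php
<?php
// Hypothetical per-slot lock sketch. An in-memory array stands in for the
// object cache; in production, wp_cache_add()'s "fail if key exists"
// semantics would make each slot claim atomic.
class SlotLocks {
	private array $cache = [];

	public function __construct( private int $slots, private int $ttl ) {}

	// Try each slot key in turn; returns the claimed slot, or null if full.
	public function acquire( int $now ): ?int {
		for ( $slot = 0; $slot < $this->slots; $slot++ ) {
			$key = "event_lock_slot_{$slot}";
			// Each slot has its own timestamp, so a slot held by a dead
			// (e.g. OOM-killed) process frees itself once the TTL passes.
			if ( isset( $this->cache[ $key ] ) && $now - $this->cache[ $key ] < $this->ttl ) {
				continue;
			}
			$this->cache[ $key ] = $now;
			return $slot;
		}
		return null; // all slots busy
	}

	public function release( int $slot ): void {
		unset( $this->cache[ "event_lock_slot_{$slot}" ] );
	}
}
```

The point of the sketch is the expiry check inside the loop: a slot that was never released stops blocking the queue on its own, instead of deadlocking the shared counter.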

Another possible solution: Use a DB table for locks instead - #84

Another possible solution: Have the Go runner catch a kill signal and send a follow-up cron-control request to clean up the event lock in those cases. The downside is that it's an extra request and only solves the problem for the Go runner, but it's a way to at least be notified when this happens (vs. relying on a cache timeout that may or may not accurately represent how long an event can run successfully).

2) Increase lock defaults

  • A default of 10 for site-level concurrency feels a bit conservative - it should be up to the runner implementation to decide on scale; we don't need to enforce it at the application level as well, IMO.
  • Perhaps even default to unlimited event-level concurrency. Instead of a whitelist, events that cannot run concurrently would add themselves to a disallow filter.

3) Drop locks completely?

Following the chain of thought from the point above, maybe we just drop the site-level concurrency locks completely. And event-level concurrency would default to "unlimited", but keep a simplified implementation that is used for events where concurrency is specifically disallowed.

WPprodigy (issue author) commented:
An additional problem worth noting / solving:

4) The same exact event can be run multiple times concurrently.

At the moment we complete/reschedule an event right before running it. This is what core WP does, but the problem is that core WP doesn't do concurrent event running. So this causes situations where the exact same event is run concurrently multiple times. Two examples:

Example 1: Say an event has a 60-second recurring schedule but can run for up to 2 minutes. Cron Control will pick up the event, reschedule it for +60 seconds, and start processing it. In a minute the event is due again, but the first run is still in progress, so Cron Control will pick it up again and run it as well. Now we have the same event running twice.

Example 2: The application repeatedly schedules a single event to run if it's not already scheduled with something like this:

if ( ! wp_next_scheduled( self::CRON_JOB_NAME, [ $process_id ] ) ) {
  wp_schedule_single_event( $now, self::CRON_JOB_NAME, [ $process_id ] );
}

At the moment, wp_next_scheduled() will fail even if the event is currently processing.

Solution: Set events to a running status while they are actively being processed. This will both prevent list-due-batch from returning events that are still processing, and help prevent an event from being registered again while its twin is still running.


WPprodigy commented Jan 24, 2022

Did a lot of testing and thinking about all the above, and I think we can solve pretty much all of these problems by setting events to a running status during processing. Here is what I'm thinking:

Mark events as "running" during event processing

Pretty straightforward idea. When an event is picked up to be run, we can do a query like: `UPDATE table SET status = 'running' WHERE ID = 123 AND status = 'pending'`.

  • If two requests try to process the same event at the same time, only 1 request will have a successful result from that query, so we know who has the right to continue processing and who should abort.
  • Throughout the codebase we count pending & running events as "active statuses". Meaning the event is still "registered", so application-code won't try to register the same event again until it is done processing.
  • list-due-batch CLI will only return pending events, so long-running but short-schedule recurring events for example won't have duplicate concurrent runs.
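The claim step above can be sketched as a conditional UPDATE whose affected-row count tells the caller whether it won. PDO + SQLite stands in for $wpdb here, and the table/column names are illustrative:

```php
<?php
// Sketch of the atomic pending -> running claim (illustrative schema).
$db = new PDO( 'sqlite::memory:' );
$db->exec( 'CREATE TABLE events ( ID INTEGER PRIMARY KEY, status TEXT )' );
$db->exec( "INSERT INTO events ( ID, status ) VALUES ( 123, 'pending' )" );

function claim_event( PDO $db, int $id ): bool {
	// The status check in the WHERE clause makes the claim atomic: only
	// one concurrent caller can flip pending -> running; everyone else
	// sees zero affected rows and knows to abort.
	$stmt = $db->prepare(
		"UPDATE events SET status = 'running' WHERE ID = ? AND status = 'pending'"
	);
	$stmt->execute( [ $id ] );
	return $stmt->rowCount() === 1;
}
```

The key design point is that the database itself arbitrates the race, so no separate lock is needed for the "who runs this event" decision.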

Event/action-level concurrency locks could be simplified to either allowed or disallowed concurrency.

Instead of whitelisting actions with the number of concurrent runs each can have, just have a boolean to signify "allowed_concurrency" or "disallowed_concurrency". We can keep backwards compatibility by checking the current whitelist filter: if the number is greater than 1, the action has "allowed_concurrency". This keeps the promise that there is no same-action event concurrency by default, but events marked as concurrency-safe no longer need the lock.

  • This allows us to greatly simplify and just use wp_cache_add/wp_cache_delete with a single key per action, removing a few concurrency issues that currently exist.
  • Deadlocks will only happen with concurrency-unsafe events, and the deadlocks will clear themselves up since the cache can be added with a timeout on it.
  • We don't need to really worry about 1 action hogging up the queue, because we already split up events so that each action has a chance to be returned in list-due-batch before a duplicate action is added.
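A rough sketch of what that boolean, single-key-per-action lock could look like. The helper names, key format, and TTL are all illustrative, and tiny in-memory functions stand in for wp_cache_add()/wp_cache_delete() (whose real add() fails when the key already exists, which is what makes the lock work):

```php
<?php
// Illustrative per-action lock using add() semantics on a single key.
$GLOBALS['cache'] = [];

function cache_add( string $key, $value, int $ttl ): bool {
	if ( array_key_exists( $key, $GLOBALS['cache'] ) ) {
		return false; // another request already holds this lock
	}
	$GLOBALS['cache'][ $key ] = $value; // a real cache would honor $ttl
	return true;
}

function cache_delete( string $key ): void {
	unset( $GLOBALS['cache'][ $key ] );
}

// Concurrency-safe actions skip locking entirely; unsafe ones take the
// per-action key. The TTL bounds any deadlock: a stuck key just expires.
function acquire_action_lock( string $action, bool $allowed_concurrency ): bool {
	if ( $allowed_concurrency ) {
		return true;
	}
	return cache_add( "action_lock_{$action}", time(), 600 );
}

function release_action_lock( string $action ): void {
	cache_delete( "action_lock_{$action}" );
}
```

Because acquire and release are a single add/delete on one key, the incr/decr bookkeeping (and its race conditions) disappears entirely.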

Site-level concurrency can also be simplified (or removed)

With events marked as running while they process, we just need to query for events with that status to get the count of events currently processing. If count( $currently_running_events ) >= SITE_CONCURRENCY_LIMIT, we can simply abort.

  • This helps remove the problem where 1 recurring & OOMing event slowly chips away at available lock slots.
  • It won't deadlock as often, because we can clean up running events whose last_modified timestamp is beyond a realistic run time. Each "lock" essentially has its own timer on it, without the need for an extra database table.
  • Admittedly it does allow for some race conditions wherein we'll end up exceeding the limit when multiple events are starting up at the same time.
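The counting-plus-cleanup idea above could look roughly like this. The schema, limit, max runtime, and the status used for cleaned-up rows are all assumptions, and PDO + SQLite again stands in for $wpdb:

```php
<?php
// Sketch of site-level throttling via the running status itself, with the
// self-clearing "timer" derived from last_modified (illustrative values).
const SITE_CONCURRENCY_LIMIT = 10;
const MAX_RUN_SECONDS        = 900; // assumed realistic max event runtime

$db = new PDO( 'sqlite::memory:' );
$db->exec( 'CREATE TABLE events ( ID INTEGER PRIMARY KEY, status TEXT, last_modified INTEGER )' );

function can_start_event( PDO $db, int $now ): bool {
	// Reap runners that likely died (e.g. OOM-killed requests): each
	// running row carries its own timer, so no extra lock table is needed.
	$stale = $db->prepare(
		"UPDATE events SET status = 'failed' WHERE status = 'running' AND last_modified < ?"
	);
	$stale->execute( [ $now - MAX_RUN_SECONDS ] );

	$running = (int) $db->query(
		"SELECT COUNT(*) FROM events WHERE status = 'running'"
	)->fetchColumn();

	return $running < SITE_CONCURRENCY_LIMIT;
}
```

Note the count-then-start sequence is not atomic, which is exactly the race condition conceded in the last bullet: several events starting simultaneously can briefly exceed the limit.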

Ultimately though, I still think we should remove the site-level lock entirely. The runner is already responsible for how many runWorkers are available.
