
Potential locking race condition when using cron scheduler across multiple processes #731

Closed
mitchellhenke opened this issue Oct 23, 2022 · 15 comments

Comments

@mitchellhenke
Contributor

I've typically only seen this under load-testing scenarios with many job processes, but Postgres was spending a long time in pg_advisory_lock. I was only able to find pg_advisory_lock being used in GoodJob for cron-scheduled jobs (here and here). It looked like the other locking calls used the non-blocking pg_try_advisory_lock.

The time spent getting the lock seems to primarily be from the around_enqueue lock when a new cron job is created and then all of the processes try to lock the cron key to see if it is enqueued, and they end up waiting for one another.

The best way I've been able to reproduce it locally is have a frequently running cron-job (I chose every minute), and then running good_job start for the sample application in a handful of terminal panes (6 was sufficient on my machine). I have a branch/commit here. With 6 running good job processes, the average time spent calling pg_advisory_lock was around 33ms according to pg_stat_statements, and I suspect it grows as you add processes/cron jobs.

I've only used GoodJob with cron jobs where the concurrency total_limit is 1, so I'm not sure on a good way to help when using limits larger than that, but I have a small idea for the 1 case at least. Would it be feasible to switch to pg_try_advisory_lock if the limit is 1? I think there's no need to check the current concurrency for performing/enqueuing if someone else has the lock, so it should be fine for the current process to let it go? I've pushed up a branch with that idea here.
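For what it's worth, the intuition behind that idea can be sketched with an in-process analogy, using Ruby's Mutex#try_lock as a stand-in for pg_try_advisory_lock (illustrative only, not GoodJob's actual code):

```ruby
# In-process analogy: Mutex#try_lock standing in for pg_try_advisory_lock.
# A "process" that fails the non-blocking acquire assumes the lock holder
# will perform the enqueue check, so it can safely move on.
cron_lock = Mutex.new
enqueued = Queue.new

threads = 6.times.map do
  Thread.new do
    if cron_lock.try_lock
      begin
        enqueued << :job if enqueued.empty? # enqueue at most once
      ensure
        cron_lock.unlock
      end
    end
    # else: another process holds the cron key; skip instead of blocking
  end
end
threads.each(&:join)

puts enqueued.size
```

With the blocking pg_advisory_lock, all six would instead queue up and wait on one another, which matches the contention visible in pg_stat_statements.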

I'm a bit out of my depth on concurrency stuff like this, so do feel free to let me know if I'm off-base. Thank you again!

@bensheldon
Owner

bensheldon commented Oct 24, 2022

@mitchellhenke thanks for digging into this! It's complicated 😄

First, do you need to be using concurrency controls with total_limit: 1? There was a point in GoodJob's history where apps had to use concurrency controls to prevent multiple cron-processes from enqueuing; that stopped being necessary with #423 (v2.5.0).

Would it be feasible to switch to pg_try_advisory_lock if the limit is 1? I think there's no need to check the current concurrency for performing/enqueuing if someone else has the lock, so it should be fine for the current process to let it go?

I think it might be troublesome. Just because a key (job or cron) is locked doesn't necessarily mean the lock will be exited successfully, nor do I want to assume the lock has been taken by another actor for the same purpose as the current actor.

I'm curious what problem a 33ms query (or maybe several) causes on your system (other than cluttering up the slow query log)? I don't want to prematurely optimize, but something we could consider: moving the lock out of Postgres and into Ruby by trying to take a non-blocking lock (pg_try_advisory_lock) and then looping/sleeping in Ruby until successfully acquiring a lock.

@mitchellhenke
Contributor Author

First, do you need to be using concurrency controls with total_limit: 1? There was a point in GoodJob's history where apps had to use concurrency controls to prevent multiple cron-processes from enqueuing; that stopped being necessary with #423 (v2.5.0).

Thanks for pointing me to that, I had missed it. It doesn't seem like we would need the concurrency control, so I've proposed some changes to drop it from our usage.

I'm curious what problem a 33ms query (or maybe several) causes on your system (other than cluttering up the slow query log)? I don't want to prematurely optimize, but something we could consider: moving the lock out of Postgres and into Ruby by trying to take a non-blocking lock (pg_try_advisory_lock) and then looping/sleeping in Ruby until successfully acquiring a lock.

Apologies, 33ms was what I was able to measure locally; I had meant to include that it could be up to hundreds of milliseconds in larger deployments.

[screenshot of pg_stat_statements timings]

@mitchellhenke
Contributor Author

After dropping the concurrency controls, we did see some double-enqueues and double-runs and needed to re-enable it, which means I'm hoping to work on this issue. We did disable preserve_job_records at the same time, which may be a contributing factor.

A couple questions:

  1. Is it possible to evaluate the locking mechanism around cron jobs?
  2. If preserve_job_records is important to preventing double-enqueuing, would it be possible to add a configuration on jobs to specify which jobs should be preserved? Ideally I think we'd like to preserve only cron jobs in that case.

Lines from our logs showing multiple enqueues/runs for the same daily cron job:


job_class name timestamp
Job perform.active_job 2022-11-06 10:00:00.593
Job perform_start.active_job 2022-11-06 10:00:00.563
Job enqueue.active_job 2022-11-06 10:00:00.562
Job perform.active_job 2022-11-06 10:00:00.510
Job perform_start.active_job 2022-11-06 10:00:00.272
Job enqueue.active_job 2022-11-06 10:00:00.200
Job perform.active_job 2022-11-05 10:00:00.457
Job perform_start.active_job 2022-11-05 10:00:00.430
Job enqueue.active_job 2022-11-05 10:00:00.422
Job perform.active_job 2022-11-05 10:00:00.287
Job perform_start.active_job 2022-11-05 10:00:00.221
Job enqueue.active_job 2022-11-05 10:00:00.133

@bensheldon
Owner

bensheldon commented Nov 9, 2022

@mitchellhenke ah nuts! bummer 😞

Is it possible to evaluate the locking mechanism around cron jobs?

I'm open to doing this: moving the lock out of Postgres and into Ruby by trying to take a non-blocking lock (pg_try_advisory_lock) and then looping/sleeping in Ruby until successfully acquiring a lock. I don't think it's any worse than what we're doing now, and it has the benefit of not consuming a database connection while it waits. I guess the downside is that the lock wouldn't be first-in-first-out, which is probably fine; for cron it doesn't matter which process enqueues the job.
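A minimal sketch of that loop/sleep approach, again using a Mutex as a stand-in for SELECT pg_try_advisory_lock(...) (names and intervals here are illustrative, not a proposed API):

```ruby
# Poll a non-blocking lock with a short sleep between attempts,
# instead of blocking inside Postgres.
lock = Mutex.new

def acquire_with_retry(lock, timeout: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until lock.try_lock
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep interval # the database connection would be free during this sleep
  end
  true
end

holder = Thread.new { lock.synchronize { sleep 0.2 } } # simulates another process
sleep 0.05 # give the holder a chance to take the lock first
acquired = acquire_with_retry(lock)
lock.unlock if acquired
holder.join
puts acquired
```

As noted, acquisition order is whoever polls at the right moment rather than first-in-first-out, which doesn't matter for cron.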

would it be possible to add a configuration on jobs to specify which jobs should be preserved? Ideally I think we'd like to preserve only cron jobs in that case.

I think that would be possible, though designing the interface for it might be more difficult. I would want it to be more general than "preserve cron records", which maybe leads to something like 😬 :

GoodJob.config.immediately_destroy_record = lambda do |good_job_execution_record|
  good_job_execution_record.cron_key.blank?
end

Thinking of workarounds, would it be possible for you to preserve all job records for a trivial period of time, like 5 minutes?
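For reference, that workaround would presumably be an initializer along these lines (assuming the standard GoodJob configuration keys):

```ruby
# config/initializers/good_job.rb
Rails.application.configure do
  config.good_job.preserve_job_records = true
  # 300 seconds = 5 minutes; old records are swept by the automatic cleanup task
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 300
end
```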

@zarqman
Contributor

zarqman commented Dec 6, 2022

I'm seeing duplicate cron executions in our production environment too. Confirmed on v3.6.2, but it goes back into v2.x.
We use preserve_job_records: :on_unhandled_error.

In testing, it seems like the issue shows up if the enqueuing time exceeds the execution time, potentially due to network/database latency during scheduling. I've found that it's easier to duplicate by adding this to the cron-scheduled job:

before_enqueue do |job|
  sleep(rand)
end

With this, I've managed to get 5 processes to run the same cron job as many as all 5 times.

Then adding sleep(2) into the job's perform() helps cause ActiveRecord::RecordNotUnique errors to again be generated during enqueue.

This makes the issue seem like a race condition between scheduling and execution--at least when not preserving records. Every process's scheduling needs to complete before the execution completes, else there are duplicates scheduled.

Given this, I'm not sure whether moving to pg_try_advisory_lock helps, although it could still reduce database connection usage as @bensheldon already mentioned. More, if latency variance was wide enough, I believe even using standard concurrency controls could fail to prevent duplicate executions. It's also possible a tiny bit of clock skew between servers running the GoodJob processes might exacerbate the issue.

Testing with preserving records for 5 minutes does work around the issue, albeit with the generation of numerous ActiveRecord::RecordNotUnique error messages.

FWIW, the job to cleanup old preserved jobs also executes too many times (once in each process per interval).

Would it make sense to add another table to store cron-job last-run times? Scheduling could then lock on the relevant row, as well as have a permanent record for the UI's Last Run column.

Depending on the underlying scheduler, this might also allow for catching up on missed jobs, which is definitely separate from the issue at hand, but something we would find useful so as to not miss cron job executions during deployments.

@zarqman
Contributor

zarqman commented Dec 6, 2022

One more thought: given our use of preserve_job_records: :on_unhandled_error, setting cleanup_preserved_jobs_before_seconds_ago to 5 minutes is less than ideal. We currently have it set to several days so we have time to review failed jobs. 5 minutes wouldn't be quite enough for that. 😄

@bensheldon
Owner

This makes the issue seem like a race condition between scheduling and execution--at least when not preserving records.

I think the "when not preserving records" is maybe the core issue on this. Pretty much all of the concurrency controlling aspects of Cron require a job record to be present for the duration of the window in which different cron processes attempt to insert a record.

A couple ideas:

  • Add a new good_job_cron_entries table that was used to track when the most recent cron run was for a given cron key. I would want it to be a 1-to-1 with cron-keys because I don't want yet another table to have to cleanup 😉
  • Don't delete job records that were inserted by cron. The cron process that is successful at inserting the job record could then also delete the previously inserted cron-related record. Hmm, I like the idea of that one.
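A hypothetical migration for the first idea, with one row per cron key upserted on each run so there is nothing extra to clean up (table and column names are made up here):

```ruby
class CreateGoodJobCronEntries < ActiveRecord::Migration[7.0]
  def change
    create_table :good_job_cron_entries, id: :uuid do |t|
      t.text :cron_key, null: false, index: { unique: true }
      t.datetime :last_run_at # could also back the UI's "Last Run" column
      t.timestamps
    end
  end
end
```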

@bensheldon
Owner

bensheldon commented Dec 10, 2022

It was really simple to not delete records that are inserted by cron: #767

Turns out that the job preservation and the automatic job destruction... are entirely independent of each other's configuration. I incorrectly thought that the automatic cleanup was only activated if the preserve-job configuration was activated.

Which maybe leads to this comment from @zarqman:

FWIW, the job to cleanup old preserved jobs also executes too many times (once in each process per interval).

We might want to address this separately as I'm not sure if "too many times" means "more than expected" or like "an operationally dangerous number of times" 😓

@zarqman
Contributor

zarqman commented Dec 11, 2022

@bensheldon It's "more than expected", but not operationally dangerous. With 5 goodjob daemons, it runs 5x at each interval. Given that it runs against an indexed field, I wouldn't anticipate any significant issue.

If it were an issue, one could disable the automatic cleanup and replace it with a cron job, now that they'll only run once after #767. 😆

@mitchellhenke
Contributor Author

Awesome, thank you, will give it a go soon!

@mitchellhenke
Contributor Author

I'll hopefully have some more concrete data on the locking soon, but it looks good so far.

I wasn't sure if it was worth starting a new issue (just let me know if I should), but one of the other factors in throughput when running more than a few GoodJob processes is the notify/listen feature. When enqueuing many jobs (I'd ballpark around tens of jobs/second), it looks like each process is pinged and then queries for jobs, which can generate a good amount of database activity.

Would you be open to a feature/configuration that allows disabling the notify/listen to rely solely on polling? It may also be helpful for those limited by database connections, but it's not a limitation at the moment for me.

@bensheldon
Owner

@mitchellhenke it's ok to continue the conversation here 😄

That makes sense that Listen/Notify could get noisy when enqueuing a lot of jobs. No objection to having the ability to disable it. A few thoughts:

  • To simply disable it, introducing configuration like enable_listen_notify (default: true) seems easy enough. It gets a little weird because the Listener has evolved to be more of a general-purpose Reactor, so it's not as simple as just not initializing it, but I think this block could just have the logic to jump immediately to the "else, sleep" part of the logic:

def wait_for_notify
  raw_connection = connection.raw_connection
  if raw_connection.respond_to?(:wait_for_notify)
    raw_connection.wait_for_notify(WAIT_INTERVAL) do |channel, _pid, payload|
      yield(channel, payload)
    end
  elsif raw_connection.respond_to?(:jdbc_connection)
    raw_connection.execute_query("SELECT 1")
    notifications = raw_connection.jdbc_connection.getNotifications
    Array(notifications).each do |notification|
      channel = notification.getName
      payload = notification.getParameter
      yield(channel, payload)
    end
    sleep WAIT_INTERVAL
  else
    sleep WAIT_INTERVAL
  end
end

  • Are you doing some sort of bulk-enqueue? Another idea is to allow the ability to enqueue without sending the NOTIFY. I dunno if that would be something like GoodJob.set(notify: false) { MyJob.perform_later }, adding an option to the adapter, or adding a true bulk-enqueue interface (e.g. Add perform_all_later to enqueue multiple jobs at once rails/rails#46603). Just thinking aloud 😊
  • Lastly, from a database performance perspective, I wonder if I should introduce some additional slack into the LISTEN/NOTIFY system so that all the workers don't try to query the database on top of each other during a NOTIFY. Either some random jitter or maybe something like a sleep-sort based on the number of jobs the scheduler is already executing. I dunno at what point the complexity goes beyond my mandate for "good enough for most", but I am very committed to supporting your project.
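For the first bullet, a runnable stand-in for the control flow might look like this (GoodJob's real method also handles the JDBC branch; enable_listen_notify is the proposed flag name, and the method signature here is simplified for illustration):

```ruby
WAIT_INTERVAL = 0.01

# When listen/notify is disabled, degrade straight to the "else, sleep" branch.
def wait_for_notify(listen_enabled:, raw_connection: nil)
  if listen_enabled && raw_connection.respond_to?(:wait_for_notify)
    raw_connection.wait_for_notify(WAIT_INTERVAL) do |channel, _pid, payload|
      yield(channel, payload)
    end
  else
    sleep WAIT_INTERVAL
  end
end

wait_for_notify(listen_enabled: false) # no connection needed; just sleeps
puts "slept without listening"
```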

@zarqman
Contributor

zarqman commented Jan 2, 2023

This sparked some thoughts--hope it's okay to wade in. 😄

Don't know if it applies in @mitchellhenke's case, but I believe Postgres dedupes NOTIFYs when inside a transaction and the payload is identical. So, bulk imports inside a single transaction might reduce the NOTIFYs (but that won't necessarily help large numbers of processes). Related, if scheduled_at isn't already rounded to whole seconds (a quick glance suggests maybe not), doing so might help ensure notification payloads can be deduped by Postgres.
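The rounding I mention could be as simple as truncating scheduled_at to whole seconds so that identical cron payloads compare equal (illustrative):

```ruby
require "time"

t = Time.parse("2022-11-06 10:00:00.593 UTC")
rounded = Time.at(t.to_i).utc # truncate sub-second precision
puts rounded.iso8601
```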

As an alternative to enabling/disabling notifies either globally or even within a block, could the job priority be used for this? I'm thinking something along the lines of skipping NOTIFY when :priority is below a certain threshold. That threshold value would probably need its own config, but otherwise this reuses much of ActiveJob's syntax.

Yet another idea would be to exclude notifies for certain queue names, again avoiding needing too much additional syntax.

Lastly, on the LISTEN side, does GoodJob have any ability to set a minimum interval since the last query for jobs in response to a notification? This would seem to effectively dedup excess inbound notifications a bit, especially if it included a touch of jitter. Perhaps this is what you're already thinking?
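Something like a minimum-interval guard with jitter could be sketched as follows (purely illustrative; not GoodJob's API):

```ruby
# Debounce inbound NOTIFYs: skip querying for jobs if a query happened
# within min_interval (plus a little random jitter, so processes spread out).
class NotifyDebouncer
  def initialize(min_interval: 0.5, jitter: 0.25)
    @min_interval = min_interval
    @jitter = jitter
    @last_query_at = 0.0
  end

  # Returns true if this notification should trigger a query for jobs.
  def query?(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    return false if now - @last_query_at < @min_interval + rand * @jitter
    @last_query_at = now
    true
  end
end

debouncer = NotifyDebouncer.new
results = [debouncer.query?(10.0), debouncer.query?(10.1), debouncer.query?(12.0)]
puts results.inspect
```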

@mitchellhenke
Contributor Author

To simply disable it, introducing configuration like enable_listen_notify (default: true) seems easy enough. It gets a little weird because the Listener has evolved to be more of a general-purpose Reactor, so it's not as simple as just not initializing it, but I think this block could just have the logic to jump immediately to the "else, sleep" part of the logic

Thanks, sounds good! I'll try to get a PR open in the next couple of weeks if I can.

Are you doing some sort of bulk-enqueue? Another idea is to allow the ability to enqueue without sending the NOTIFY. I dunno if that would be something like GoodJob.set(notify: false) { MyJob.perform_later }, or adding an option to the adapter, adding a true bulk-enqueue interface (e.g. Add perform_all_later to enqueue multiple jobs at once rails/rails#46603). Just thinking aloud 😊

No, I don't do much bulk-enqueuing generally, but I could see wanting to limit the number of NOTIFYs for it. I don't have a strong sense of whether an option/config is best or if it could be handled sufficiently with your ideas below.

Lastly, from a database performance perspective, I wonder if I should introduce some additional slack into the LISTEN/NOTIFY system so that all the workers don't try to query the database on top of each other during a NOTIFY. Either some random jitter or maybe something like a sleep-sort based on the number of jobs the scheduler is already executing. I dunno at what point the complexity goes beyond my mandate for "good enough for most", but I am very committed to supporting your project.

The use case I primarily work on is probably a little bit further out from the usual case. Having been on the open-source maintainer side, I would totally understand if you were not interested or willing to add complexity or compromises for a narrow use case. For my purposes, I think having the option to disable entirely is preferred and the jitter wouldn't be as useful.

@bensheldon
Owner

bensheldon commented Feb 6, 2023

@mitchellhenke fyi, I'm planning to release both #810 and #814 together because I think both pieces of functionality are useful (global and granular). I'm currently struggling with some flaky CI at the moment, but once I work through that I'll release 🙏🏻

Update: They're released 🚀
