
Potential locking race condition when using cron scheduler across multiple processes #731

Closed
mitchellhenke opened this issue Oct 23, 2022 · 15 comments

Comments

@mitchellhenke
Contributor

I've typically only seen this under load-testing scenarios with many job processes, but Postgres was spending a long time in pg_advisory_lock. I was only able to find pg_advisory_lock being used in GoodJob for cron-scheduled jobs (here and here). It looked like the other locking calls used the non-blocking pg_try_advisory_lock.

The time spent getting the lock seems to primarily be from the around_enqueue lock when a new cron job is created and then all of the processes try to lock the cron key to see if it is enqueued, and they end up waiting for one another.

The best way I've been able to reproduce it locally is have a frequently running cron-job (I chose every minute), and then running good_job start for the sample application in a handful of terminal panes (6 was sufficient on my machine). I have a branch/commit here. With 6 running good job processes, the average time spent calling pg_advisory_lock was around 33ms according to pg_stat_statements, and I suspect it grows as you add processes/cron jobs.

I've only used GoodJob with cron jobs where the concurrency total_limit is 1, so I'm not sure on a good way to help when using limits larger than that, but I have a small idea for the 1 case at least. Would it be feasible to switch to pg_try_advisory_lock if the limit is 1? I think there's no need to check the current concurrency for performing/enqueuing if someone else has the lock, so it should be fine for the current process to let it go? I've pushed up a branch with that idea here.
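For what it's worth, the intuition behind that idea can be sketched with an in-process analogy, using Ruby's Mutex#try_lock as a stand-in for pg_try_advisory_lock (illustrative only, not GoodJob's actual code):

```ruby
# In-process analogy: Mutex#try_lock standing in for pg_try_advisory_lock.
# A "process" that fails the non-blocking acquire assumes the lock holder
# will perform the enqueue check, so it can safely move on.
cron_lock = Mutex.new
enqueued = Queue.new

threads = 6.times.map do
  Thread.new do
    if cron_lock.try_lock
      begin
        enqueued << :job if enqueued.empty? # enqueue at most once
      ensure
        cron_lock.unlock
      end
    end
    # else: another process holds the cron key; skip instead of blocking
  end
end
threads.each(&:join)

puts enqueued.size
```

With the blocking pg_advisory_lock, all six would instead queue up and wait on one another, which matches the contention visible in pg_stat_statements.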

I'm a bit out of my depth on concurrency stuff like this, so do feel free to let me know if I'm off-base. Thank you again!

@bensheldon
Owner

bensheldon commented Oct 24, 2022

@mitchellhenke thanks for digging into this! It's complicated 😄

First, do you need to be using concurrency controls with total_limit: 1? There was a point in GoodJob's history where apps had to use concurrency controls to prevent multiple cron-processes from enqueuing; that stopped being necessary with #423 (v2.5.0).

Would it be feasible to switch to pg_try_advisory_lock if the limit is 1? I think there's no need to check the current concurrency for performing/enqueuing if someone else has the lock, so it should be fine for the current process to let it go?

I think it might be troublesome. Just because a key (job or cron) is locked doesn't necessarily mean the lock will be exited successfully, nor do I want to assume the lock has been taken by another actor for the same purpose as the current actor.

I'm curious what problem a 33ms query (or maybe several) causes on your system (other than cluttering up the slow query log)? I don't want to prematurely optimize, but something we could consider: moving the lock out of Postgres and into Ruby by trying to take a non-blocking lock (pg_try_advisory_lock) and then looping/sleeping in Ruby until successfully acquiring a lock.

@mitchellhenke
Contributor Author

First, do you need to be using concurrency controls with total_limit: 1? There was a point in GoodJob's history where apps had to use concurrency controls to prevent multiple cron-processes from enqueuing; that stopped being necessary with #423 (v2.5.0).

Thanks for pointing me to that, I had missed it. It doesn't seem like we would need the concurrency control, so I've proposed some changes to drop it from our usage.

I'm curious what problem a 33ms query (or maybe several) causes on your system (other than cluttering up the slow query log)? I don't want to prematurely optimize, but something we could consider: moving the lock out of Postgres and into Ruby by trying to take a non-blocking lock (pg_try_advisory_lock) and then looping/sleeping in Ruby until successfully acquiring a lock.

Apologies, 33ms was what I was able to measure locally; I had meant to include that it could be up to hundreds of milliseconds in larger deployments.

[screenshot of pg_stat_statements timings]

@mitchellhenke
Contributor Author

After dropping the concurrency controls, we did see some double-enqueues and double-runs and needed to re-enable it, which means I'm hoping to work on this issue. We did disable preserve_job_records at the same time, which may be a contributing factor.

A couple questions:

  1. Is it possible to evaluate the locking mechanism around cron jobs?
  2. If preserve_job_records is important to preventing double-enqueuing, would it be possible to add a configuration on jobs to specify which jobs should be preserved? Ideally I think we'd like to preserve only cron jobs in that case.

Lines from our logs showing multiple enqueues/runs for the same daily cron job:


job_class name timestamp
Job perform.active_job 2022-11-06 10:00:00.593
Job perform_start.active_job 2022-11-06 10:00:00.563
Job enqueue.active_job 2022-11-06 10:00:00.562
Job perform.active_job 2022-11-06 10:00:00.510
Job perform_start.active_job 2022-11-06 10:00:00.272
Job enqueue.active_job 2022-11-06 10:00:00.200
Job perform.active_job 2022-11-05 10:00:00.457
Job perform_start.active_job 2022-11-05 10:00:00.430
Job enqueue.active_job 2022-11-05 10:00:00.422
Job perform.active_job 2022-11-05 10:00:00.287
Job perform_start.active_job 2022-11-05 10:00:00.221
Job enqueue.active_job 2022-11-05 10:00:00.133

@bensheldon
Owner

bensheldon commented Nov 9, 2022

@mitchellhenke ah nuts! bummer 😞

Is it possible to evaluate the locking mechanism around cron jobs?

I'm open to doing this: moving the lock out of Postgres and into Ruby by trying to take a non-blocking lock (pg_try_advisory_lock) and then looping/sleeping in Ruby until successfully acquiring a lock. I don't think it's any worse than what we're doing now, and it has the benefit of not consuming a database connection while it waits. I guess the downside is that the lock wouldn't be first-in-first-out, which is probably fine; for cron it doesn't matter which process enqueues the job.
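A minimal sketch of that loop/sleep approach, again using a Mutex as a stand-in for SELECT pg_try_advisory_lock(...) (names and intervals here are illustrative, not a proposed API):

```ruby
# Poll a non-blocking lock with a short sleep between attempts,
# instead of blocking inside Postgres.
lock = Mutex.new

def acquire_with_retry(lock, timeout: 2.0, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until lock.try_lock
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    sleep interval # the database connection would be free during this sleep
  end
  true
end

holder = Thread.new { lock.synchronize { sleep 0.2 } } # simulates another process
sleep 0.05 # give the holder a chance to take the lock first
acquired = acquire_with_retry(lock)
lock.unlock if acquired
holder.join
puts acquired
```

As noted, acquisition order is whoever polls at the right moment rather than first-in-first-out, which doesn't matter for cron.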

would it be possible to add a configuration on jobs to specify which jobs should be preserved? Ideally I think we'd like to preserve only cron jobs in that case.

I think that would be possible, though designing the interface for it might be more difficult. I would want it to be more general than "preserve cron records", which maybe leads to something like 😬 :

GoodJob.config.immediately_destroy_record = lambda do |good_job_execution_record|
  good_job_execution_record.cron_key.blank?
end

Thinking of workarounds, would it be possible for you to preserve all job records for a trivial period of time, like 5 minutes?
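For reference, that workaround would presumably be an initializer along these lines (assuming the standard GoodJob configuration keys):

```ruby
# config/initializers/good_job.rb
Rails.application.configure do
  config.good_job.preserve_job_records = true
  # 300 seconds = 5 minutes; old records are swept by the automatic cleanup task
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 300
end
```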

@zarqman
Contributor

zarqman commented Dec 6, 2022

I'm seeing duplicate cron executions in our production environment too. Confirmed on v3.6.2, but it goes back into v2.x.
We use preserve_job_records: :on_unhandled_error.

In testing, it seems like the issue shows up if the enqueuing time exceeds the execution time, potentially due to network/database latency during scheduling. I've found that it's easier to duplicate by adding this to the cron-scheduled job:

before_enqueue do |job|
  sleep(rand)
end

With this, I've managed to get 5 processes to run the same cron job as many as all 5 times.

Then adding sleep(2) into the job's perform() helps cause ActiveRecord::RecordNotUnique errors to again be generated during enqueue.

This makes the issue seem like a race condition between scheduling and execution--at least when not preserving records. Every process's scheduling needs to complete before the execution completes, else there are duplicates scheduled.

Given this, I'm not sure whether moving to pg_try_advisory_lock helps, although it could still reduce database connection usage as @bensheldon already mentioned. More, if latency variance was wide enough, I believe even using standard concurrency controls could fail to prevent duplicate executions. It's also possible a tiny bit of clock skew between servers running the GoodJob processes might exacerbate the issue.

Testing with preserving records for 5 minutes does work around the issue, albeit with the generation of numerous ActiveRecord::RecordNotUnique error messages.

FWIW, the job to cleanup old preserved jobs also executes too many times (once in each process per interval).

Would it make sense to add another table to store cron-job last-run times? Scheduling could then lock on the relevant row, as well as have a permanent record for the UI's Last Run column.

Depending on the underlying scheduler, this might also allow for catching up on missed jobs, which is definitely separate from the issue at hand, but something we would find useful so as to not miss cron job executions during deployments.

@zarqman
Contributor

zarqman commented Dec 6, 2022

One more thought: given our use of preserve_job_records: :on_unhandled_error, setting cleanup_preserved_jobs_before_seconds_ago to 5 minutes is less than ideal. We currently have it set to several days so we have time to review failed jobs. 5 minutes wouldn't be quite enough for that. 😄

@bensheldon
Owner

This makes the issue seem like a race condition between scheduling and execution--at least when not preserving records.

I think the "when not preserving records" is maybe the core issue on this. Pretty much all of the concurrency controlling aspects of Cron require a job record to be present for the duration of the window in which different cron processes attempt to insert a record.

A couple ideas:

  • Add a new good_job_cron_entries table that was used to track when the most recent cron run was for a given cron key. I would want it to be a 1-to-1 with cron-keys because I don't want yet another table to have to cleanup 😉
  • Don't delete job records that were inserted by cron. The cron process that is successful at inserting the job record could then also delete the previously inserted cron-related record. Hmm, I like the idea of that one.
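A hypothetical migration for the first idea, with one row per cron key upserted on each run so there is nothing extra to clean up (table and column names are made up here):

```ruby
class CreateGoodJobCronEntries < ActiveRecord::Migration[7.0]
  def change
    create_table :good_job_cron_entries, id: :uuid do |t|
      t.text :cron_key, null: false, index: { unique: true }
      t.datetime :last_run_at # could also back the UI's "Last Run" column
      t.timestamps
    end
  end
end
```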

@bensheldon
Owner

bensheldon commented Dec 10, 2022

It was really simple to not delete records that are inserted by cron: #767

Turns out that the job preservation and the automatic job destruction... are entirely independent of each other's configuration. I incorrectly thought that the automatic cleanup was only activated if the preserve-job configuration was activated.

Which maybe leads to this comment from @zarqman:

FWIW, the job to cleanup old preserved jobs also executes too many times (once in each process per interval).

We might want to address this separately as I'm not sure if "too many times" means "more than expected" or like "an operationally dangerous number of times" 😓

@zarqman
Contributor

zarqman commented Dec 11, 2022

@bensheldon It's "more than expected", but not operationally dangerous. With 5 goodjob daemons, it runs 5x at each interval. Given that it runs against an indexed field, I wouldn't anticipate any significant issue.

If it were an issue, one could disable the automatic cleanup and replace it with a cron job, now that they'll only run once after #767. 😆

@mitchellhenke
Contributor Author

Awesome, thank you, will give it a go soon!

@mitchellhenke
Contributor Author

I'll hopefully have some more concrete data on the locking soon, but it looks good so far.

I wasn't sure if it was worth starting a new issue (just let me know if I should), but one of the other factors in throughput when running more than a few GoodJob processes is the notify/listen feature. When enqueuing many jobs (I'd ballpark around tens of jobs/second), it looks like each process is pinged and then queries for jobs, which can generate a good amount of database activity.

Would you be open to a feature/configuration that allows disabling the notify/listen to rely solely on polling? It may also be helpful for those limited by database connections, but it's not a limitation at the moment for me.

@bensheldon
Owner

@mitchellhenke it's ok to continue the conversation here 😄

That makes sense that Listen/Notify could get noisy when enqueuing a lot of jobs. No objection to having the ability to disable it. A few thoughts:

  • To simply disable it, introducing configuration like enable_listen_notify (default: true) seems easy enough. It gets a little weird because the Listener has evolved to be more of a general-purpose Reactor, so it's not as simple as just not initializing it, but I think this block could just have the logic to jump immediately to the "else, sleep" part of the logic:

def wait_for_notify
  raw_connection = connection.raw_connection
  if raw_connection.respond_to?(:wait_for_notify)
    raw_connection.wait_for_notify(WAIT_INTERVAL) do |channel, _pid, payload|
      yield(channel, payload)
    end
  elsif raw_connection.respond_to?(:jdbc_connection)
    raw_connection.execute_query("SELECT 1")
    notifications = raw_connection.jdbc_connection.getNotifications
    Array(notifications).each do |notification|
      channel = notification.getName
      payload = notification.getParameter
      yield(channel, payload)
    end
    sleep WAIT_INTERVAL
  else
    sleep WAIT_INTERVAL
  end
end

  • Are you doing some sort of bulk-enqueue? Another idea is to allow the ability to enqueue without sending the NOTIFY. I dunno if that would be something like GoodJob.set(notify: false) { MyJob.perform_later }, adding an option to the adapter, or adding a true bulk-enqueue interface (e.g. Add perform_all_later to enqueue multiple jobs at once rails/rails#46603). Just thinking aloud 😊
  • Lastly, from a database performance perspective, I wonder if I should introduce some additional slack into the LISTEN/NOTIFY system so that all the workers don't try to query the database on top of each other during a NOTIFY. Either some random jitter or maybe something like a sleep-sort based on the number of jobs the scheduler is already executing. I dunno at what point the complexity goes beyond my mandate for "good enough for most", but I am very committed to supporting your project.
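For the first bullet, a runnable stand-in for the control flow might look like this (GoodJob's real method also handles the JDBC branch; enable_listen_notify is the proposed flag name, and the method signature here is simplified for illustration):

```ruby
WAIT_INTERVAL = 0.01

# When listen/notify is disabled, degrade straight to the "else, sleep" branch.
def wait_for_notify(listen_enabled:, raw_connection: nil)
  if listen_enabled && raw_connection.respond_to?(:wait_for_notify)
    raw_connection.wait_for_notify(WAIT_INTERVAL) do |channel, _pid, payload|
      yield(channel, payload)
    end
  else
    sleep WAIT_INTERVAL
  end
end

wait_for_notify(listen_enabled: false) # no connection needed; just sleeps
puts "slept without listening"
```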

@zarqman
Contributor

zarqman commented Jan 2, 2023

This sparked some thoughts--hope it's okay to wade in. 😄

Don't know if it applies in @mitchellhenke's case, but I believe Postgres dedupes NOTIFYs when inside a transaction and the payload is identical. So, bulk imports inside a single transaction might reduce the NOTIFYs (but that won't necessarily help large numbers of processes). Related, if scheduled_at isn't already rounded to whole seconds (a quick glance suggests maybe not), doing so might help ensure notification payloads can be deduped by Postgres.
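The rounding I mention could be as simple as truncating scheduled_at to whole seconds so that identical cron payloads compare equal (illustrative):

```ruby
require "time"

t = Time.parse("2022-11-06 10:00:00.593 UTC")
rounded = Time.at(t.to_i).utc # truncate sub-second precision
puts rounded.iso8601
```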

As an alternative to enabling/disabling notifies either globally or even within a block, could the job priority be used for this? I'm thinking something along the lines of skipping NOTIFY when :priority is below a certain threshold. That threshold value would probably need its own config, but otherwise this reuses much of ActiveJob's syntax.

Yet another idea would be to exclude notifies for certain queue names, again avoiding needing too much additional syntax.

Lastly, on the LISTEN side, does GoodJob have any ability to set a minimum interval since the last query for jobs in response to a notification? This would seem to effectively dedup excess inbound notifications a bit, especially if it included a touch of jitter. Perhaps this is what you're already thinking?
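Something like a minimum-interval guard with jitter could be sketched as follows (purely illustrative; not GoodJob's API):

```ruby
# Debounce inbound NOTIFYs: skip querying for jobs if a query happened
# within min_interval (plus a little random jitter, so processes spread out).
class NotifyDebouncer
  def initialize(min_interval: 0.5, jitter: 0.25)
    @min_interval = min_interval
    @jitter = jitter
    @last_query_at = 0.0
  end

  # Returns true if this notification should trigger a query for jobs.
  def query?(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    return false if now - @last_query_at < @min_interval + rand * @jitter
    @last_query_at = now
    true
  end
end

debouncer = NotifyDebouncer.new
results = [debouncer.query?(10.0), debouncer.query?(10.1), debouncer.query?(12.0)]
puts results.inspect
```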

@mitchellhenke
Contributor Author

To simply disable it, introducing configuration like enable_listen_notify (default: true) seems easy enough. It gets a little weird because the Listener has evolved to be more of a general-purpose Reactor, so it's not as simple as just not initializing it, but I think this block could just have the logic to jump immediately to the "else, sleep" part of the logic

Thanks, sounds good! I'll try to get a PR open in the next couple of weeks if I can.

Are you doing some sort of bulk-enqueue? Another idea is to allow the ability to enqueue without sending the NOTIFY. I dunno if that would be something like GoodJob.set(notify: false) { MyJob.perform_later }, or adding an option to the adapter, adding a true bulk-enqueue interface (e.g. Add perform_all_later to enqueue multiple jobs at once rails/rails#46603). Just thinking aloud 😊

No, I don't do much bulk-enqueuing generally, but I could see wanting to limit the number of NOTIFYs for it. I don't have a strong sense of whether an option/config is best or if it could be handled sufficiently with your ideas below.

Lastly, from a database performance perspective, I wonder if I should introduce some additional slack into the LISTEN/NOTIFY system so that all the workers don't try to query the database on top of each other during a NOTIFY. Either some random jitter or maybe something like a sleep-sort based on the number of jobs the scheduler is already executing. I dunno at what point the complexity goes beyond my mandate for "good enough for most", but I am very committed to supporting your project.

The use case I primarily work on is probably a little bit further out from the usual case. Having been on the open-source maintainer side, I would totally understand if you were not interested or willing to add complexity or compromises for a narrow use case. For my purposes, I think having the option to disable entirely is preferred and the jitter wouldn't be as useful.

@bensheldon
Owner

bensheldon commented Feb 6, 2023

@mitchellhenke fyi, I'm planning to release both #810 and #814 together because I think both pieces of functionality are useful (global and granular). I'm currently struggling with some flaky CI at the moment, but once I work through that I'll release 🙏🏻

Update: They're released 🚀
