Potential locking race condition when using cron scheduler across multiple processes #731
@mitchellhenke thanks for digging into this! It's complicated 😄 First, do you need to be using concurrency controls with cron?
I think it might be troublesome. Just because a key (job or cron) is locked doesn't necessarily mean the lock will be exited successfully, nor do I want to assume the lock has been taken by another actor for the same purpose as the current actor. I'm curious what problem a 33ms query (or maybe several) causes on your system, other than cluttering up the slow query log. I don't want to prematurely optimize, but something we could consider: moving the lock out of Postgres and into Ruby by having Ruby try to take a non-blocking lock (`pg_try_advisory_lock`) and retry with a short sleep if it isn't acquired.
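A minimal sketch of that idea, assuming an established ActiveRecord connection. The helper name is illustrative, not GoodJob's actual API, and the md5-based key hashing is just one way to map a string key onto an advisory-lock bigint:

```ruby
require "active_record"

# Derive a bigint advisory-lock key from a string (one common approach).
LOCK_SQL   = "SELECT pg_try_advisory_lock(('x' || substr(md5(%s), 1, 16))::bit(64)::bigint)"
UNLOCK_SQL = "SELECT pg_advisory_unlock(('x' || substr(md5(%s), 1, 16))::bit(64)::bigint)"

# Poll the non-blocking lock from Ruby with a sleep between attempts,
# instead of blocking inside Postgres with pg_advisory_lock.
def with_polled_advisory_lock(key, timeout: 5, interval: 0.1)
  conn = ActiveRecord::Base.connection
  quoted = conn.quote(key)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  until conn.select_value(format(LOCK_SQL, quoted))
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep(interval) # wait in Ruby, not in Postgres
  end

  begin
    yield
    true
  ensure
    conn.execute(format(UNLOCK_SQL, quoted))
  end
end
```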
Thanks for pointing me to that, I had missed it. It doesn't seem like we would need the concurrency control, so I've proposed some changes to drop it from our usage.
Apologies, 33ms was what I was able to reproduce locally; I had meant to include that it could be up to hundreds of milliseconds in larger deployments.
After dropping the concurrency controls, we did see some double-enqueues and double-runs and needed to re-enable them, which means this is still an issue I'm hoping to work on. We did disable preserving job records, which may be relevant. A couple of questions:
Lines from our logs showing multiple enqueues/runs for the same daily cron job:
@mitchellhenke ah nuts! bummer 😞
I'm open to doing this: moving the lock out of Postgres and into Ruby by having Ruby try to take a non-blocking lock (`pg_try_advisory_lock`) and retry with a short sleep.
I think that would be possible, though designing the interface for it might be more difficult. I would want it to be more general than "preserve cron records", which maybe leads to something like 😬:

```ruby
GoodJob.config.immediately_destroy_record = lambda do |good_job_execution_record|
  good_job_execution_record.cron_key.blank?
end
```

Thinking of workarounds, would it be possible for you to preserve all job records for a trivial period of time, like 5 minutes?
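For reference, the suggested workaround might look something like this; the option names are from GoodJob's README at the time and are worth double-checking:

```ruby
# Keep job records around briefly so cron's dedup can see them,
# then let the automatic cleanup sweep them away.
Rails.application.configure do
  config.good_job.preserve_job_records = true
  # Retain records for ~5 minutes, per the suggestion above.
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 5 * 60
end
```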
I'm seeing duplicate cron executions in our production environment too. Confirmed on v3.6.2, but it goes back into v2.x. In testing, the issue seems to show up when the enqueuing time exceeds the execution time, potentially due to network/database latency during scheduling. I've found it's easier to reproduce by adding this to the cron-scheduled job:

```ruby
before_enqueue do |job|
  sleep(rand)
end
```

With this, I've managed to get 5 processes to run the same cron job as many as all 5 times. Then adding …

This makes the issue seem like a race condition between scheduling and execution, at least when not preserving records. Every process's scheduling needs to complete before the execution completes, else there are duplicates scheduled. Given this, I'm not sure whether moving to `pg_try_advisory_lock` alone would solve it.

Testing with preserving records for 5 minutes does work around the issue, albeit with the generation of numerous preserved job records. FWIW, the job to clean up old preserved jobs also executes too many times (once in each process per interval).

Would it make sense to add another table to store cron-job last-run times? Scheduling could then lock on the relevant row, as well as have a permanent record for the UI's Last Run column. Depending on the underlying scheduler, this might also allow for catching up on missed jobs, which is definitely separate from the issue at hand, but something we would find useful so as to not miss cron job executions during deployments.
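A hypothetical sketch of that last-run-table idea. It assumes a `good_job_cron_runs` table with a unique index on `(cron_key, run_at)`; none of these names exist in GoodJob itself:

```ruby
class CronRun < ActiveRecord::Base
  self.table_name = "good_job_cron_runs"
end

def enqueue_cron(cron_key, run_at, job_class)
  # The unique index makes the INSERT itself the lock:
  # exactly one process wins the race for a given (cron_key, run_at).
  CronRun.create!(cron_key: cron_key, run_at: run_at)
  job_class.perform_later
rescue ActiveRecord::RecordNotUnique
  # Another process already recorded this run; skip enqueueing.
end
```

The winning row would also double as the permanent "Last Run" record for the UI, and rows with no corresponding execution could drive catch-up logic.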
One more thought: given our use of …
I think the "when not preserving records" is maybe the core issue on this. Pretty much all of the concurrency controlling aspects of Cron require a job record to be present for the duration of the window in which different cron processes attempt to insert a record. A couple ideas:
|
It was really simple to not delete records that are inserted by cron: #767. Turns out that job preservation and automatic job destruction are entirely independent of each other's configuration. I incorrectly thought that the automatic cleanup was only activated if job preservation was activated. Which maybe leads to this comment from @zarqman:

> FWIW, the job to cleanup old preserved jobs also executes too many times (once in each process per interval).

We might want to address this separately, as I'm not sure if "too many times" means "more than expected" or "an operationally dangerous number of times" 😓
@bensheldon It's "more than expected", but not operationally dangerous. With 5 GoodJob daemons, it runs 5x at each interval. Given that it runs against an indexed field, I wouldn't anticipate any significant issue. If it were an issue, one could disable the automatic cleanup and replace it with a cron job, now that they'll only run once after #767. 😆
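A sketch of that replacement, assuming GoodJob's cron configuration format and its `GoodJob.cleanup_preserved_jobs` helper; the job name is hypothetical and worth verifying against the README:

```ruby
# Run cleanup on a schedule instead of the per-process automatic sweep.
Rails.application.configure do
  config.good_job.cron = {
    cleanup: { cron: "0 * * * *", class: "CleanupPreservedJobsJob" }
  }
end

class CleanupPreservedJobsJob < ApplicationJob
  def perform
    # Assumption: the cleanup helper accepts an older_than retention window.
    GoodJob.cleanup_preserved_jobs(older_than: 1.day)
  end
end
```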
Awesome, thank you, will give it a go soon!
I'll hopefully have some more concrete data on the locking soon, but it looks good so far. I wasn't sure if it was worth starting a new issue (just let me know if I should), but another factor in throughput when running more than a few GoodJob processes is the notify/listen feature. When enqueuing many jobs (I'd ballpark around tens of jobs/second), it looks like each process is pinged and then queries for jobs, which can generate a good amount of database activity. Would you be open to a feature/configuration that allows disabling notify/listen to rely solely on polling? It may also be helpful for those limited by database connections, though that's not a limitation for me at the moment.
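For what it's worth, the configuration could look something like this; `enable_listen_notify` is my guess at a name for the proposed switch, while `poll_interval` is GoodJob's existing polling option:

```ruby
# Fall back to polling only: no LISTEN/NOTIFY pings between processes.
Rails.application.configure do
  config.good_job.enable_listen_notify = false # assumed name for the new option
  config.good_job.poll_interval = 5            # seconds between polls for new jobs
end
```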
@mitchellhenke it's ok to continue the conversation here 😄 That makes sense that Listen/Notify could get noisy when enqueuing a lot of jobs. No objection to having the ability to disable it. A few thoughts:
(see good_job/lib/good_job/notifier.rb, lines 221 to 239 at f683f37)
This sparked some thoughts--hope it's okay to wade in. 😄

Don't know if it applies in @mitchellhenke's case, but I believe Postgres dedupes identical `NOTIFY` messages issued within a single transaction.

As an alternative to enabling/disabling notifies either globally or even within a block, could the job priority be used for this? I'm thinking something along the lines of skipping `NOTIFY` for jobs below a certain priority, avoiding the need for much additional syntax.

Yet another idea would be to exclude notifies for certain queue names, again avoiding needing too much additional syntax.

Lastly, on the …
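To illustrate the dedupe point: Postgres may deliver only a single notification for identical channel/payload pairs signaled within one transaction, so if GoodJob's payload is the same for jobs on the same queue, a bulk enqueue wrapped in a transaction could collapse to one ping. `MyJob` is a placeholder:

```ruby
# All 1,000 enqueues share one transaction; identical NOTIFY payloads
# (same queue) can be deduplicated by Postgres, so listeners are pinged
# roughly once at commit rather than once per job.
ActiveRecord::Base.transaction do
  1_000.times { MyJob.perform_later }
end
```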
Thanks, sounds good! I'll try to get a PR open in the next couple of weeks if I can.
No, I don't do much bulk-enqueuing generally, but I could see wanting to limit the number of NOTIFYs for it. I don't have a strong sense of whether an option/config is best or if it could be handled sufficiently with your ideas below.
The use case I primarily work on is probably a little further out from the usual case. Having been on the open-source maintainer side, I would totally understand if you weren't interested in or willing to add complexity or compromises for a narrow use case. For my purposes, I think having the option to disable it entirely is preferred, and the jitter wouldn't be as useful.
@mitchellhenke fyi, I'm planning to release both #810 and #814 together because I think both pieces of functionality are useful (global and granular). I'm struggling with some flaky CI at the moment, but once I work through that I'll release 🙏🏻

Update: They're released 🚀
I've typically only seen this under load-testing scenarios with many job processes, but Postgres was spending a long time in `pg_advisory_lock`. I was only able to find `pg_advisory_lock` being used in GoodJob for cron-scheduled jobs (here and here). It looked like the other locking calls used the non-blocking `pg_try_advisory_lock`.

The time spent getting the lock seems to primarily be from the `around_enqueue` lock: when a new cron job is created, all of the processes try to lock the cron key to see if it is enqueued, and they end up waiting for one another.

The best way I've been able to reproduce it locally is to have a frequently running cron job (I chose every minute) and then run `good_job start` for the sample application in a handful of terminal panes (6 was sufficient on my machine). I have a branch/commit here. With 6 running GoodJob processes, the average time spent calling `pg_advisory_lock` was around 33ms according to `pg_stat_statements`, and I suspect it grows as you add processes/cron jobs.

I've only used GoodJob with cron jobs where the concurrency `total_limit` is 1, so I'm not sure of a good way to help when using limits larger than that, but I have a small idea for the `1` case at least. Would it be feasible to switch to `pg_try_advisory_lock` if the `limit` is 1? I think there's no need to check the current concurrency for performing/enqueuing if someone else has the lock, so it should be fine for the current process to let it go. I've pushed up a branch with that idea here.

I'm a bit out of my depth on concurrency stuff like this, so do feel free to let me know if I'm off-base. Thank you again!
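To illustrate the shape of the idea (the lock helpers here are hypothetical, not GoodJob's API; `good_job_concurrency_key` is the concurrency extension's key method):

```ruby
# When total_limit is 1, a non-blocking lock is enough: if another process
# already holds the key, it is doing (or has done) the enqueue, so this
# process can skip instead of queueing up behind pg_advisory_lock.
around_enqueue do |job, block|
  key = job.good_job_concurrency_key
  if try_advisory_lock(key) # hypothetical: SELECT pg_try_advisory_lock(...)
    begin
      block.call
    ensure
      advisory_unlock(key) # hypothetical: SELECT pg_advisory_unlock(...)
    end
  end
  # Lock not acquired: another process owns this cron key; do nothing.
end
```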