Race conditions in ActiveJob concurrency extension #325
@codyrobbins this is fantastic analysis. Thank you for writing this up! And thank you for digging deeply into this new functionality too. I would say that at this point you are the experienced one 😄

I'll review your PR for the enqueue step; changing it from a `before_enqueue` to an `around_enqueue` callback makes sense to me. To explain how enqueue and perform are orchestrated differently between ActiveJob and GoodJob: enqueue happens on the ActiveJob side, before the job record exists, while perform happens inside GoodJob's execution step, after the job record has been fetched and advisory-locked.

This explanation is spot on with my understanding ➕

Would you mind sharing your code for deterministically ordering/counting the jobs? That sounds like a good solution. The concurrency controls are bleeding edge, for good and bad, but I want to ensure they work.
@bensheldon Thanks!! Ah, everything you wrote makes sense. The state I was calling ‘pre-perform’ or ‘not “really” performing yet’ in my comment above might more properly be called the execution step, then. In that case, the current code is basically using a count of the jobs in the execution state to limit which jobs enter the perform state, but since the critical section does not wrap entry into the execution state, we’re left with a situation where lots of jobs can be executing simultaneously, which then blocks perform.

Sure, the PR actually also has my fix for the perform step. The relevant changes are here: it’s basically ordering the locked jobs by when they began performing.
@codyrobbins thank you so much! I'll get your PR merged and released tomorrow. Another question while I have you, since you're now a GoodJob concurrency expert: do you have a perspective on the question in #317 (comment)?
I’ve been testing out the brand new concurrency controls for use in a production system, and unfortunately in practice they don’t appear to be working correctly. There seem to be two separate issues with the controls around enqueue and perform, but they both boil down to race conditions. I’ve created a pull request that I believe addresses the issues we’re seeing in our particular case, but I’m not familiar enough with the GoodJob or ActiveJob internals to be sure that what I’ve done here doesn’t introduce other problems I’m unaware of.
In case it makes a difference, I’ve been running into these particular issues with the following concurrency limits:
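For concreteness, a sketch of what that configuration looks like using GoodJob's `good_job_control_concurrency_with` helper. The limit values and the key shown here are illustrative placeholders, not necessarily our exact settings:

```ruby
class MyJob < ApplicationJob
  include GoodJob::ActiveJobExtensions::Concurrency

  # Illustrative values: never perform more than one job at a time, with room
  # to enqueue a couple of extra jobs to act as a debounce.
  good_job_control_concurrency_with(
    enqueue_limit: 3,  # assumption: the performing job plus two waiting jobs
    perform_limit: 1,
    key: -> { "#{self.class.name}-#{arguments.first}" }  # illustrative key
  )

  def perform(record_id)
    # ...
  end
end
```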
The goal here is to ensure that the job is never being performed more than once at the same time, but we can enqueue two additional jobs to act as a debounce rather than a throttle on job performance.
For enqueue, it appears as though the problem is that the advisory lock is released in the `before_enqueue` callback before the job actually gets enqueued, and this allows multiple jobs to be enqueued simultaneously, with the advisory lock essentially being useless.

(good_job/lib/good_job/active_job_extensions/concurrency.rb, lines 22 to 27 at 4833fb7)
Immediately after the advisory lock is relinquished at the end of this block, but before control is returned to the code that actually does the enqueueing, another job can enter the critical section and calculate an invalid `enqueue_concurrency`, because it won’t take into account the job that has just been allowed to enqueue but has not actually done so yet.

I’m guessing there are other internal GoodJob or ActiveJob callbacks that need to run after this code but before the enqueue occurs, which exacerbate this problem and make it more likely to occur, although even if there aren’t, I think instructions can still be interleaved in such a way that this happens. In practice it seems to happen a significant proportion of the time for me, and with the above concurrency configuration I was consistently getting more than two jobs enqueued at the same time.
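To make the window concrete, here is a rough sketch of the shape of the `before_enqueue` check. The helper names are placeholders, not GoodJob's actual code:

```ruby
# Inside the job class (or the extension's `included do ... end` block):
before_enqueue do |job|
  key = job.good_job_concurrency_key     # assumption: however the key is derived

  with_advisory_lock_on(key) do          # hypothetical lock helper
    enqueue_concurrency = unfinished_job_count_for(key)  # hypothetical count helper
    throw :abort if enqueue_concurrency >= enqueue_limit
  end
  # The advisory lock is released here, but the job row has NOT been inserted
  # yet. Another process can now take the lock, run the same count, and also
  # conclude it is allowed to enqueue, exceeding the limit.
end
```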
The simple solution I came up with that seems to resolve this problem is changing the before callback to an around one. This ensures the advisory lock is not released until the enqueue operation has actually occurred and that no other jobs can enter the critical section in the meantime.
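The same check restructured as an around callback, again as a sketch with the same placeholder helpers rather than the exact code in the PR:

```ruby
# Inside the job class (or the extension's `included do ... end` block):
around_enqueue do |job, block|
  key = job.good_job_concurrency_key     # assumption, as above

  with_advisory_lock_on(key) do
    enqueue_concurrency = unfinished_job_count_for(key)
    # Skipping block.call aborts the enqueue; calling it performs the actual
    # INSERT while the advisory lock is still held, so the next holder of the
    # lock will see this job in its count.
    block.call if enqueue_concurrency < enqueue_limit
  end
end
```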
For perform, the problem we’re seeing is a huge amount of lock contention that in practice introduces heavy, unnecessary delays to queue processing, and in certain circumstances it seems to block the jobs from running indefinitely.
When a job begins to be performed, its row is locked here:

(good_job/lib/good_job/job.rb, line 161 at 4833fb7)
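Roughly, the order of operations when an executor picks up a job looks like this. This is a sketch with placeholder names, not GoodJob's internals:

```ruby
job_record = fetch_next_eligible_job    # hypothetical: pick an unfinished job
job_record.with_advisory_lock do        # the row is now locked, i.e. it "looks" executing
  # Only while holding the lock does ActiveJob run the job, which is when the
  # perform callbacks (including the concurrency check) fire.
  ActiveJob::Base.execute(job_record.serialized_params)
end
```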
The calculation of perform concurrency in `before_perform` then basically counts the number of jobs that have had their rows locked in this manner:

(good_job/lib/good_job/active_job_extensions/concurrency.rb, line 44 at 4833fb7)
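That count is, in effect, something like the following. Again a sketch; the `advisory_locked` scope and the exact query are assumptions based on GoodJob's schema:

```ruby
# Inside the job class (or the extension), roughly:
before_perform do |job|
  key = job.good_job_concurrency_key    # assumption, as above

  perform_concurrency = GoodJob::Job
                          .where(concurrency_key: key)
                          .advisory_locked  # assumption: scope for rows an executor has locked
                          .count

  # Every contender counts itself and every other locked-but-waiting job, so
  # they can all conclude the limit is exceeded at the same time.
  if perform_concurrency > perform_limit   # perform_limit: the configured limit
    raise GoodJob::ActiveJobExtensions::Concurrency::ConcurrencyExceededError
  end
end
```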
The problem is that, in practice with our configuration, by the time a job hits the critical section where `perform_concurrency` is calculated, other jobs have simultaneously already begun performing and had their rows locked in the same way, even if they’re stuck waiting for the advisory lock to enter the critical section. Thus they all end up calculating that there are too many jobs being performed, and they all back off, despite the fact that none of them are “really” performing yet. The same thing happens repeatedly until the attempts happen to interleave in such a way that this lock contention doesn’t occur. In a few cases, however, I have seen it settle into a relatively stable state where this happens continuously and seemingly indefinitely.

I think the problem is that `perform_concurrency` is really a count of the jobs that have entered a pre-performance state: none of them begin “actually” performing until they pass the critical section successfully.

The fix I came up with for this problem is to change the perform concurrency calculation to deterministically pick which jobs “should” be performing right now and raise for any job not in that set so that it gets rescheduled. This boils down to ordering the jobs by when they began performing and then allowing the first n jobs to proceed. There may be a few different ways of fixing this issue, but I’m not sure of their relative merits, and this seemed like a straightforward approach that works in practice for our application.
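A sketch of the idea, not the PR's exact code; the `advisory_locked` scope, the `performed_at` ordering, and the `active_job_id` lookup are assumptions based on GoodJob's schema:

```ruby
# Inside the job class (or the extension), roughly:
before_perform do |job|
  key = job.good_job_concurrency_key    # assumption, as above

  # Deterministically pick which of the currently locked ("executing") jobs may
  # proceed: the first perform_limit of them, ordered by when they began performing.
  allowed_active_job_ids = GoodJob::Job
                             .where(concurrency_key: key)
                             .advisory_locked
                             .order(:performed_at)
                             .limit(perform_limit)
                             .pluck(:active_job_id)

  unless allowed_active_job_ids.include?(job.job_id)
    # Not in the allowed set: raise so this job gets rescheduled instead of
    # every contender backing off at once.
    raise GoodJob::ActiveJobExtensions::Concurrency::ConcurrencyExceededError
  end
end
```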
I’m not a concurrency expert by any means, nor overly familiar with ActiveJob in general, so as I said there may be more issues going on here than I’m aware of, and these particular fixes may be unsuitable for a variety of reasons, but hopefully it’s at least a start. Happy to provide more details if you need anything else to help surface the problem!