MaxConcurrency does not work #55
I believe the issue may be in this Lua function:

```lua
local function canRun(lockKey, maxConcurrency)
  local activeJobs = tonumber(redis.call('get', lockKey))
  if (not maxConcurrency or maxConcurrency == 0) or (not activeJobs or activeJobs < maxConcurrency) then
    -- default case: maxConcurrency not defined or set to 0 means no cap on concurrent jobs OR
    -- maxConcurrency set, but lock does not yet exist OR
    -- maxConcurrency set, lock is set, but not yet at max concurrency
    return true
  else
    -- we are at max capacity for running jobs
    return false
  end
end
```

I am not a Redis expert by any means, but I think what might be happening is that the […]
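The predicate in that Lua function can be modeled as a plain function to make its three "allow" cases explicit. A minimal Go sketch (the function name and the use of -1 to stand in for a missing Redis key are illustrative assumptions, not the library's API):

```go
package main

import "fmt"

// canRun mirrors the Lua predicate above. In Redis, GET on a missing
// key yields nil; here -1 stands in for that missing-key case.
func canRun(activeJobs, maxConcurrency int) bool {
	if maxConcurrency == 0 {
		return true // no cap configured
	}
	if activeJobs < 0 {
		return true // lock key does not exist yet
	}
	return activeJobs < maxConcurrency // below the cap?
}

func main() {
	fmt.Println(canRun(-1, 1)) // missing lock key: true
	fmt.Println(canRun(0, 1))  // below cap: true
	fmt.Println(canRun(1, 1))  // at cap: false
	fmt.Println(canRun(5, 0))  // cap of 0 disables the limit: true
}
```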
Further research suggests that a proper locking implementation needs to be in place for this to work. A possible solution might be the Redlock algorithm, as described on the Redis website. I'd appreciate any feedback on this; I am not sure if I am completely off track here.
One critical piece you seem to be missing is that Lua scripts are atomic.
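That atomicity is why the check-then-increment in the script cannot race: Redis runs the whole Lua script as one unit, so no other worker's script can interleave between canRun and acquireLock. A Go analogy (a sketch, not the library's code; the mutex plays the role of Redis's script atomicity):

```go
package main

import (
	"fmt"
	"sync"
)

// lockCounter models the :lock counter; tryAcquire performs
// check-then-increment as one critical section, just as the Lua
// script does atomically on the Redis side. Names are illustrative.
type lockCounter struct {
	mu     sync.Mutex
	active int
	max    int
}

func (l *lockCounter) tryAcquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active < l.max {
		l.active++
		return true
	}
	return false
}

func main() {
	c := &lockCounter{max: 1}
	results := make(chan bool, 4)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // four workers race for one slot
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- c.tryAcquire()
		}()
	}
	wg.Wait()
	close(results)
	wins := 0
	for ok := range results {
		if ok {
			wins++
		}
	}
	fmt.Println(wins) // exactly one worker acquires the slot
}
```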
Yep, that'd be a pretty critical piece. Good to know, and thanks for the link! Given that, I'm not sure what could be going on. Perhaps the lock data in Redis got into a bad state when I upgraded the version of gowork, or some other weird thing happened. I'm going to try spinning down all my servers and deleting the lock keys, then spinning everything up fresh. If that doesn't work, I'm not sure what to do next to debug this further.
Finally had a bit of time this morning to look at this again. After further examination, I believe the issue lies here:

```lua
if haveJobs(jobQueue) and not isPaused(pauseKey) and canRun(lockKey, maxConcurrency) then
  acquireLock(lockKey, lockInfoKey, workerPoolID)
  res = redis.call('rpoplpush', jobQueue, inProgQueue)
  return {res, jobQueue, inProgQueue}
end
```

Specifically the […]. I believe the solution is twofold: […]

The second entry above is necessary to patch any running processes that may be affected by this issue. The other option is to have people go in and manually delete lock keys.
@tylerb this most recent commit should delegate lock cleanup to the reaper, which runs every ~10 min (see this Lua script). Can you confirm that you're still facing issues even with these latest reaper changes?
@shdunning I didn't know that functionality was in there. Thanks for the info! I didn't let my code run with […]. Regardless, I can't set […].

In addition, looking at this, it seems the reaper looks for dead processes and decrements the lock value by whatever value is stored in the hash for that specific process. In my testing, I've seen the lock value go negative in the hash. Given this, I feel it would be a good idea to either: […]
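The reaper behavior described here (subtract a dead pool's recorded count from the global lock counter) can be modeled in a few lines. A toy Go model under assumed names (maps standing in for the Redis string and hash; not the library's actual code), which also shows exactly where a corrupted per-pool entry would push the counter negative:

```go
package main

import "fmt"

// reapPool models the cleanup step described above: subtract the dead
// pool's recorded count in the :lock_info hash from the global :lock
// counter, then drop the hash entry. Key names are hypothetical.
func reapPool(lock, lockInfo map[string]int, lockKey, deadPoolID string) {
	lock[lockKey] -= lockInfo[deadPoolID]
	delete(lockInfo, deadPoolID)
}

func main() {
	lock := map[string]int{"jobs:foo:lock": 3}
	lockInfo := map[string]int{"pool-a": 2, "pool-b": 1}

	reapPool(lock, lockInfo, "jobs:foo:lock", "pool-a")
	fmt.Println(lock["jobs:foo:lock"]) // 1: pool-a's 2 jobs reclaimed

	// If a pool's hash entry has drifted above its true job count,
	// this same subtraction is where :lock can go negative.
	lockInfo["pool-b"] = 5 // corrupted entry, larger than reality
	reapPool(lock, lockInfo, "jobs:foo:lock", "pool-b")
	fmt.Println(lock["jobs:foo:lock"]) // -4
}
```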
@tylerb multiple items to address, bear with me.

RE: reap cycles: The reaper does run immediately at startup, but unfortunately in most CI/CD environments the first reap cycle ends up being a no-op because the dead worker pool's heartbeat expiration is less than the deadTime. Since the heartbeats are refreshed every 5 seconds, we've discussed adjusting the deadTime to something substantially less than 5 minutes for determining whether or not a worker pool is dead, which would increase the chances of successful reaping on that first reap cycle. I think this might be our best path forward to help avoid potential 10-minute deadlocks in your env.

RE: negative lock values: We noticed that too in our environments, and PR #52 was designed to address the issue. We've been running with these changes since 6/22 (and tested them in our staging env for weeks before that) and have no longer observed the negative lock values. If you do continue to see them, then we've more work to do!

RE: your 2 proposed changes above: […]
In general I'm not a huge fan of just correcting bad lock state on the fly. I'd rather understand the situations that are getting us into the bad state (like a dirty shutdown) and make the software more resilient to them. Thanks for the diligence here -- we'll work through this!
Thanks for your time and consideration on all this, @shdunning. Much appreciated. I am certainly new to all this stuff, so I appreciate your explanation and discussion.

Reap cycles: Adjusting the deadTime does sound like a much better approach. In addition, adding a delay to the first reap after startup may also prove useful; perhaps that would give dead workers a chance to expire.

Negative lock values: I got them immediately after setting […].

As for my two proposed changes: […]
I do not know exactly how it got into that state. I would assume that my jobs would terminate and decrement the lock value as part of a graceful shutdown. Perhaps Heroku hard-killed the binary before jobs completed, leaving the lock value in a bad state. Please let me know what else I can do to assist with this. Thanks again!
Come to think of it, I know Heroku has killed my worker processes many, many times in the past due to the bug I fixed in my previous PR: #54. As my Redis data has never been reset, it's certainly possible my lock values were all kinds of screwed up.
@tylerb yeah, unfortunately there was a bug in the reaper when […]. More soon.
Fixed by #58. |
This is going to be vague, as I'm currently trying to debug what's going on, but the `MaxConcurrency` option seems to be broken.

According to my current understanding:

For example, the `:lock` key for a job, under load, should be the same as the value of `MaxConcurrency`, but mine stays at 0.

The `:lock_info` key should have individual lock counts, per worker, that add up to the total of `:lock` for that job. However, the contents of that hash vary wildly between `-4` and `4`. I am testing with a `MaxConcurrency` of 1.

It seems there is a race condition in the increment/decrement logic. I have a worker pool size of 4, so the -4 and 4 variance could come from there. Perhaps the system is attempting to pull 4 jobs at once, and all 4 succeed in incrementing/decrementing?
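One way the -4 low end could arise is sketched below. This is a deliberately simplified model of a hypothesis, not the library's actual code path: if a worker's release (decrement) runs even when its acquire (increment) never succeeded, each unmatched release pushes the counter below zero, and four workers doing it once each from 0 gives exactly -4.

```go
package main

import "fmt"

// drain applies n unmatched releases to a lock counter. The name and
// scenario are assumptions for illustration only.
func drain(lock, unmatchedReleases int) int {
	for i := 0; i < unmatchedReleases; i++ {
		lock-- // release runs even though no acquire succeeded
	}
	return lock
}

func main() {
	// Pool size of 4: four unmatched releases from 0 give -4,
	// matching the low end of the variance described above.
	fmt.Println(drain(0, 4)) // -4
}
```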