:until_executed jobs get stuck every now and then #379
Comments
Could it be because of restarts or removal of queues? Some people reported that... Would be helpful if you could share what versions you are on. |
Workers are scaled depending on load so yes this would be possible.
|
That is a tough nut to crack @blarralde. I'm not sure I can do much about that. It would help to see what is in your logs around the restart. If the worker is killed right before unlock it would leave a key. Also bump the gem one or two more versions; I solved some duplicate key problems in the last release. Not sure if it is the one you have there. |
You have the correct version there, should be as fine as I have been able to make it for now. |
Thanks @mhenrixon, I'll try to bump the version... Shouldn't this normally take care of deleting keys when workers are shut down?
|
The death handlers are for jobs that run out of retries; in newer versions of Sidekiq they also fire for jobs that have retry set to 0 or false. Sidekiq should put the job back on the queue if it crashes, but unfortunately in some situations, like when a worker is restarted and gets SIGKILL, it can't. I am sure there are still things I need to improve in the gem. Version 6 really tries hard to prevent duplicates. Previous versions of the gem, unfortunately, did not prevent duplicates as expected. I made the gem super strict for v6 on purpose. Still getting to some problems like the one you are experiencing. Couple of things I could do:
|
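For reference, a minimal sketch of the kind of cleanup hook being discussed here (and in the README section linked further down in this thread). The payload key and the Digests call are borrowed from the garbage collector below and may differ between gem versions, so treat this as an illustration under those assumptions rather than the gem's documented configuration:

Sidekiq.configure_server do |config|
  # Fires when a job dies (exhausts its retries). Clearing the leftover digest
  # lets the next unique job with the same arguments be enqueued again.
  config.death_handlers << ->(job, _exception) do
    digest = job['unique_digest'] # key name assumed; newer versions use 'lock_digest'
    SidekiqUniqueJobs::Digests.del(digest: digest) if digest
  end
end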
Here's a quick and dirty garbage collector, let me know what you think:

module SidekiqUniqueGarbageCollector
  extend self

  # Deletes every digest that no longer belongs to a busy, queued,
  # scheduled or retried job.
  def clean_unused
    unused_digests.each do |digest|
      SidekiqUniqueJobs::Digests.del(digest: digest)
    end
  end

  private

  def all_active_digests
    busy_digests + queued_digests
  end

  def all_digests
    # SIDEKIQ_NAMESPACE is assumed to be defined elsewhere (the redis-namespace
    # prefix); it is stripped so the digests match the un-namespaced job payloads.
    SidekiqUniqueJobs::Digests.all.map { |d| d.gsub(SIDEKIQ_NAMESPACE, '') }
  end

  def all_queues
    [
      Sidekiq::RetrySet.new,
      Sidekiq::ScheduledSet.new,
    ] + named_queues
  end

  def busy_digests
    # Digests of jobs currently being worked on.
    Sidekiq::Workers.new.map { |_process, _thread, msg| Sidekiq::Job.new(msg['payload'])['unique_digest'] }
  end

  def named_queues
    queue_names.map { |name| Sidekiq::Queue.new(name) }
  end

  def queued_digests
    # Digests of jobs waiting in any queue, the retry set or the scheduled set.
    all_queues.map { |queue| queue.map { |job| job['unique_digest'] } }.flatten.compact
  end

  def queue_names
    Sidekiq::Queue.all.map(&:name)
  end

  def unused_digests
    all_digests - all_active_digests
  end
end
|
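If you want to try this workaround, a hedged usage sketch (the invocation below is illustrative): run it manually from a Sidekiq/Rails console or wrap it in an infrequent scheduled job, keeping in mind the race condition the maintainer points out in the next reply.

# One-off cleanup from a console; only digests with no matching busy,
# queued, scheduled or retried job are removed.
SidekiqUniqueGarbageCollector.clean_unused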
would adding a |
@blarralde that looks really promising! I didn't even know I could find the busy jobs/digests like that. The only problem is that it would have to be in Lua, I think. When you run this code, something might change from one millisecond to the next. That isn't the case with Lua: Lua is like a database transaction lock; during script execution, nothing can be changed by another process. |
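To illustrate the atomicity point, a hedged sketch (the key and set names below are made up, and this is not the gem's actual Lua): a Redis Lua script performs the check and the delete in one step on the server, so no other client can change the data in between.

require 'redis'

redis = Redis.new

# Delete a digest key only if it is not a member of a set of active digests.
# The whole script executes atomically on the Redis server.
CLEANUP = <<~LUA
  if redis.call('SISMEMBER', KEYS[2], ARGV[1]) == 0 then
    return redis.call('DEL', KEYS[1])
  end
  return 0
LUA

redis.eval(CLEANUP, keys: ['uniquejobs:some-digest', 'uniquejobs:active'], argv: ['uniquejobs:some-digest'])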
@mhenrixon - just to double check - would implementing something like this help with the issue we have here? https://github.com/mhenrixon/sidekiq-unique-jobs#cleanup-dead-locks On another note, apparently one of my colleagues downgraded to 6.0.8 and the problem hasn't manifested in their project anymore. Could this be an issue introduced between that version and the latest? |
I’ll have a look at the diff early next week, but if you look at the previous issues, the updates since 6.0.8 attempt to improve this situation.
|
@mhenrixon any news on this? We have just run into this issue in our project. |
We stopped using sidekiq-unique-jobs for the particular job that we were having problems with and are now using redlock-rb which is behaving well and provides a lock expiration in case things go wrong. This is a job that runs frequently (once every couple of seconds). It reenqueues itself and we have a Sidekiq cron gem that will enqueue it every minute in case it happened to stop. As far as I can tell, the many other jobs that use sidekiq-unique-jobs have not had a problem. They don't run at this high frequency. Also, I don't think our problem was connected to restarts or removal of queues. We don't remove queues and I didn't see restarts happening around the time that the job stopped executing. |
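For context, a minimal sketch of the redlock-rb pattern described above (resource name, TTL and Redis URL are illustrative, not taken from the poster's setup):

require 'redlock'

lock_manager = Redlock::Client.new(['redis://localhost:6379'])

# Hold the lock for at most 60 seconds; if the worker dies while holding it,
# the lock expires on its own instead of blocking every future run.
lock_manager.lock('frequent-job-lock', 60_000) do |locked|
  if locked
    # do the frequent work here
  else
    # another worker holds the lock; skip this run
  end
end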
@maleksiuk maybe the performance is the problem for you? I'll see if I can add some performance tests/benchmarks for the locks. I am also planning on rewriting the internals after adding some better coverage. I can replicate the problem in a console but I cannot for the life of me replicate the problem with a test :( In your case it could be something with sidekiq-cron causing problems as well. For the rest of you sidekiq/sidekiq#4141 will most certainly help with the odd issue. The gem still has some weird problem with unlocking in certain conditions. I am really digging into this right now. |
Hello @mhenrixon, I know a very easy way to reproduce it in production: put Redis on another host and periodically cause disconnections between them. |
@simonoff do you mean you have a multiserver redis setup? |
No, we have a multi-server worker deployment, but only one Redis server. |
Don't know if this will help you guys find a fix, but I can +1 this issue. I just found a situation where an app running on Heroku got a maintenance restart (initiated by Heroku/Platform), failed to stop the process with SIGTERM, and was killed with SIGKILL. For some reason, even after receiving a SIGTERM, Sidekiq had jobs in the queue and continued to schedule jobs, and it seems this ended up with unique-job digests being left in limbo. Since these were jobs we run frequently without parameters, processing stopped completely for these tasks, but no info was available in the Sidekiq admin. Running the code above to clean out unused digests fixed the immediate problem for us, but we now have to consider how to handle these problems in the future, because it is most likely a common situation for us that the job runners get killed with SIGKILL. |
@soma @simonoff @ccleung @blarralde I am working on fixing it once and for all. It is turning out to be quite an endeavor but i am getting there ever so slowly. |
just an FYI, we were able to reproduce the issue on 6.0.8 |
I am going to close this issue. It has been fixed on master (version 7) so that no locks will stay locked forever. I'd like fresh new issues opened in the future so I can better track and fix things on the new version. I hope you guys don't mind! |
@KevinColemanInc is your problem also related to cron? If not could you open a new issue with all the details regarding your specific case? |
@mhenrixon Could you post a reference to the commit / PR that fixed this issue? |
Describe the bug
New unique jobs can't be added to the queue despite the last one being complete and no other similar job being in the queue. This seems to be related to the keys not being properly cleared.
The jobs in question are scheduled via cron and run every minute. The issue happens at random: sometimes spaced by a week, sometimes multiple times in a day.
Our workaround is to wipe out all unique keys when this happens.
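A hedged sketch of that wipe, reusing the same Digests API as the garbage collector earlier in the thread (assumes no redis-namespace prefix needs to be stripped first):

# Drops every unique digest, including those of jobs that are legitimately
# queued or running, so their locks are released as well.
SidekiqUniqueJobs::Digests.all.each do |digest|
  SidekiqUniqueJobs::Digests.del(digest: digest)
end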
Worker class
Sidekiq conf