
Version v5.0.5 might have introduced a breaking change in while_executing and was not documented #230

Closed
gtc18728 opened this issue Jun 14, 2017 · 10 comments

Comments

@gtc18728

Hi,

After updating the gem on my server I noticed that jobs run differently. It seems that the v5.0.5 change of the mutex from instance-level to class-level changed the behaviour, and sadly it was not documented.

Previously, when using while_executing, I was able to limit uniqueness within the same class by passing different unique_args, meaning jobs with different unique arguments could still run in parallel.

For example, a given User could not run multiple Work jobs at once, but several different Users could run their Work jobs in parallel. Consider these jobs in Sidekiq, enqueued simultaneously:

  1. Work(unique_args: ['user_1']) - this is a job enqueued by user_1
  2. Work(unique_args: ['user_1']) - this is a job enqueued by user_1
  3. Work(unique_args: ['user_2']) - this is a job enqueued by user_2

So in the previous version of the gem, jobs 1 and 3 would run in parallel (since their unique_args values differ), and job 2 would wait for job 1 to end.

After the change (and I'm only assuming this) all Work jobs share the same mutex, therefore only one Work job can run at a time, meaning that in the above example job 3 would wait for job 1 and/or 2 to finish before actually running.

Is there a way to get the previous behaviour other than pinning to a version before v5.0.5?
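
For reference, the worker in question is set up roughly like this (a simplified sketch; the class name and argument shape are illustrative):

class Work
  include Sidekiq::Worker
  # Only one Work job per user should run at a time; jobs for
  # different users should be free to run in parallel.
  sidekiq_options unique: :while_executing,
                  unique_args: ->(args) { [args.first] }

  def perform(user_id)
    # long-running work for a single user
  end
end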

@jianghong

This may also be causing some issues that we are seeing. We have a system with very high throughput.

  • Currently on 4.0.17
  • Update to latest version of sidekiq unique jobs
  • Deploy
  • One of our jobs, which uses until_and_while_executing, stops being enqueued completely. All other jobs are unaffected. This is the only job that uses until_and_while_executing, and no other jobs use while_executing either.

I can't reproduce this behaviour in a dev or staging environment because we can't replicate the amount of traffic that production gets. I suspect it has something to do with the Mutex lock change in this commit.
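
For context, the affected worker is configured roughly like this (simplified sketch; the class name is made up):

class HighThroughputJob
  include Sidekiq::Worker
  # The only worker in our app using this lock: duplicates are dropped
  # while a job is queued, and a runtime lock is held while it executes.
  sidekiq_options unique: :until_and_while_executing

  def perform(account_id)
    # work
  end
end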

@kchasel

kchasel commented Jun 23, 2017

We also observed a similar issue. After upgrading from 4.0.17 to 5.0.8, our system processed only one job at a time, regardless of job class or arguments. We are also using the while_executing approach. According to the documentation in the readme, our setup should only have prevented jobs with identical class names AND arguments from running concurrently.

@michaeldever

@jianghong Looks like pulling the mutex out into a constant might have broken it? A constant has the same value across instances, so if it's locked in one, it's locked in all of them. The previous approach was a per-instance mutex, which makes more sense.
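
Roughly the difference, as a minimal sketch (not the gem's actual code):

# Constant mutex: shared by every instance, so unrelated jobs serialize.
class SharedMutexLock
  MUTEX = Mutex.new

  def execute
    MUTEX.synchronize { yield }
  end
end

# Per-instance mutex (the old approach): each lock object has its own
# Mutex, so only callers sharing the same instance serialize.
class PerInstanceLock
  def initialize
    @mutex = Mutex.new
  end

  def execute
    @mutex.synchronize { yield }
  end
end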

@jianghong

Yeah, I think it's a good bet that's the culprit, but I'd like to hear @mhenrixon's thoughts

@glennfu

glennfu commented Jul 6, 2017

I don't know if I'm having the exact same problem, but it sounds similar enough, so I figure I'll drop my two cents:

I'm running 5.0.8 with :until_executed and unique_args: ->(args) { args.first.except('job_id') }. Jobs run across multiple Sidekiq queues, one queue per client.

We had a client's server go down, so some of our jobs connecting to it were failing. For some jobs, the job's rescue_from calls retry_job wait: time; others just fail gracefully. A clock job enqueues frequent tasks that need to be run. Under normal circumstances, even as client servers have spotty connections, the total number of jobs in the production queue stays under 50-100 all day long. This morning we looked and saw the queue was over 50,000. Interestingly, only one queue was affected.

We ran some code that searches for duplicates based on key = "#{args["job_class"]}-#{args["arguments"]}" (where args = job.args[0]) and found that, of 57599 jobs, 56079 were duplicates. It seemed like all job classes were affected and severely duplicated.
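
The scan was roughly this (reconstructed sketch; the queue name is illustrative):

require 'sidekiq/api'

counts = Hash.new(0)
Sidekiq::Queue.new('production').each do |job|
  args = job.args[0]
  counts["#{args['job_class']}-#{args['arguments']}"] += 1
end

total      = counts.values.inject(0, :+)
duplicates = total - counts.size
puts "#{duplicates} of #{total} jobs are duplicates"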

Like others have said, I can't seem to reproduce this on my own because I can't figure out how to simulate similar load, but there's definitely some scenario under which this gem's duplicate checking seems to break.

@mhenrixon
Owner

I'll obviously have to rethink the mutex. The problem with the instance mutex was at the other end of the spectrum. I'll get back to you on this.

@mhenrixon
Owner

My biggest problem with it is that it is extremely hard to test, but I was working on that.

@shlajin

shlajin commented Jul 12, 2017

An easy way of replicating the issue:

class TestJob
  include Sidekiq::Worker
  sidekiq_options unique: :while_executing,
                  unique_args: :unique_args

  def perform(object = nil)
    return if object.nil?
    item_id = object['item_id']
    user_id = object['user_id']
    puts "Start Test job: #{item_id} / #{user_id}"
    sleep 1
    puts "End Test job: #{item_id} / #{user_id}"
  end

  def self.unique_args(args)
    puts "User is #{args[0]['user_id']}"
    [ args[0]['user_id'] ]
  end
end

From the console:

4.times.map { |i| 2.times.map { |x| TestJob.perform_async({user_id: i, item_id: x}) } }

Output in the logs (note that even jobs with different user_ids run strictly one at a time):

2017-07-12T07:28:59.296Z 73394 TID-ovat22ll0 TestJob JID-2eedb9f333ab7febc83ee5a2 INFO: start
2017-07-12T07:28:59.297Z 73394 TID-ovat22mrs TestJob JID-cc98a22de72a7b1dae4edc0a INFO: start
Start Test job: 0 / 0
2017-07-12T07:28:59.299Z 73394 TID-ovat22nng TestJob JID-3e15c5f0c95c6e982a0ef887 INFO: start
2017-07-12T07:28:59.300Z 73394 TID-ovat22ko8 TestJob JID-855dfadf4174022e933e73df INFO: start
2017-07-12T07:28:59.301Z 73394 TID-ovat22jvc TestJob JID-6dbaa46c3239af0f1a0eee57 INFO: start
End Test job: 0 / 0
Start Test job: 1 / 0
2017-07-12T07:29:00.300Z 73394 TID-ovat22ll0 TestJob JID-2eedb9f333ab7febc83ee5a2 INFO: done: 1.004 sec
2017-07-12T07:29:00.301Z 73394 TID-ovat22ll0 TestJob JID-a3345c0dc5278e5c25da9768 INFO: start
End Test job: 1 / 0
2017-07-12T07:29:01.304Z 73394 TID-ovat22mrs TestJob JID-cc98a22de72a7b1dae4edc0a INFO: done: 2.006 sec
2017-07-12T07:29:01.305Z 73394 TID-ovat22mrs TestJob JID-9e4b5dec47ea2731da451366 INFO: start
Start Test job: 0 / 1
End Test job: 0 / 1
Start Test job: 1 / 1
2017-07-12T07:29:02.307Z 73394 TID-ovat22nng TestJob JID-3e15c5f0c95c6e982a0ef887 INFO: done: 3.008 sec
2017-07-12T07:29:02.309Z 73394 TID-ovat22nng TestJob JID-78c047457823bc7b3501a52a INFO: start
End Test job: 1 / 1
Start Test job: 0 / 2
2017-07-12T07:29:03.310Z 73394 TID-ovat22ko8 TestJob JID-855dfadf4174022e933e73df INFO: done: 4.009 sec
End Test job: 0 / 2
2017-07-12T07:29:04.311Z 73394 TID-ovat22jvc TestJob JID-6dbaa46c3239af0f1a0eee57 INFO: done: 5.01 sec
Start Test job: 1 / 2
End Test job: 1 / 2
Start Test job: 0 / 3
2017-07-12T07:29:05.317Z 73394 TID-ovat22ll0 TestJob JID-a3345c0dc5278e5c25da9768 INFO: done: 5.016 sec
End Test job: 0 / 3
Start Test job: 1 / 3
2017-07-12T07:29:06.322Z 73394 TID-ovat22mrs TestJob JID-9e4b5dec47ea2731da451366 INFO: done: 5.017 sec
End Test job: 1 / 3
2017-07-12T07:29:07.328Z 73394 TID-ovat22nng TestJob JID-78c047457823bc7b3501a52a INFO: done: 5.019 sec

@skalb

skalb commented Aug 18, 2017

@mhenrixon Out of curiosity, what would be the downside of running the older version using an instance mutex? AFAIK we ran it in production without any issues.

@mhenrixon
Owner

mhenrixon commented Aug 20, 2017

@skalb, @shlajin, @jianghong, @glennfu, @michaeldever, @gtc18728, @kchasel, if you want to help test the branch better-runtime-locks, which contains #241, that would be amazing. I am reworking the while_executing lock, but before I optimise anything I want to see if it works better for people. I don't run anything in production that uses the uniqueness at the moment, and my test application is far from an optimal way of testing this.
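
To test the branch, point your Gemfile at it, something like:

gem 'sidekiq-unique-jobs', github: 'mhenrixon/sidekiq-unique-jobs', branch: 'better-runtime-locks'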
