
Version v5.0.5 might have introduced a breaking change in while_executing and was not documented #230

Closed
gtc18728 opened this issue Jun 14, 2017 · 10 comments

Comments

@gtc18728

Hi,

After updating the gem on my server I noticed that jobs run differently. It seems that the v5.0.5 change of the mutex from instance-level to class-level changed the behaviour, and sadly it was not documented.

Previously, when using while_executing, I was able to limit uniqueness within the same class by passing different unique_args, meaning jobs with different unique arguments could still run in parallel.

For example, a given User could not run multiple Work jobs at once, but several different Users could run their Work jobs in parallel. Consider these jobs in Sidekiq, enqueued simultaneously:

  1. Work(unique_args: ['user_1']) - this is a job enqueued by user_1
  2. Work(unique_args: ['user_1']) - this is a job enqueued by user_1
  3. Work(unique_args: ['user_2']) - this is a job enqueued by user_2

So in the previous version of the gem, jobs 1 and 3 would run in parallel (since their unique_args values differ), and job 2 would wait for job 1 to end.

After the change (and I'm only assuming this) all Work jobs share the same mutex, therefore only one Work job can run at a time, meaning that in the above example job 3 would wait for job 1 and/or 2 to finish before actually running.

Is there a way to get the previous behaviour other than pinning to a version before v5.0.5?
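
For reference, the worker in question is set up roughly like this (a simplified sketch; the class name and argument shape are illustrative):

class Work
  include Sidekiq::Worker
  # Only one Work job per user should run at a time; jobs for
  # different users should be free to run in parallel.
  sidekiq_options unique: :while_executing,
                  unique_args: ->(args) { [args.first] }

  def perform(user_id)
    # long-running work for a single user
  end
end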

@jianghong

This may also be causing some issues that we are seeing. We have a system with very high throughput.

  • Currently on 4.0.17
  • Update to latest version of sidekiq unique jobs
  • Deploy
  • One of our jobs, which uses until_and_while_executing, stops being enqueued completely. All other jobs are unaffected. This is the only job that uses until_and_while_executing, and no other jobs use while_executing either.

I can't reproduce this behaviour in a dev or staging environment because we can't replicate the amount of traffic that production gets. I suspect it has something to do with the Mutex lock change in this commit.
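
For context, the affected worker is configured roughly like this (simplified sketch; the class name is made up):

class HighThroughputJob
  include Sidekiq::Worker
  # The only worker in our app using this lock: duplicates are dropped
  # while a job is queued, and a runtime lock is held while it executes.
  sidekiq_options unique: :until_and_while_executing

  def perform(account_id)
    # work
  end
end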

@kchasel

kchasel commented Jun 23, 2017

We also observed a similar issue. After upgrading from 4.0.17 to 5.0.8, our system processed only one job at a time, regardless of job class or arguments. We are also using the while_executing approach. According to the documentation in the readme, our setup should only have prevented jobs with identical class names AND arguments from running concurrently.

@michaeldever

@jianghong Looks like pulling the mutex out into a constant might have broken it? A constant has the same value across instances, so if it's locked in one, it's locked in all of them. The previous approach was a per-instance mutex, which makes more sense.
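
Roughly the difference, as a minimal sketch (not the gem's actual code):

# Constant mutex: shared by every instance, so unrelated jobs serialize.
class SharedMutexLock
  MUTEX = Mutex.new

  def execute
    MUTEX.synchronize { yield }
  end
end

# Per-instance mutex (the old approach): each lock object has its own
# Mutex, so only callers sharing the same instance serialize.
class PerInstanceLock
  def initialize
    @mutex = Mutex.new
  end

  def execute
    @mutex.synchronize { yield }
  end
end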

@jianghong

Yeah, I think it's a good bet that's the culprit, but I'd like to hear @mhenrixon's thoughts

@glennfu

glennfu commented Jul 6, 2017

I don't know if I'm having the exact same problem, but it sounds similar enough, so I figure I'll drop my two cents:

I'm running 5.0.8 with :until_executed and unique_args: ->(args) { args.first.except('job_id') }. Jobs run across multiple Sidekiq queues, one queue per client.

We had a client's server go down, so some of our jobs connecting to it were failing. For some jobs, the job's rescue_from calls retry_job wait: time; others just fail gracefully. A clock job enqueues frequent tasks that need to be run. Under normal circumstances, even as client servers have spotty connections, the total number of jobs in the production queue stays under 50-100 all day long. This morning we looked and saw the queue was over 50,000. Interestingly, only one queue was affected.

We ran some code that searches for duplicates based on key = "#{args["job_class"]}-#{args["arguments"]}" (where args = job.args[0]) and found that, of 57599 jobs, 56079 were duplicates. It seemed like all job classes were affected and severely duplicated.
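
The scan was roughly this (reconstructed sketch; the queue name is illustrative):

require 'sidekiq/api'

counts = Hash.new(0)
Sidekiq::Queue.new('production').each do |job|
  args = job.args[0]
  counts["#{args['job_class']}-#{args['arguments']}"] += 1
end

total      = counts.values.inject(0, :+)
duplicates = total - counts.size
puts "#{duplicates} of #{total} jobs are duplicates"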

Like others have said, I can't seem to reproduce this on my own because I can't figure out how to simulate similar load, but there's definitely some scenario under which this gem's duplicate checking seems to break.

@mhenrixon
Owner

I'll obviously have to rethink the mutex. The problem with the instance mutex was at the other end of the spectrum. I'll get back to you on this.

@mhenrixon
Owner

My biggest problem with it is that it is extremely hard to test, but I was working on that.

@shlajin

shlajin commented Jul 12, 2017

An easy way of replicating the issue:

class TestJob
  include Sidekiq::Worker
  sidekiq_options unique: :while_executing,
                  unique_args: :unique_args

  def perform(object = nil)
    return if object.nil?
    item_id = object['item_id']
    user_id = object['user_id']
    puts "Start Test job: #{item_id} / #{user_id}"
    sleep 1
    puts "End Test job: #{item_id} / #{user_id}"
  end

  def self.unique_args(args)
    puts "User is #{args[0]['user_id']}"
    [ args[0]['user_id'] ]
  end
end

From the console:

4.times.map { |i| 2.times.map { |x| TestJob.perform_async({user_id: i, item_id: x}) } }

Output in the logs (note that even jobs with different user_ids run strictly one at a time):

2017-07-12T07:28:59.296Z 73394 TID-ovat22ll0 TestJob JID-2eedb9f333ab7febc83ee5a2 INFO: start
2017-07-12T07:28:59.297Z 73394 TID-ovat22mrs TestJob JID-cc98a22de72a7b1dae4edc0a INFO: start
Start Test job: 0 / 0
2017-07-12T07:28:59.299Z 73394 TID-ovat22nng TestJob JID-3e15c5f0c95c6e982a0ef887 INFO: start
2017-07-12T07:28:59.300Z 73394 TID-ovat22ko8 TestJob JID-855dfadf4174022e933e73df INFO: start
2017-07-12T07:28:59.301Z 73394 TID-ovat22jvc TestJob JID-6dbaa46c3239af0f1a0eee57 INFO: start
End Test job: 0 / 0
Start Test job: 1 / 0
2017-07-12T07:29:00.300Z 73394 TID-ovat22ll0 TestJob JID-2eedb9f333ab7febc83ee5a2 INFO: done: 1.004 sec
2017-07-12T07:29:00.301Z 73394 TID-ovat22ll0 TestJob JID-a3345c0dc5278e5c25da9768 INFO: start
End Test job: 1 / 0
2017-07-12T07:29:01.304Z 73394 TID-ovat22mrs TestJob JID-cc98a22de72a7b1dae4edc0a INFO: done: 2.006 sec
2017-07-12T07:29:01.305Z 73394 TID-ovat22mrs TestJob JID-9e4b5dec47ea2731da451366 INFO: start
Start Test job: 0 / 1
End Test job: 0 / 1
Start Test job: 1 / 1
2017-07-12T07:29:02.307Z 73394 TID-ovat22nng TestJob JID-3e15c5f0c95c6e982a0ef887 INFO: done: 3.008 sec
2017-07-12T07:29:02.309Z 73394 TID-ovat22nng TestJob JID-78c047457823bc7b3501a52a INFO: start
End Test job: 1 / 1
Start Test job: 0 / 2
2017-07-12T07:29:03.310Z 73394 TID-ovat22ko8 TestJob JID-855dfadf4174022e933e73df INFO: done: 4.009 sec
End Test job: 0 / 2
2017-07-12T07:29:04.311Z 73394 TID-ovat22jvc TestJob JID-6dbaa46c3239af0f1a0eee57 INFO: done: 5.01 sec
Start Test job: 1 / 2
End Test job: 1 / 2
Start Test job: 0 / 3
2017-07-12T07:29:05.317Z 73394 TID-ovat22ll0 TestJob JID-a3345c0dc5278e5c25da9768 INFO: done: 5.016 sec
End Test job: 0 / 3
Start Test job: 1 / 3
2017-07-12T07:29:06.322Z 73394 TID-ovat22mrs TestJob JID-9e4b5dec47ea2731da451366 INFO: done: 5.017 sec
End Test job: 1 / 3
2017-07-12T07:29:07.328Z 73394 TID-ovat22nng TestJob JID-78c047457823bc7b3501a52a INFO: done: 5.019 sec

@skalb

skalb commented Aug 18, 2017

@mhenrixon Out of curiosity, what would be the downside of running the older version using an instance mutex? AFAIK we ran it in production without any issues.

@mhenrixon
Owner

mhenrixon commented Aug 20, 2017

@skalb, @shlajin, @jianghong, @glennfu, @michaeldever, @gtc18728, @kchasel, if you want to help test the branch better-runtime-locks, which contains #241, that would be amazing. I am reworking the while_executing lock, but before I optimise anything I want to see if it works better for people. I don't run anything in production that uses the uniqueness at the moment, and my test application is far from an optimal way of testing this.
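
To test the branch, point your Gemfile at it, something like:

gem 'sidekiq-unique-jobs', github: 'mhenrixon/sidekiq-unique-jobs', branch: 'better-runtime-locks'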
