Fix on_conflict strategy handling #629

Closed

Conversation

dsander
Contributor

@dsander dsander commented Aug 2, 2021

Hi! 👋

We found a bug related to the on_conflict option. This change fixes it for our use case and, hopefully, for the rest of the uniqueness strategies as well.

This fixes the exception posted below when using on_conflict: replace
and potentially other bugs related to on_conflict. Before commit 8c8d54c,
BaseLock#lock always returned either the acquired lock or the return
value of call_strategy. Since Middleware::Client#lock now only
yields when lock_instance.lock yields, we also have to yield inside
the lock instances when the lock was acquired via lock_failed (i.e. an
on_conflict strategy).

NoMethodError: undefined method `key?' for "f5d69f8fd2e1f3dde8cee02e":String
  from sidekiq/client.rb:197:in `atomic_push'
  from sidekiq/client.rb:190:in `block (2 levels) in raw_push'
  from redis.rb:2489:in `block in multi'
  from redis.rb:69:in `block in synchronize'
  from monitor.rb:202:in `synchronize'
  from monitor.rb:202:in `mon_synchronize'
  from redis.rb:69:in `synchronize'
  from redis.rb:2483:in `multi'
  from sidekiq/client.rb:189:in `block in raw_push'
  from connection_pool.rb:63:in `block (2 levels) in with'
  from connection_pool.rb:62:in `handle_interrupt'
  from connection_pool.rb:62:in `block in with'
  from connection_pool.rb:59:in `handle_interrupt'
  from connection_pool.rb:59:in `with'
  from sidekiq/client.rb:188:in `raw_push'
  from sidekiq/client.rb:74:in `push'
  from sidekiq/worker.rb:240:in `client_push'
  from sidekiq/worker.rb:215:in `perform_in'

#590

@dsander dsander closed this Aug 2, 2021
@dsander dsander reopened this Aug 2, 2021
@dsander dsander force-pushed the feature/fix-on-conflict-handling branch 2 times, most recently from 3687a7c to 34182dd Compare August 2, 2021 16:28
@dsander
Contributor Author

dsander commented Aug 2, 2021

Sorry, I have no idea why the specs are failing on CI. They pass locally using ruby 2.6.6

Owner

@mhenrixon mhenrixon left a comment

Thank you for reporting the issue and taking the time to try and fix it.

I believe this is not the fix that we need though. It is mixing up the concepts a bit.

This is likely the fault of the documentation (which leaves a lot to be desired, sorry about that).

I'll take some time to try and understand your problem later. I do believe you are misusing the strategy.

@@ -23,8 +23,8 @@ class UntilAndWhileExecuting < BaseLock
       # @yield to the caller when given a block
       #
       def lock(origin: :client)
-        return lock_failed(origin: origin) unless (token = locksmith.lock)
-        return yield token if block_given?
+        token = locksmith.lock || lock_failed(origin: origin)
Owner

The replace strategy should only be used for either the server or the client process if I remember correctly.

Secondly, the strategy should run instead of the lock. Any return value from the strategy is coincidental and should not cause the job to execute, which is what this implementation does.

In other words, this is not the right solution.

Owner

@mhenrixon mhenrixon Aug 3, 2021

So I had a look now, you can get this information if you validate your worker.

expect(MyWorker).to have_valid_sidekiq_options

sidekiq_options lock: :until_and_while_executing,
                on_conflict: {
                  client: :replace,
                  server: :reschedule
                }

Or, if you would rather raise and let the job be retried once it is able to get the lock:

sidekiq_options lock: :until_and_while_executing, 
                on_conflict: { 
                  client: :replace, 
                  server: :raise 
                }

Contributor Author

Sorry I messed up the formatting of the commit message. For us this is a regression caused by 8c8d54c. Before this change the test I added worked without an exception and behaved like we expected. We might have used a non-feature that wasn't intended to work. Since we are using :until_and_while_executing I don't think that the server on_conflict strategy will ever trigger. We are scheduling jobs on the client side to basically de-bounce events. For a group of events we only want one job to run X seconds after the last of the grouped events fired. This worked fine before and should never need a server on_conflict strategy because the jobs are scheduled to run in the future and should always only have one job per lock_args in the queue.
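For readers unfamiliar with the pattern, the de-bouncing described above can be sketched with a toy in-memory scheduler. This is only an illustration of the intended semantics (each conflicting job replaces the one already scheduled for the same lock_args); `ToyScheduler`, its `perform_in` signature, and the explicit `now:` clock are hypothetical and are not part of Sidekiq or sidekiq-unique-jobs.

```ruby
# Toy model of "replace on conflict" for scheduled jobs.
# All names here are hypothetical; this is not the gem's implementation.
class ToyScheduler
  Job = Struct.new(:lock_args, :run_at)

  def initialize
    @scheduled = {}
  end

  # A conflicting job (same lock_args) is deleted and the new one is pushed,
  # so the run time always tracks the most recent event.
  def perform_in(delay, lock_args, now:)
    @scheduled.delete(lock_args)
    @scheduled[lock_args] = Job.new(lock_args, now + delay)
  end

  # Jobs whose scheduled time has been reached.
  def due(now)
    @scheduled.values.select { |job| job.run_at <= now }
  end

  def size
    @scheduled.size
  end
end

scheduler = ToyScheduler.new
# Three grouped events fire at t = 0, 2 and 4; each schedules a job 5 ticks out.
[0, 2, 4].each { |t| scheduler.perform_in(5, ["group-1"], now: t) }
puts scheduler.size               # number of jobs left for the group
puts scheduler.due(9).first.run_at # run time follows the last event
```

Under this model only one job per lock_args is ever scheduled, which is why a server-side conflict would not be expected in this use case.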

The exception I posted happens because, in Middleware::Client#lock, lock_instance.lock now doesn't yield but returns the return value of BaseLock#lock_failed when the job was locked by a strategy. This breaks the Sidekiq middleware interface because the middleware is supposed to either yield or return an item or nil/false. I believe the change restores the previous behavior by yielding inside lock when a strategy was successful:

return call_strategy unless (locked_token = locksmith.lock(&block))

I just saw the current documentation of lock_failed, which mentions it's supposed to return void, but I believe it still returns what it did before (String, nil), so the change might work for our use case by accident 😄

Maybe the proper fix would be to pass the block into the on_conflict strategies and yield there if they were successful?
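To make the failure mode concrete, here is a minimal sketch of a client-side middleware chain in which one middleware returns a lock token (a String) instead of yielding. It is a simplified model written for this thread, not Sidekiq's actual code: `ToyClient`, `YIELDING`, and `LEAKY` are hypothetical names, and the only assumption carried over from the backtrace is that the pushing code calls `key?` on whatever the chain returns.

```ruby
# Simplified model of a client middleware chain; NOT Sidekiq's real code.
class ToyClient
  attr_reader :queue

  def initialize(middlewares)
    @middlewares = middlewares
    @queue = []
  end

  # Wraps the middlewares around a final block that returns the job,
  # with the first middleware outermost.
  def invoke(job)
    chain = @middlewares.reverse.reduce(-> { job }) do |inner, middleware|
      -> { middleware.call(job) { inner.call } }
    end
    chain.call
  end

  def push(job)
    payload = invoke(job)
    return nil unless payload # nil/false means "do not enqueue"

    # The pusher assumes a Hash here, just like the frame in the backtrace.
    @queue << payload if payload.key?("class")
    payload["jid"]
  end
end

# Well-behaved middleware: yields and returns the result of the block.
YIELDING = lambda { |_job, &blk| blk.call }
# Misbehaving middleware: returns a lock token instead of yielding.
LEAKY = lambda { |_job, &_blk| "f5d69f8fd2e1f3dde8cee02e" }

job = { "class" => "MyWorker", "jid" => "abc", "args" => [1] }
ToyClient.new([YIELDING]).push(job) # enqueues the job
begin
  ToyClient.new([LEAKY]).push(job)
rescue NoMethodError => e
  puts e.message # the String token does not respond to key?
end
```

The same shape explains the `true:TrueClass` variant reported later in the thread: any truthy non-Hash return value reaches the `key?` call.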

Contributor Author

I went through the code again and I now believe the return value of OnConflict::Strategy#call isn't documented correctly. It should have the same return type as BaseLock#lock. I think the change works (for our use case) because for on_conflict: replace, lock_failed basically behaves like lock:

def lock_failed(origin: :client)

client_strategy.call { lock if replace? }

Which then calls BaseLock#lock again and returns String, nil.

WDYT about changing the return type of the strategies to String, nil and making all but replace return nil explicitly, instead of implicitly as most of them currently do?

@Intrepidd

👋 FYI I am having the exact same issue.

Using the suggested fix doesn't change anything, same error :

on_conflict: { 
  client: :replace, 
  server: :reschedule 
}

@dsander
Contributor Author

dsander commented Sep 1, 2021

@Intrepidd I just updated the specs with your (and the generally suggested) on_conflict server-side configuration, and it works without issues. Can you post the exception you are getting?

@Intrepidd

Intrepidd commented Sep 1, 2021

Thanks, that's odd.

The exception I get is the same : NoMethodError: undefined method 'key?' for "xxxxxxxxxxxx":String
Unfortunately I don't have a full backtrace, do you know how I could get one ? (Standard rails app)

Maybe worth noting: I'm enqueuing the job in the future with perform_in, effectively (trying to) use sidekiq-unique-jobs as a debouncer.

@TheSmartnik

@dsander I'm having a similar issue as well with the same backtrace as in the original post

To Reproduce:

Versions:
sidekiq (6.2.2)
sidekiq-unique-jobs (7.1.5)


class MyWorker
  include Sidekiq::Worker
  sidekiq_options lock: :until_and_while_executing, on_conflict: :reject, retry: false # same with { server: :reject, client: :reject }

  def perform(id)
  end
end

irb(main):001:0> MyWorker.perform_at(3.hours.from_now, 1)
=> "7c7a3f1c1bb909802c58efba"
irb(main):002:0> MyWorker.perform_at(3.hours.from_now, 1)
2021-09-08T12:22:08.816Z pid=22291 tid=fgb uniquejobs=client until_and_while_executing=uniquejobs:0d1124b8750bfa013beb1f52f6b608a7 INFO: Adding dead MyWorker job 819e5093ad0174935916b0d4
Traceback (most recent call last):
        1: from (irb):2
NoMethodError (undefined method `key?' for true:TrueClass)

Backtrace is exactly the same. It fails in here https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/client.rb#L196

@dsander dsander force-pushed the feature/fix-on-conflict-handling branch from 2337952 to 2bea226 Compare September 15, 2021 09:55
@dsander
Contributor Author

dsander commented Sep 15, 2021

@Intrepidd @TheSmartnik I have added tests for both of your use cases and they pass locally. I still have no clue why CI fails, though.

@dsander dsander force-pushed the feature/fix-on-conflict-handling branch 3 times, most recently from d1762f9 to e1610b5 Compare September 15, 2021 11:12
@dsander
Contributor Author

dsander commented Sep 15, 2021

That took some digging. It turned out that Sidekiq 6.2.2 broke the specs.

@mhenrixon
Owner

If you all upgrade to sidekiq-unique-jobs v7.1.6 you should be golden.

Sorry about the extra long response time, I have had a rough couple of months.

@dsander dsander force-pushed the feature/fix-on-conflict-handling branch from e1610b5 to f7cce0e Compare September 26, 2021 09:27
This fixes the exception posted below when using `on_conflict: replace`
and potentially other bugs related to `on_conflict`. Before [this change][1],
`BaseLock#lock` always returned either the acquired lock or the return
value of `call_strategy`. Since `Middleware::Client#lock` now only
yields when `lock_instance.lock` yields, we also have to `yield` inside
the lock instances when the lock was acquired via `lock_failed` (i.e. an
`on_conflict` strategy).

```
NoMethodError: undefined method `key?' for "f5d69f8fd2e1f3dde8cee02e":String
  from sidekiq/client.rb:197:in `atomic_push'
  from sidekiq/client.rb:190:in `block (2 levels) in raw_push'
  from redis.rb:2489:in `block in multi'
  from redis.rb:69:in `block in synchronize'
  from monitor.rb:202:in `synchronize'
  from monitor.rb:202:in `mon_synchronize'
  from redis.rb:69:in `synchronize'
  from redis.rb:2483:in `multi'
  from sidekiq/client.rb:189:in `block in raw_push'
  from connection_pool.rb:63:in `block (2 levels) in with'
  from connection_pool.rb:62:in `handle_interrupt'
  from connection_pool.rb:62:in `block in with'
  from connection_pool.rb:59:in `handle_interrupt'
  from connection_pool.rb:59:in `with'
  from sidekiq/client.rb:188:in `raw_push'
  from sidekiq/client.rb:74:in `push'
  from sidekiq/worker.rb:240:in `client_push'
  from sidekiq/worker.rb:215:in `perform_in'
```

 mhenrixon#590

[1]: https://github.com/mhenrixon/sidekiq-unique-jobs/commit/8c8d54c8b9dea363a7d8b8aeaceb2e82966b8503
When using the `on_conflict` `replace` strategy, `base_lock` has to
return the return value of the conflict strategy. The replace strategy
removes the previous job from the scheduled/retry set or the queue.
It is then re-queued inside `lock_failed`/`call_strategy`, thus we have
to return the value that came from `lock` inside the block of
`client_strategy.call`.
@dsander dsander force-pushed the feature/fix-on-conflict-handling branch from 31195e9 to b3255e1 Compare September 26, 2021 09:36
@dsander
Contributor Author

dsander commented Sep 26, 2021

@mhenrixon Thanks! That allowed me to get rid of the Sidekiq version lock. While it "fixed" the exception, it looks like it still did not fix the on_conflict: :replace client strategy: https://github.com/projectivetech/sidekiq-unique-jobs/runs/3712228560?check_suite_focus=true#step:6:22

I removed [this change](v7.1.5...v7.1.6#diff-3744df1729b19fd14c4d3ff1cbedc47bb8c3cc13131595a4ec579e03a2713fc8R112) in my last commit, which made the specs pass again.

mhenrixon added a commit that referenced this pull request Sep 27, 2021
Close #629

The problem originally was that in some situations, a string was returned (for the lock). This should never happen, any conflict strategy must return nil to avoid pretending to be successful.
@mhenrixon
Owner

@dsander could you check if #640 fixes the issue for you? The problem I see with your code is that you actually changed the internal API to be invalid. There should be no change from [void] to [String, void]. The API is expected to support void or nil only.

@Intrepidd

Hi ! Thanks for looking into it !

On my end I don't see the error again but am facing an odd behaviour :

I'm using on_conflict: :replace but when queuing a conflicting job, the previous one just disappears from the queue.

As mentioned before, I'm using perform_in, so I am talking about the scheduled queue.

@mhenrixon
Owner

@Intrepidd that's expected, the replace strategy deletes the old job and pushes the new one in. Which is why it should only be used with on_conflict: { server: :client }.

@Intrepidd

Intrepidd commented Sep 27, 2021

I'm sorry I wasn't clear.

No job ends up in the queue, leaving it empty.

And I am using

on_conflict: {
      client: :replace,
      server: :reschedule
    }

@dsander
Contributor Author

dsander commented Sep 27, 2021

@Intrepidd @mhenrixon This happens because in #640 the middleware always returns nil. From the sidekiq middleware docs:

Not calling yield or returning anything other than job will result in no further middleware being called and the job will not be pushed to the queue. (Note that yield will return an equivalent value to job)
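The empty-queue symptom can be reproduced with a small model of that rule. The sketch below assumes a lock-like middleware whose replace handling deletes the conflicting job and then either yields or returns nil; `ToyLockMiddleware`, its `yield_after_replace:` switch, and `toy_push` are hypothetical names made up for illustration, not the gem's API.

```ruby
# Hypothetical stand-in for a uniqueness middleware; not the gem's code.
class ToyLockMiddleware
  def initialize(yield_after_replace:)
    @locked = {}
    @yield_after_replace = yield_after_replace
  end

  def call(job, queue)
    key = job["args"]
    unless @locked[key]
      @locked[key] = true
      return yield # lock acquired: let the job through
    end

    # Conflict with "replace": the previously queued job is deleted ...
    queue.delete_if { |queued| queued["args"] == key }
    # ... and the new job is only enqueued if the middleware yields.
    @yield_after_replace ? yield : nil
  end
end

# Mirrors the documented rule: a falsy result means the job is not pushed.
def toy_push(middleware, queue, job)
  payload = middleware.call(job, queue) { job }
  queue << payload if payload
  payload
end

[false, true].each do |flag|
  middleware = ToyLockMiddleware.new(yield_after_replace: flag)
  queue = []
  toy_push(middleware, queue, { "args" => [1], "jid" => "a" })
  toy_push(middleware, queue, { "args" => [1], "jid" => "b" })
  puts "yield_after_replace=#{flag}: #{queue.map { |j| j['jid'] }.inspect}"
end
```

When the strategy returns nil after deleting the old job, both jobs are gone; when it yields, only the replacement remains, which is the behavior the replace strategy is meant to provide.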

@mhenrixon
Owner

Hey sorry about that @Intrepidd and @dsander. I was mistaking replace for reschedule. I'll see if I can replicate this with a middleware test first :)

@dsander
Contributor Author

dsander commented Sep 27, 2021

@mhenrixon You can use those: projectivetech@23dc652. Not sure how well they fit into the test suite though, I always like to test things from the outside when I am not super familiar with a library 😄

mhenrixon added a commit that referenced this pull request Sep 27, 2021
* OnConflict, return nil

Close #629

The problem originally was that in some situations, a string was returned (for the lock). This should never happen, any conflict strategy must return nil to avoid pretending to be successful.

* Refactor (thanks reek for pointing it out)

* Ensure the replace strategy yields

This should be considered a success

* Adds documentation