Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruby locking is per-fiber, not per-thread. #962

Closed
ioquatix opened this issue Oct 17, 2022 · 24 comments · Fixed by #983 or #990
Closed

Ruby locking is per-fiber, not per-thread. #962

ioquatix opened this issue Oct 17, 2022 · 24 comments · Fixed by #983 or #990
Assignees

Comments

@ioquatix
Copy link
Contributor

ioquatix commented Oct 17, 2022

Ruby mutex implementation is per-fiber and generally speaking, all locking mechanisms should be per-fiber for the purpose of mutual exclusion.

As such, it appears at least some parts of concurrent ruby might not be correct, according to this specification.

I've been looking at

class ReentrantReadWriteLock < Synchronization::Object
because we had some reports here socketry/falcon#166

Specifically

      "kind":"Concurrent::IllegalOperationError",
      "message":"Cannot release a read lock which is not held",

It appears that multi-fiber re-entrancy is not supported or handled correctly, or used incorrectly.

I'm not sure what the correct solution is here, but at least we can start having a discussion about how it should work and what the solution is.

@chrisseaton
Copy link
Member

This is since Ruby 3.0 isn't it?

Not massively looking to working through the significance of this change...

Would you be willing to start by working with me to redefine what you think the specs should be now?

@ioquatix
Copy link
Contributor Author

IIRC, different versions of Ruby did different things and 3.0 standardised the behaviour. On some platforms where fibers were backed by threads, mutex was already per-fiber.

I checked the implementation a little bit and it looks okay to me, you are using Mutex correctly.

Maybe it's a genuine usage error on the part of rails, let me check their code.

@ioquatix
Copy link
Contributor Author

I've updated the original discussion of this issue, because I'm no longer sure that the issue is with this code.

@ioquatix
Copy link
Contributor Author

This seems to reproduce the original issue, at least the error is the same:

require 'concurrent'

lock = Concurrent::ReentrantReadWriteLock.new
lock.acquire_read_lock

puts "Lockity lock!"

Thread.new do
  lock.release_read_lock
end.join

The question is, is this what's happening?

@chrisseaton
Copy link
Member

Cannot release a read lock which is not held

That thread doesn't hold the lock so can't release it.

@ioquatix
Copy link
Contributor Author

Yes, I understand that, but why is it not happening on the same thread?

@chrisseaton
Copy link
Member

Are you saying you're aware that the last code snippet you posted is broken - you're saying that something else in some other code is making the same mistake as this example code, and that's what you're trying to fix?

Ok I think we're on the same page now. Yeah sounds like an application bug.

You could probably add some error-handling code to say which thread is holding it and which is trying to release it (might take some extra book-keeping) - I'd try that.

@ioquatix
Copy link
Contributor Author

Sorry about my lack of clarity, I don't really know what's going on fully yet.

Are you saying you're aware that the last code snippet you posted is broken - you're saying that something else in some other code is making the same mistake as this example code, and that's what you're trying to fix?

Yes, I'm trying to reproduce possible scenarios which could create the behaviour I'm seeing so I know what to look for.

Ok I think we're on the same page now. Yeah sounds like an application bug.

Yes, maybe in Rails or maybe the way Rails is being invoked by Falcon.

You could probably add some error-handling code to say which thread is holding it and which is trying to release it (might take some extra book-keeping) - I'd try that.

Yes, that's the plan. I don't have a full repro, but I might try hacking something together.

@ioquatix
Copy link
Contributor Author

Okay, I reviewed all the code and was able to make what appears to be a repro:

#!/usr/bin/env ruby

require 'concurrent'
require 'async'

lock = Concurrent::ReentrantReadWriteLock.new
Q = 0.01

watcher = Thread.new do
  loop do
    lock.with_write_lock do
      sleep Q
    end
    sleep Q
  end
end

Async do |task|
  2.times do |i|
    task.async do
      loop do
        lock.with_read_lock do
          sleep Q
        end
        sleep Q
      end
    end
  end
end

Results in:

0>1>0-1-  0.0s     warn: Async::Task [oid=0xa0] [ec=0xb4] [pid=857111] [2022-10-18 13:43:20 +1300]
               | Task may have ended with unhandled exception.
               |   Concurrent::IllegalOperationError: Cannot release a read lock which is not held
               |   → /home/samuel/.gem/ruby/3.1.2/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:244 in `release_read_lock'
               |     /home/samuel/.gem/ruby/3.1.2/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:130 in `with_read_lock'
               |     bug.rb:23 in `block (4 levels) in <main>'
               |     bug.rb:21 in `loop'
               |     bug.rb:21 in `block (3 levels) in <main>'
               |     /home/samuel/.gem/ruby/3.1.2/gems/async-2.2.1/lib/async/task.rb:107 in `block in run'
               |     /home/samuel/.gem/ruby/3.1.2/gems/async-2.2.1/lib/async/task.rb:243 in `block in schedule'

After that the lock is totally jammed up.

@ioquatix
Copy link
Contributor Author

Okay, I think I have an idea of the problem.

The thread local variable is per-thread, while the synchronisation is per fiber. I think this must be leading to some kind of invalid state.

@ioquatix
Copy link
Contributor Author

Okay, that appears to be the problem. The fix in this case is to use fiber local variables:

require 'thread'
require 'concurrent/atomic/abstract_thread_local_var'

module Concurrent

  # @!visibility private
  # @!macro internal_implementation_note
  class RubyThreadLocalVar < AbstractThreadLocalVar
    def initialize(...)
      super
      @key = :"concurrent-ruby-#{object_id}"
    end

    def value
      Thread.current[@key] || default
    end

    def value=(value)
      Thread.current[@key] = value
    end

    def allocate_storage
      # No-op.
    end
  end
end

I just monkey patched this into my local gem and it fixed the issue.

The current implementation of thread locals looks a bit... over engineered? Maybe it was that way for backwards compatibility?

In any case, the locking is per fiber, and the state used for tracking that locking is per-thread, so any time you mix the two, it's bound to eventually end in disaster. The solution is to use per-fiber state for the lock implementation.

@ioquatix
Copy link
Contributor Author

@chrisseaton any thoughts on the next steps?

@chrisseaton
Copy link
Member

I'm a bit worried about how deep this goes.

Yes likely complexity is about backwards-compatibility. I'd like to add a simple core set of primitives to MRI for the long-term but that's another discussion...

the locking is per fiber, and the state used for tracking that locking is per-thread

That's probably the central thing to think about. I'd like to be able to write a spec that we want to meet. Would you be able to help me with that? Even just the plain-English description of the behaviour you want.

@ioquatix
Copy link
Contributor Author

ioquatix commented Nov 10, 2022

Even just the plain-English description of the behaviour you want.

(It's not what I want, it's what's defined by CRuby).

Mutex locking is per-fiber.

Therefore, if you are storing state "per lock" you need to store it per fiber.

That's what the above monkey patch does.

@ioquatix
Copy link
Contributor Author

@eregon can I please pull your expertise into this discussion.

@eregon
Copy link
Collaborator

eregon commented Nov 10, 2022

I think for ReentrantReadWriteLock and others which use ThreadLocalVar, we should use FiberLocalVar (to be added, #376 (comment)) if Mutex is per Fiber.
That can be easily detected with

m = Mutex.new
mutex_owned_per_thread = m.synchronize { Fiber.new { m.owned? }.resume }

I think then the compatibility concerns are non-existent because using a ThreadLocalVar for per-Mutex-state on 3.0+ is just a bug.

@chrisseaton
Copy link
Member

I don't have any objection to this fix if that's what you're thinking by the way. Just moving cautiously on a subject that is usually more complex than it appears.

@eregon
Copy link
Collaborator

eregon commented Jan 23, 2023

@ioquatix
Copy link
Contributor Author

Thanks!

@hoprocker
Copy link

Hey folks, we just came across an issue with v1.2.0 in JRuby that seems to be related to this. Our code uses ReentrantReadWriteLock; it worked fined under v1.1.10, but in v1.2.0 we're now seeing this error:

undefined method `current' for Fiber:Class
Did you mean?  __current__

rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/locals.rb:180:in `locals!'"
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/locals.rb:102:in `set',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/fiber_local_var.rb:77:in `value=',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:205:in `acquire_read_lock',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:128:in `with_read_lock',
...

link

At first glance it looks like this could be mitigated by switching from Fiber.current to Thread.current with some sort of environment evaluation, ie

current_ctx = (RUBY_PLATFORM == 'java' ? Thread.current : Fiber.current)
ObjectSpace.define_finalizer(current_ctx, thread_fiber_finalizer(locals.object_id))

However, I have no idea how much deeper this goes, if this would need to be swapped out in multiple places, etc.

ioquatix added a commit that referenced this issue Feb 20, 2023
@ioquatix
Copy link
Contributor Author

The above PR should fix this issue.

@hoprocker
Copy link

Somehow totally missed that, thank you!

eregon pushed a commit that referenced this issue Feb 22, 2023
@eregon
Copy link
Collaborator

eregon commented Feb 24, 2023

The missing require only affects Ruby <= 3.0, 3.1+ always have Fiber.current defined.
And the reason we missed this is rake spec:isolated is only run 3.2 currently in CI, I'll also run it on the oldest Ruby we support to help catch this.

@eregon
Copy link
Collaborator

eregon commented Feb 24, 2023

Released in https://github.com/ruby-concurrency/concurrent-ruby/releases/tag/v1.2.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants