Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoloader racing on multiple threads #3040

Open
itarato opened this issue May 5, 2023 · 7 comments
Open

Autoloader racing on multiple threads #3040

itarato opened this issue May 5, 2023 · 7 comments

Comments

@itarato
Copy link
Collaborator

itarato commented May 5, 2023

Autoloader use with multiple threads cause premature use of the class.

Example

When executing mutliple Rails tests with the debugger: jt --use jvm-ce ruby --jdebug -S ./bin/rails test test/folder the execution fails with:

#<Thread:0x390d8 @active_support_execution_state={:active_support_execution_context=>{}} /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/minitest-5.18.0/lib/minitest/parallel.rb:28 run> terminated with exception (report_on_exception is true):
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/railtie.rb:54:in `block (2 levels) in <class:Railtie>': undefined method `reset_all' for ActiveSupport::CurrentAttributes:Class (NoMethodError)
Did you mean?  reset_callbacks
	from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:445:in `instance_exec'
	from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:445:in `block in make_lambda'
	from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:199:in `block (2 levels) in halting'
	from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:687:in `block (2 levels) in default_terminator'
	from <internal:core> core/throw_catch.rb:36:in `catch'
	from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:686:in `block in default_terminator'

The test executor spawns threads to spread the test execution, and most threads fail with this error (not all). What's happening is the very first thread is referring to ActiveSupport::CurrentAttributes, an unknown constant. The autoload definition in https://github.com/rails/rails/blob/2bfb6565a4996a19065e03300d9061e9eb265d92/activesupport/lib/active_support.rb#L42 delegates the resolution to Zeitwerk and it starts loading the class (snippet from the first thread load backtrace):

/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/current_attributes.rb:4:in `/'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/current_attributes.rb:4:in `<top (required)>'
<internal:core> core/kernel.rb:234:in `gem_original_require'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/zeitwerk-2.6.8/lib/zeitwerk/kernel.rb:38:in `require'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/railtie.rb:49:in `block (2 levels) in <class:Railtie>'

Meanwhile the other threads will just use the class as if it's already loaded - which is not the case, the class's Ruby code is still executing (function definitions) so some threads won't have all the methods available: hence the error. This was confirmed by logging file execution start/end.

Reproduce

  • make a new rails project with truffle ruby
  • add a bunch of scaffolds, eg: rails generate scaffold PostA name:string title:string content:text value:integer
  • call tests with --jdebug

Minimum example

Unfortunately reproducing it with a minimum example was tricky. It seems using Module#autolad vs ActiveSupport::Autoload#autoload was the same. I couldn't replicate it with --jvm-ce but saw something similar with --native.

Code

# Bait file.

class Foo
	sleep(0.1)

	def bar; end
end
require("thread")

autoload("Foo", "./foo.rb")
thread = [nil] * 8

thread[0] = Thread.new { Foo.new.bar }
thread[1] = Thread.new { Foo.new.bar }
thread[2] = Thread.new { Foo.new.bar }
thread[3] = Thread.new { Foo.new.bar }
thread[4] = Thread.new { Foo.new.bar }
thread[5] = Thread.new { Foo.new.bar }
thread[6] = Thread.new { Foo.new.bar }
thread[7] = Thread.new { Foo.new.bar }

thread.each(&:join)

Example output (6 failure, 2 success):

#<Thread:0x148 try_break_autoload.rb:26 run> terminated with exception (report_on_exception is true):
try_break_autoload.rb:26:in `block in <main>': undefined method `bar' for #<Foo:0x158> (NoMethodError)
#<Thread:0x138 try_break_autoload.rb:25 run> terminated with exception (report_on_exception is true):
try_break_autoload.rb:25:in `block in <main>': undefined method `bar' for #<Foo:0x168> (NoMethodError)

Observation

  • Rails on CRuby does not run tests on multiple threads
  • CRuby and Truffle with --jvm-ce does not reproduce the error with the snippet above
  • I'm using --jdebug on the Rails test run because without it there is a different racing error (tracked separately), probably the debugger slows it down enough so the first error does not occur that often
  • this error shows with/without compilation (--engine.Compilation=false)
@itarato itarato self-assigned this May 5, 2023
@itarato itarato added the shopify label May 5, 2023
@eregon
Copy link
Member

eregon commented May 5, 2023

I think you are seeing #2431.
That issue has more details on what's currently done and what's missing.
The problem is we do synchronize so only one thread autoloads, but we also publish the constant as soon as it's set which means other threads can see it at that point, before the file is executed.
I guess when the autoload constant is set we should consider it set only for the current thread doing the file loading, and then publish it when the current thread finishes loading that file.

Thanks for the detailed issue and reproduction.

@itarato itarato removed the shopify label May 8, 2023
@itarato
Copy link
Collaborator Author

itarato commented May 9, 2023

Didn't get too far, still got confused by the nodes layout in execution. I've used the following code for debugging (safe enough to avoid racing but still threaded so the same-thread-locking guards can be tested):

require("thread")

autoload("Foo", "./foo.rb")

Foo.new.bar

Thread.new { Foo.new.bar }.join

At the first Foo.new.bar on the main thread it loads LookupConstantWithLexicalScopeNode which calls GetConstantNode. This is where Object has the constant field for Foo with autoload config. Here it starts the require:

callRequireNode.call(coreLibrary().mainObject, "require", feature);
This goes down the stack and calls GetConstantNode again (couldn't figure out why it needs a second one). This eventually replaces the constant entry for Object:
} while (!ConcurrentOperations.replace(constants, name, previousEntry, new ConstantEntry(newConstant)));
. And now this can return to the parent GetConstantNode and resolve the autoload lock:

Now when we go to the second lookup for Thread.new { Foo.new.bar }.join the LookupConstantWithLexicalScopeNode returns the new constant declaration for Foo which will just use the final Foo:

Only had vague ideas how to approach it:

  • For final constants, when replacing the autoloaded ones for a module, a property could be set indicating that it's not completed (loaded). Once GetConstantNode is completed and locks are resolved, this property could be switched (to completed=true). One small challenge is that the autoloader in GetConstantNode would need to look this up to update it. And a bigger challenge is what to do with the other const loader that waits for the in-progress original load to finish. Busy sleep? Semaphore?

I'm looking for some guidance on a better approach here.

@eregon
Copy link
Member

eregon commented May 30, 2023

One idea is for the constant being autoloaded to not set it yet but either store it in a new field of AutoloadConstant or maybe in the RubyConstant#value of the autoload RubyConstant.
And then we can probably reuse the existing autoload locks, so as long as the other thread sees it's not set yet it will try to acquire the autoload lock and wait, so that's fine.
And we have AutoloadConstant#isAutoloadingThread() to know which thread is allowed to see the value.

@eregon eregon self-assigned this May 30, 2023
@eregon
Copy link
Member

eregon commented May 30, 2023

A reliable reproducer:

# main.rb
autoload :C, "#{__dir__}/c.rb"

$inside = false

t = Thread.new { p C.new.foo }
Thread.pass until $inside
p C.new.foo

t.join
# c.rb
class C
  def bar
  end

  $inside = true
  sleep 1

  def foo
    42
  end
end

@eregon
Copy link
Member

eregon commented May 30, 2023

I tried an edge case, what if while loading the file we create another thread which tries to access the being-autoloaded constant? Answer: it deadlocks on CRuby:

# c.rb
class C
  def bar
  end

  Thread.new { p C }.join

  $inside = true
  sleep 1

  def foo
    42
  end
end

So it's simple, we don't need to handle that case, it can just hang.

If we simplify main.rb we can get CRuby to print the deadlock:

# main.rb
autoload :C, "#{__dir__}/c.rb"
p C.new.foo

gives

/home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5:in `join': No live threads left. Deadlock? (fatal)
2 threads, 2 sleeps current:0x0000000001bdfae0 main thread:0x0000000001b03060
* #<Thread:0x00007f26bfe1fcc0 sleep_forever>
   rb_thread_t:0x0000000001b03060 native:0x00007f26bff40740 int:0
   
* #<Thread:0x00007f26bfdd8eb0 /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5 sleep_forever>
   rb_thread_t:0x0000000001bdfae0 native:0x00007f26af30b6c0 int:0
    depended by: tb_thread_id:0x0000000001b03060
   
	from /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5:in `<class:C>'
	from /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:1:in `<top (required)>'
	from <internal:/home/eregon/.rubies/ruby-3.1.3/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
	from <internal:/home/eregon/.rubies/ruby-3.1.3/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
	from main.rb:2:in `<main>'

@eregon
Copy link
Member

eregon commented May 30, 2023

With Fibers on 3.2, with Fiber.new { p C }.resume instead of Thread.new { p C }.join it's:

c.rb:5:in `block in <class:C>': deadlock; lock already owned by another fiber belonging to the same thread (ThreadError)

So just an early error but still a deadlock.
On 3.1 it seems to use a per-Ruby-Thread lock and it "works", but let's use 3.2 semantics it's far easier for TruffleRuby.

graalvmbot pushed a commit that referenced this issue May 31, 2023
* We stash the value of the constant being autoloaded in AutoloadConstant
  and only publish it once the autoload ends.
* Fixes #2431
  and #3040
@eregon
Copy link
Member

eregon commented May 31, 2023

I have a fix for this in #3078, I'm still polishing it a bit.
Notably there is the case of module M; autoload :OpenSSL, "openssl"; p OpenSSL; end I need to review.

I also noticed we still have the notion of "undefined constants", but this should have been removed in CRuby 3.1, so now we should be able to clean that up as well. But probably best done separately, it's already quite a complex change as-is.

graalvmbot pushed a commit that referenced this issue Jun 9, 2023
* We stash the value of the constant being autoloaded in AutoloadConstant
  and only publish it once the autoload ends.
* Fixes #2431
  and #3040
graalvmbot pushed a commit that referenced this issue Jun 12, 2023
* We stash the value of the constant being autoloaded in AutoloadConstant
  and only publish it once the autoload ends.
* Fixes #2431
  and #3040
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants