concurrency issue on MRI #47

Conversation

klobuczek (Member)

No description provided.

@klobuczek (Member Author)

@efivash wrote:

I ran into this as well on macOS. It appears to happen when Neo4j::Driver::DirectConnectionProvider#acquire_connection is called at the same time in multiple threads, since the threads share the FFI seabolt connector object.

Although ActiveGraph tries to keep explicit_session and tx in thread-specific variables, ActiveGraphTransactions#send_transaction calls driver.session, which grabs a memoized driver that is shared between threads. When that driver builds a session for each thread, it does so with the same session factory initialized in the shared driver, which provides the same connection_provider to every session.

I found that these two methods consistently hang. The second method reproduces the path ActiveGraphTransactions takes when sending a transaction while tx and explicit_session are both nil or not open (as they will be in a new thread):

def hang1
  threads = (1..10).map do |_n|
    Thread.new do
      (1..100).each do |_m|
        ActiveGraph::Base.read_transaction {}
      end
    end
  end
  threads.map(&:join)
end

def hang2
  threads = (1..10).map do |_n|
    Thread.new do
      (1..100).each do |_m|
        ActiveGraph::Base.driver.session.send(:acquire_connection, Neo4j::Driver::AccessMode::READ)
      end
    end
  end
  threads.map(&:join)
end

@klobuczek (Member Author)

@efivash @mperice I am pretty sure this is a deadlock caused by the GIL (Global Interpreter Lock). It's not the first time I have seen this; the provided spec passes on JRuby. We need to brainstorm on this. The connector is meant to be thread-safe and reused for performance reasons, and it provides connection pooling. You could try playing with different pool sizes; the configurable options are:

logger: ActiveSupport::Logger.new(STDOUT, level: ::Logger::ERROR), # :set_log
leaked_session_logging: false,
# connection_liveness_check_timeout: -1, # Not configured
max_connection_lifetime: 1.hour, # :set_max_connection_life_time
max_connection_pool_size: 100, # :set_max_pool_size
connection_acquisition_timeout: 1.minute, # :set_max_connection_acquisition_time
encryption: true, # :set_transport
trust_strategy: :trust_all_certificates,
connection_timeout: 30.seconds, # BoltSocketOptions_set_connect_timeout
max_transaction_retry_time: Internal::Retry::ExponentialBackoffRetryLogic::DEFAULT_MAX_RETRY_TIME,
# resolver: nil # :set_address_resolver
keep_alive: true, # BoltSocketOptions_set_keep_alive
# ???? BoltConfig_set_user_agent

I suspect that there are some staged semaphores in the seabolt library which cause a deadlock when there are no more connections in the pool and one needs to be requested from the server. What advantage are you getting from multiple threads if only one thread can run at any time anyway? We could try putting semaphores all over, but that would serialize everything for JRuby as well. Maybe you could serialize everything much earlier in the process rather than relying on the GIL. Does anyone of you have time to look at the seabolt source code (https://github.com/neo4j-drivers/seabolt)? Neo4j gave up maintenance on it and no help can be expected from their side.
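For anyone who wants to experiment with those settings, a minimal illustrative sketch of building a driver by hand with a larger pool, assuming the options above are accepted as keyword arguments to GraphDatabase.driver (the URL and credentials are placeholders):

require 'active_support/core_ext/integer/time'
require 'neo4j/driver'

# Sketch only: a hand-built driver with a larger connection pool and a shorter
# acquisition timeout than the defaults listed above.
driver = Neo4j::Driver::GraphDatabase.driver(
  'bolt://localhost:7687',
  Neo4j::Driver::AuthTokens.basic('neo4j', 'password'),
  max_connection_pool_size: 200,              # :set_max_pool_size
  connection_acquisition_timeout: 30.seconds, # :set_max_connection_acquisition_time
  connection_timeout: 10.seconds              # BoltSocketOptions_set_connect_timeout
)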

@mperice commented Aug 6, 2020

Hmm, the thing is that I first started seeing the connection hang with only 2 threads (2 Sidekiq workers), and the connections hung basically on the first two write transactions (right after starting the Sidekiq workers), so I don't think I exceeded the connection pool size. However, I was connecting to a remote Neo4j server (average latency of around 40 ms). When I tried the same setup against my local Neo4j database, I couldn't reproduce the issue with only 2 threads. That is why I created the "synthetic" code with many more threads. It seems that there might be two different issues here.

@efivash commented Aug 7, 2020

Yeah, I don't think I'm hitting the connection pool limit either. I'm hitting this with 2 threads and a pool size of 100.

I don't run into this deadlock when each thread has its own driver, though I'm not sure how useful that information is.
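For what it's worth, a rough illustrative sketch of the per-thread-driver arrangement (the URL and credentials are placeholders, and this is not a recommended fix); it sidesteps the shared seabolt connector at the cost of one connection pool per thread:

require 'neo4j/driver'

# Rough sketch: one driver (and therefore one seabolt connector/pool) per thread,
# instead of the memoized ActiveGraph::Base.driver shared across threads.
threads = (1..10).map do
  Thread.new do
    driver = Neo4j::Driver::GraphDatabase.driver(
      'bolt://localhost:7687',
      Neo4j::Driver::AuthTokens.basic('neo4j', 'password')
    )
    begin
      (1..100).each do
        session = driver.session
        begin
          session.read_transaction { |tx| tx.run('RETURN 1') }
        ensure
          session.close
        end
      end
    ensure
      driver.close
    end
  end
end
threads.each(&:join)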

@ekampp commented Aug 11, 2020

I can confirm that we're also seeing this problem on MRI with three threads running.

@klobuczek (Member Author)

If anyone could confirm where in the code the deadlocked threads are waiting, that would be very helpful in potentially finding a solution.
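In case it helps with gathering that information, here is a rough sketch (not from the driver itself) that dumps the Ruby-level backtrace of every live thread when the process receives TTIN; native frames inside seabolt will not show up here, so a gdb/lldb dump is still needed for those:

# Sketch: install a TTIN handler that prints each live thread's Ruby backtrace,
# then send `kill -TTIN <pid>` once the process looks hung.
trap('TTIN') do
  Thread.list.each do |thread|
    puts "Thread #{thread.object_id} status=#{thread.status.inspect}"
    puts (thread.backtrace || ['<no Ruby backtrace>']).join("\n")
    puts '-' * 40
  end
end

Sidekiq installs a similar TTIN handler by default, which is handy when the hang occurs inside workers.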

@wmene commented Sep 4, 2020

@klobuczek I took @efivash's hang2 method but ran it with 5 threads instead of 10. Here is the thread dump:
https://gist.githubusercontent.com/wmene/7935df5d564fae38060cfa04d2c34eb1/raw/16e8a403bf1f0d5a10269d74fda901ed844e1ed9/threaddump.txt

4 threads are waiting for the GVL and one is waiting for a mutex inside libseabolt. The five threads' backtraces start with (pry):3.

I'm not sure if I'm interpreting this correctly, but the deadlock might be happening because the thread that's waiting for the seabolt mutex has the GVL. The thread that is waiting on the mutex is stuck in the function BoltDirectPool_acquire: https://github.com/neo4j-drivers/seabolt/blob/1.7/src/seabolt/src/bolt/direct-pool.c#L234

One thing that is not clear: how did 3 of the threads get past the mutex in BoltDirectPool_acquire?

@RomainCavrot

Hello, I have the same problem. Do you think there is a chance it will be solved? And if so, when?

@sezny commented Dec 28, 2020

Hi! Any news or a temporary fix on this, guys?

@sezny commented Dec 28, 2020

It happens to me in a Rails app when Puma runs more than 1 thread and I make multiple requests at the same time. At some point the server hangs without logging anything and just times out for every request after that point (and a restart is needed).

So I have set Puma to 1 thread for now, but to use Sidekiq and restore Puma to 5 threads, a fix would be appreciated. I suppose it has something to do with the Neo4j/Bolt new-session / acquire-connection path, as suggested earlier by @wmene.
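For reference, a minimal sketch of the single-thread Puma workaround described above (the file location and worker count are assumptions, not something prescribed in this thread):

# config/puma.rb (sketch): one thread per worker so concurrent
# acquire_connection calls cannot collide, scaling with processes instead.
workers Integer(ENV.fetch('WEB_CONCURRENCY', 2))
threads 1, 1
preload_app!
port ENV.fetch('PORT', 3000)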

@joshjordan (Contributor)

We're running into this as well.

@wmene you're right on regarding the GVL and the mutex. I'm also not sure about those three threads, but the thread that IS waiting on the mutex has the GVL, and the other 3 are waiting to acquire the GVL. So you're deadlocked. We have the same problem, and it's ultimately because neo4j-ruby-driver has Ruby code attached to the seabolt logging functions. This is fine most of the time, but not fine within a mutex. Here's what happens:

  1. Thread A calls a function that takes a mutex, e.g. BoltDirectPool_acquire in your case and BoltDirectPool_release in mine.
  2. Thread A acquires the mutex.
  3. Thread A calls a logging function, e.g. BoltLog_warning in your case and BoltLog_info in mine.
  4. seabolt invokes the logging callback, which uses FFI to call Ruby code, eventually calling gvl_acquire_common; thus Thread A ends up waiting for the GVL.

Meanwhile:

  1. Thread B has the GVL and also calls BoltDirectPool_acquire
  2. Thread B reaches the mutex in seabolt and waits for Thread A to release it

So Thread A has the mutex and wants the GVL, while Thread B has the GVL and wants the mutex. Deadlock.

Our example is a bit easier to read since it's only two threads blocked up, not 4. Check out this gist: https://gist.github.com/joshjordan/a53b01b5feb7a0920efd4c3e8982e450 Here, Thread 10 is A from my example, and Thread 9 is B.
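To make the pattern concrete, here is a tiny self-contained illustrative sketch of the same lock-ordering deadlock, with two plain Ruby mutexes standing in for the seabolt pool mutex and the GVL (purely schematic, not the driver's code):

# Two locks acquired in opposite orders by two threads: the same shape of
# deadlock as described above. Ruby's deadlock detector aborts this toy
# version; in the real case the pool mutex is a native pthread mutex that
# Ruby cannot see, so the process simply hangs.
pool_mutex = Mutex.new # stands in for the seabolt BoltDirectPool mutex
gvl        = Mutex.new # stands in for the GVL

a = Thread.new do
  pool_mutex.synchronize do   # Thread A takes the pool mutex...
    sleep 0.1
    gvl.synchronize {}        # ...then needs the "GVL" to run the logging callback
  end
end

b = Thread.new do
  gvl.synchronize do          # Thread B holds the "GVL"...
    sleep 0.1
    pool_mutex.synchronize {} # ...and waits for the pool mutex held by A
  end
end

[a, b].each(&:join)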

@klobuczek is there any simple way to detach the logging functions before calling the connection pool acquire and release functions? Or does this change need to be made at the seabolt level?

@joshjordan (Contributor) commented Jan 24, 2021

This is particularly unfortunate because neo4j-ruby-driver defaults to level: ::Logger::ERROR and doesn't get any overrides from activegraph. So the processes are hanging/crashing just because neo4j-ruby-driver allows seabolt to make no-op calls to info and warn that don't even end up being output :(

On first look, it doesn't seem that it's safe to just disable logging during acquire/release, since multiple BoltConnection objects in seabolt may share the BoltCommunication object where the logging callbacks reside. I may be wrong about that, but I'm currently thinking we'll need to pass a boolean through from the connection pool to near the logging calls, to indicate "don't log, we're inside a mutex and it might not be safe".

The positive side of the no-op logging statements is that there isn't much impact from disabling logging. For anyone looking to escape this issue short-term, check out my branch at https://github.com/joshjordan/neo4j-ruby-driver/tree/temp-disable-logging-fix-deadlock which disables all logging callbacks from seabolt to neo4j-ruby-driver, since they aren't safe. This does, unfortunately, disable error logging from neo4j-ruby-driver as well, because I have not been able to verify whether error is ever called from a mutex-locked code path.
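If it helps anyone, a Gemfile sketch for pointing Bundler at that branch (standard Bundler git support; adjust to your own setup):

# Gemfile sketch: pin the driver to the branch that disables the seabolt
# logging callbacks, until a proper fix lands upstream.
gem 'neo4j-ruby-driver',
    github: 'joshjordan/neo4j-ruby-driver',
    branch: 'temp-disable-logging-fix-deadlock'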

@klobuczek (Member Author)

Thank you @joshjordan for the spot-on analysis. There is no easy solution to the problem. We would have to implement the logging function in C and make it write to the same file as the Ruby logger.

@klobuczek (Member Author)

The logging has been removed from the FFI version of the driver (1.7). The error should not happen anymore, with the caveat of reduced logging.

@klobuczek closed this Sep 26, 2021