concurrency issue on MRI #47

Conversation

klobuczek (Member)

No description provided.

@klobuczek (Member Author)

@efivash wrote:

I ran into this as well on macOS. It appears to happen when Neo4j::Driver::DirectConnectionProvider#acquire_connection is called at the same time in multiple threads, since the threads share the FFI seabolt connector object.

Although ActiveGraph tries to keep explicit_session and tx in thread-specific variables, ActiveGraphTransactions#send_transaction calls driver.session, which grabs a memoized driver that is shared between threads. When that driver builds a session for each thread, it does so with the same session factory initialized in the shared driver, which provides the same connection_provider to every session.

I found that these two methods consistently hang. The second method reproduces the path ActiveGraphTransactions takes when sending a transaction while tx and explicit_session are both nil or not open (as they will be in a new thread):

def hang1
  threads = (1..10).map do |_n|
    Thread.new do
      (1..100).each do |_m|
        ActiveGraph::Base.read_transaction {}
      end
    end
  end
  threads.map(&:join)
end

def hang2
  threads = (1..10).map do |_n|
    Thread.new do
      (1..100).each do |_m|
        ActiveGraph::Base.driver.session.send(:acquire_connection, Neo4j::Driver::AccessMode::READ)
      end
    end
  end
  threads.map(&:join)
end

@klobuczek (Member Author)

@efivash @mperice I am pretty sure this is a deadlock caused by the GIL (Global Interpreter Lock). It's not the first time I have seen this; the provided spec passes on JRuby. We need to brainstorm on this. The connector is meant to be thread-safe and reused for performance reasons, and it provides connection pooling. You could try playing with different pool sizes; the configurable options are:

logger: ActiveSupport::Logger.new(STDOUT, level: ::Logger::ERROR), # :set_log
leaked_session_logging: false,
# connection_liveness_check_timeout: -1, # Not configured
max_connection_lifetime: 1.hour, # :set_max_connection_life_time
max_connection_pool_size: 100, # :set_max_pool_size
connection_acquisition_timeout: 1.minute, # :set_max_connection_acquisition_time
encryption: true, # :set_transport
trust_strategy: :trust_all_certificates,
connection_timeout: 30.seconds, # BoltSocketOptions_set_connect_timeout
max_transaction_retry_time: Internal::Retry::ExponentialBackoffRetryLogic::DEFAULT_MAX_RETRY_TIME,
# resolver: nil # :set_address_resolver
keep_alive: true, # BoltSocketOptions_set_keep_alive
# ???? BoltConfig_set_user_agent

I suspect that there are some staged semaphores in the seabolt library which cause a deadlock when there are no more connections in the pool and one needs to be requested from the server. What advantage are you getting from multiple threads if only one thread can run at any time anyway? We could try putting semaphores all over, but that would serialize everything for JRuby as well. Maybe you could serialize everything much earlier in the process rather than relying on the GIL. Does anyone of you have time to look at the seabolt source code (https://github.com/neo4j-drivers/seabolt)? Neo4j gave up maintenance on it and no help can be expected from their side.
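For anyone who wants to experiment with those settings, a minimal illustrative sketch of building a driver by hand with a larger pool, assuming the options above are accepted as keyword arguments to GraphDatabase.driver (the URL and credentials are placeholders):

require 'active_support/core_ext/integer/time'
require 'neo4j/driver'

# Sketch only: a hand-built driver with a larger connection pool and a shorter
# acquisition timeout than the defaults listed above.
driver = Neo4j::Driver::GraphDatabase.driver(
  'bolt://localhost:7687',
  Neo4j::Driver::AuthTokens.basic('neo4j', 'password'),
  max_connection_pool_size: 200,              # :set_max_pool_size
  connection_acquisition_timeout: 30.seconds, # :set_max_connection_acquisition_time
  connection_timeout: 10.seconds              # BoltSocketOptions_set_connect_timeout
)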

@mperice commented Aug 6, 2020

Hmm, the thing is that I first started seeing the connection hang with only 2 threads (2 Sidekiq workers), and the connections hung basically on the first two write transactions (right after starting the Sidekiq workers), so I don't think I exceeded the connection pool size. However, I was connecting to a remote Neo4j server (average latency of around 40 ms). When I tried the same setup against my local Neo4j database, I couldn't reproduce the issue with only 2 threads. That is why I created the "synthetic" code with many more threads. It seems that there might be two different issues here.

@efivash commented Aug 7, 2020

Yeah, I don't think I'm hitting the connection pool limit either. I'm hitting this with 2 threads and a pool size of 100.

I don't run into this deadlock when each thread has its own driver, though I'm not sure how useful that information is.
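For what it's worth, a rough illustrative sketch of the per-thread-driver arrangement (the URL and credentials are placeholders, and this is not a recommended fix); it sidesteps the shared seabolt connector at the cost of one connection pool per thread:

require 'neo4j/driver'

# Rough sketch: one driver (and therefore one seabolt connector/pool) per thread,
# instead of the memoized ActiveGraph::Base.driver shared across threads.
threads = (1..10).map do
  Thread.new do
    driver = Neo4j::Driver::GraphDatabase.driver(
      'bolt://localhost:7687',
      Neo4j::Driver::AuthTokens.basic('neo4j', 'password')
    )
    begin
      (1..100).each do
        session = driver.session
        begin
          session.read_transaction { |tx| tx.run('RETURN 1') }
        ensure
          session.close
        end
      end
    ensure
      driver.close
    end
  end
end
threads.each(&:join)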

@ekampp commented Aug 11, 2020

I can confirm that we're also seeing this problem on MRI with three threads running.

@klobuczek (Member Author)

If anyone could confirm where in the code the deadlocked threads are waiting, that would be very helpful in potentially finding a solution.
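In case it helps with gathering that information, here is a rough sketch (not from the driver itself) that dumps the Ruby-level backtrace of every live thread when the process receives TTIN; native frames inside seabolt will not show up here, so a gdb/lldb dump is still needed for those:

# Sketch: install a TTIN handler that prints each live thread's Ruby backtrace,
# then send `kill -TTIN <pid>` once the process looks hung.
trap('TTIN') do
  Thread.list.each do |thread|
    puts "Thread #{thread.object_id} status=#{thread.status.inspect}"
    puts (thread.backtrace || ['<no Ruby backtrace>']).join("\n")
    puts '-' * 40
  end
end

Sidekiq installs a similar TTIN handler by default, which is handy when the hang occurs inside workers.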

@wmene commented Sep 4, 2020

@klobuczek I took @efivash's hang2 method but ran it with 5 threads instead of 10. Here is the thread dump:
https://gist.githubusercontent.com/wmene/7935df5d564fae38060cfa04d2c34eb1/raw/16e8a403bf1f0d5a10269d74fda901ed844e1ed9/threaddump.txt

4 threads are waiting for the GVL and one is waiting for a mutex inside libseabolt. The five threads' backtraces start with (pry):3.

I'm not sure if I'm interpreting this correctly, but the deadlock might be happening because the thread that's waiting for the seabolt mutex has the GVL. The thread that is waiting on the mutex is stuck in the function BoltDirectPool_acquire: https://github.com/neo4j-drivers/seabolt/blob/1.7/src/seabolt/src/bolt/direct-pool.c#L234

One thing that is not clear: how did 3 of the threads get past the mutex in BoltDirectPool_acquire?

@RomainCavrot

Hello, I have the same problem. Do you think there is a chance it will be solved? And if so, when?

@sezny commented Dec 28, 2020

Hi! Any news or a temporary fix on this, guys?

@sezny commented Dec 28, 2020

It happens to me in a Rails app when Puma runs more than 1 thread and I make multiple requests at the same time. At some point the server hangs without logging anything and just times out for every request after that point (and a restart is needed).

So I have set Puma to 1 thread for now, but to use Sidekiq and restore Puma to 5 threads, a fix would be appreciated. I suppose it has something to do with the Neo4j/Bolt new-session / acquire-connection path, as suggested earlier by @wmene.
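For reference, a minimal sketch of the single-thread Puma workaround described above (the file location and worker count are assumptions, not something prescribed in this thread):

# config/puma.rb (sketch): one thread per worker so concurrent
# acquire_connection calls cannot collide, scaling with processes instead.
workers Integer(ENV.fetch('WEB_CONCURRENCY', 2))
threads 1, 1
preload_app!
port ENV.fetch('PORT', 3000)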

@joshjordan (Contributor)

We're running into this as well.

@wmene you're right on regarding the GVL and the mutex. I'm also not sure about those three threads, but the thread that IS waiting on the mutex has the GVL, and the other 3 are waiting to acquire the GVL. So you're deadlocked. We have the same problem, and it's ultimately because neo4j-ruby-driver has Ruby code attached to the seabolt logging functions. This is fine most of the time, but not fine within a mutex. Here's what happens:

  1. Thread A calls a function that takes a mutex, e.g. BoltDirectPool_acquire in your case and BoltDirectPool_release in mine.
  2. Thread A acquires the mutex.
  3. Thread A calls a logging function, e.g. BoltLog_warning in your case and BoltLog_info in mine.
  4. seabolt invokes the logging callback, which uses FFI to call Ruby code, eventually calling gvl_acquire_common; thus Thread A ends up waiting for the GVL.

Meanwhile:

  1. Thread B has the GVL and also calls BoltDirectPool_acquire
  2. Thread B reaches the mutex in seabolt and waits for Thread A to release it

So Thread A has the mutex and wants the GVL, while Thread B has the GVL and wants the mutex. Deadlock.

Our example is a bit easier to read since it's only two threads blocked up, not 4. Check out this gist: https://gist.github.com/joshjordan/a53b01b5feb7a0920efd4c3e8982e450 Here, Thread 10 is A from my example, and Thread 9 is B.
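To make the pattern concrete, here is a tiny self-contained illustrative sketch of the same lock-ordering deadlock, with two plain Ruby mutexes standing in for the seabolt pool mutex and the GVL (purely schematic, not the driver's code):

# Two locks acquired in opposite orders by two threads: the same shape of
# deadlock as described above. Ruby's deadlock detector aborts this toy
# version; in the real case the pool mutex is a native pthread mutex that
# Ruby cannot see, so the process simply hangs.
pool_mutex = Mutex.new # stands in for the seabolt BoltDirectPool mutex
gvl        = Mutex.new # stands in for the GVL

a = Thread.new do
  pool_mutex.synchronize do   # Thread A takes the pool mutex...
    sleep 0.1
    gvl.synchronize {}        # ...then needs the "GVL" to run the logging callback
  end
end

b = Thread.new do
  gvl.synchronize do          # Thread B holds the "GVL"...
    sleep 0.1
    pool_mutex.synchronize {} # ...and waits for the pool mutex held by A
  end
end

[a, b].each(&:join)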

@klobuczek is there any simple way to detach the logging functions before calling the connection pool acquire and release functions? Or does this change need to be made at the seabolt level?

@joshjordan (Contributor) commented Jan 24, 2021

This is particularly unfortunate because neo4j-ruby-driver defaults to level: ::Logger::ERROR and doesn't get any overrides from activegraph. So the processes are hanging/crashing just because neo4j-ruby-driver allows seabolt to make no-op calls to info and warn that don't even end up being output :(

On first look, it doesn't seem that it's safe to just disable logging during acquire/release, since multiple BoltConnection objects in seabolt may share the BoltCommunication object where the logging callbacks reside. I may be wrong about that, but I'm currently thinking we'll need to pass a boolean through from the connection pool to near the logging calls, to indicate "don't log, we're inside a mutex and it might not be safe".

The positive side of the no-op logging statements is that there isn't much impact from disabling logging. For anyone looking to escape this issue short-term, check out my branch at https://github.com/joshjordan/neo4j-ruby-driver/tree/temp-disable-logging-fix-deadlock which disables all logging callbacks from seabolt to neo4j-ruby-driver, since they aren't safe. This does, unfortunately, disable error logging from neo4j-ruby-driver as well, because I have not been able to verify whether error is ever called from a mutex-locked code path.
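If it helps anyone, a Gemfile sketch for pointing Bundler at that branch (standard Bundler git support; adjust to your own setup):

# Gemfile sketch: pin the driver to the branch that disables the seabolt
# logging callbacks, until a proper fix lands upstream.
gem 'neo4j-ruby-driver',
    github: 'joshjordan/neo4j-ruby-driver',
    branch: 'temp-disable-logging-fix-deadlock'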

@klobuczek (Member Author)

Thank you @joshjordan for the spot-on analysis. There is no easy solution to the problem. We would have to implement the logging function in C and make it write to the same file as the Ruby logger.

@klobuczek (Member Author)

The logging has been removed from the FFI version of the driver (1.7). The error should not happen anymore, with the caveat of reduced logging.

@klobuczek closed this Sep 26, 2021