Lost connections are not retried #989

akadusei · 2023-12-05T19:11:25Z

Lines 188 to 201 in 0b9797b

    
           private def with_connection(&) 
        
             key = object_id 
        
             connections[key] ||= db.checkout 
        
             connection = connections[key] 
        
             begin 
        
               yield connection 
        
             ensure 
        
               if !connection._avram_in_transaction? 
        
                 connection.release 
        
                 connections.delete(key) 
        
               end 
        
             end 
        
           end

In the lines of code above, Avram::Database#with_connection helper checks out a connection and checks it in after use. But if the connection is lost/broken, the query is never retried, leading to DB::ConnectionLost errors:

(DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'exec_query'
  from /tmp/lucky/lib/avram/src/avram/queryable.cr:291:7 in 'results'
  from /usr/share/crystal/src/indexable.cr:749:5 in 'first?'
  from /tmp/lucky/lib/shield/src/presets/user.cr:34:3 in 'run_operation'
  from /tmp/lucky/src/actions/current_login/create.cr:4:3 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/renderable.cr:130:16 in 'perform_action'
  from /tmp/lucky/lib/lucky/src/lucky/route_handler.cr:10:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /usr/share/crystal/src/time.cr:357:5 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/force_ssl_handler.cr:36:8 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/request_id_handler.cr:24:5 in 'call'
  from /usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'handle_client'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

This affects multiple apps of mine in a staging environment. I use a serverless database provide who suspends the server after a period of inactivity (it boots up again after a connection attempt is made). This breaks the connection if no interaction is made to the database for a while.

I've applied a patch and tested for about 3 days now; no errors yet. I used to get those errors almost daily from both the app server and Mel workers attempting to reach the database. Here is a trace from Mel:

(DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'run'
  from /tmp/lucky/lib/mel/src/worker/instant_task.cr:4:3 in '->'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

My patch looks like this:

  private def with_connection(&)
    key = object_id

    db.retry do # Retries lost connections according to the `retry_*` parameters in the connection URL.
      connections[key] ||= db.checkout
      connection = connections[key]

      begin
        yield connection
      ensure
        if !connection._avram_in_transaction?
          connection.release
          connections.delete(key)
        end
      end
    end
  end

Or maybe it's something else I've missed?

The text was updated successfully, but these errors were encountered:

jwoertink · 2023-12-05T19:27:37Z

I've also noticed this in my apps, but haven't been able to manually replicate to narrow it down. I guess I thought that crystal-db just did the retry stuff internally, but your patch makes sense.

I'm not sure what edge cases we'd have to consider here, but I'm down to give it a shot.

akadusei · 2023-12-05T20:33:27Z

I guess I thought that crystal-db just did the retry stuff internally,...

From Brian Cardiff's comments here and here, it does until you do an explicit call to #checkout.

I'm not sure what edge cases we'd have to consider here...

I have a PR that passes all existing tests, but I'm not sure how to add a test case for this one. I should probably wait for a few more days to see if the issue does not recur in my staging apps to be really sure.

jwoertink · 2023-12-05T21:53:13Z

Ah, ok. It looks like your patch is probably good then.

Oh, what happens if you don't have any retry_ params set? Does it still do a retry, or is it just ignored in that case?

akadusei · 2023-12-06T15:45:46Z

Oh, what happens if you don't have any retry_ params set?

The default retry of 1 is used. If exceeded, DB::PoolRetryAttemptsExceeded exception is raised:

https://github.com/crystal-lang/crystal-db/blob/06df272740fb9141050681ae916c465cc8e70584/src/db/pool.cr#L239

In all cases, the app should work OK if connection is never lost; retry kicks in only after all connections in the pool are broken. I've just rerun the script/test with retry_attempts=0 for TestDatabase and everything checks out.

I've also not received any error report from staging yet, so seems we're good. I'm going to wait for 1 more day.

jwoertink · 2023-12-06T16:03:35Z

That's great news! Thanks for investigating this.

akadusei mentioned this issue Dec 7, 2023

Retry lost connections #990

Merged

jwoertink closed this as completed in #990 Dec 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lost connections are not retried #989

Lost connections are not retried #989

akadusei commented Dec 5, 2023

jwoertink commented Dec 5, 2023

akadusei commented Dec 5, 2023

jwoertink commented Dec 5, 2023

akadusei commented Dec 6, 2023

jwoertink commented Dec 6, 2023

Lost connections are not retried #989

Lost connections are not retried #989

Comments

akadusei commented Dec 5, 2023

jwoertink commented Dec 5, 2023

akadusei commented Dec 5, 2023

jwoertink commented Dec 5, 2023

akadusei commented Dec 6, 2023

jwoertink commented Dec 6, 2023