
Lost connections are not retried #989

Closed
akadusei opened this issue Dec 5, 2023 · 5 comments · Fixed by #990

Comments

@akadusei
Contributor

akadusei commented Dec 5, 2023

avram/src/avram/database.cr

Lines 188 to 201 in 0b9797b

private def with_connection(&)
  key = object_id
  connections[key] ||= db.checkout
  connection = connections[key]

  begin
    yield connection
  ensure
    if !connection._avram_in_transaction?
      connection.release
      connections.delete(key)
    end
  end
end

In the lines of code above, the Avram::Database#with_connection helper checks out a connection and checks it back in after use. But if the connection is lost or broken, the query is never retried, leading to DB::ConnectionLost errors:

(DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'exec_query'
  from /tmp/lucky/lib/avram/src/avram/queryable.cr:291:7 in 'results'
  from /usr/share/crystal/src/indexable.cr:749:5 in 'first?'
  from /tmp/lucky/lib/shield/src/presets/user.cr:34:3 in 'run_operation'
  from /tmp/lucky/src/actions/current_login/create.cr:4:3 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/renderable.cr:130:16 in 'perform_action'
  from /tmp/lucky/lib/lucky/src/lucky/route_handler.cr:10:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /usr/share/crystal/src/time.cr:357:5 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:30:7 in 'call'
  from /usr/share/crystal/src/http/server/handler.cr:32:7 in 'call_next'
  from /tmp/lucky/lib/lucky/src/lucky/force_ssl_handler.cr:36:8 in 'call'
  from /tmp/lucky/lib/lucky/src/lucky/request_id_handler.cr:24:5 in 'call'
  from /usr/share/crystal/src/http/server/request_processor.cr:51:11 in 'handle_client'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

This affects several of my apps in a staging environment. I use a serverless database provider that suspends the server after a period of inactivity (it boots up again once a connection attempt is made). This breaks existing connections if the database isn't touched for a while.

I've applied a patch and have been testing it for about three days now; no errors yet. I used to get those errors almost daily, from both the app server and Mel workers attempting to reach the database. Here is a trace from Mel:

(DB::ConnectionLost)
  from /tmp/lucky/lib/pg/src/pg/statement.cr:11:5 in 'perform_query'
  from /tmp/lucky/lib/db/src/db/statement.cr:93:9 in 'run'
  from /tmp/lucky/lib/mel/src/worker/instant_task.cr:4:3 in '->'
  from /usr/share/crystal/src/fiber.cr:146:11 in 'run'
  from ???

My patch looks like this:

  private def with_connection(&)
    key = object_id

    db.retry do # Retries lost connections according to the `retry_*` parameters in the connection URL.
      connections[key] ||= db.checkout
      connection = connections[key]

      begin
        yield connection
      ensure
        if !connection._avram_in_transaction?
          connection.release
          connections.delete(key)
        end
      end
    end
  end
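
For context, DB::Database#retry defers to the pool's retry settings, which crystal-db reads from the retry_attempts and retry_delay query parameters of the connection URL. Here is a minimal sketch of such a URL (the database name and values below are made up for illustration):

  require "db"
  require "pg"

  # Hypothetical URL: the pool retries a lost connection up to
  # retry_attempts times, sleeping retry_delay seconds between attempts.
  DB.open("postgres://localhost/my_app?retry_attempts=3&retry_delay=0.5") do |db|
    db.scalar "SELECT 1" # retried transparently if the pooled connection is broken
  end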

Or maybe it's something else I've missed?

@jwoertink
Member

I've also noticed this in my apps, but haven't been able to replicate it manually to narrow it down. I guess I thought that crystal-db just did the retry stuff internally, but your patch makes sense.

I'm not sure what edge cases we'd have to consider here, but I'm down to give it a shot.

@akadusei
Contributor Author

akadusei commented Dec 5, 2023

I guess I thought that crystal-db just did the retry stuff internally,...

From Brian Cardiff's comments here and here, it does until you do an explicit call to #checkout.
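
To illustrate the distinction (a rough sketch, not Avram code; the URL is hypothetical): statements issued directly on the database go through the pool and get retried, while an explicitly checked-out connection becomes the caller's responsibility and does not:

  require "db"
  require "pg"

  DB.open("postgres://localhost/my_app") do |db|
    # Issued through the pool: a lost connection is retried per the retry_* settings.
    db.scalar "SELECT 1"

    # Explicitly checked out: no retry happens here, so if the server has
    # dropped this connection, the statement raises DB::ConnectionLost.
    connection = db.checkout
    begin
      connection.scalar "SELECT 1"
    ensure
      connection.release
    end
  end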

I'm not sure what edge cases we'd have to consider here...

I have a PR that passes all existing tests, but I'm not sure how to add a test case for this one. I should probably wait a few more days, to be really sure the issue doesn't recur in my staging apps.

@jwoertink
Member

Ah, ok. It looks like your patch is probably good then.

Oh, what happens if you don't have any retry_ params set? Does it still do a retry, or is it just ignored in that case?

@akadusei
Contributor Author

akadusei commented Dec 6, 2023

Oh, what happens if you don't have any retry_ params set?

The default of 1 retry attempt is used. If that's exceeded, a DB::PoolRetryAttemptsExceeded exception is raised:

https://github.com/crystal-lang/crystal-db/blob/06df272740fb9141050681ae916c465cc8e70584/src/db/pool.cr#L239

In all cases, the app should work fine if the connection is never lost; retry kicks in only after all connections in the pool are broken. I've just rerun script/test with retry_attempts=0 for TestDatabase, and everything checks out.
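
As a rough sketch of that behavior (hypothetical URL, not the actual script/test setup): once every connection in the pool is broken and the retry budget is used up, the failure surfaces instead of being retried further:

  require "db"
  require "pg"

  # Hypothetical illustration: with all pooled connections broken and the
  # retry attempts exhausted, the pool raises rather than retrying again.
  DB.open("postgres://localhost/my_app?retry_attempts=1&retry_delay=0.2") do |db|
    begin
      db.scalar "SELECT 1"
    rescue ex : DB::PoolRetryAttemptsExceeded
      puts "Pool retries exhausted: #{ex.message}"
    end
  end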

I've also not received any error reports from staging yet, so it seems we're good. I'm going to wait one more day.

@jwoertink
Member

That's great news! Thanks for investigating this.
