Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers.ErrBadConn returned for DB.QueryRowContext.Scan when context is cancelled #1062

Closed
mjl- opened this issue Oct 19, 2021 · 6 comments
Closed

Comments

@mjl-
Copy link
Contributor

mjl- commented Oct 19, 2021

I have been seeing unexpected "driver: bad connection" (driver.ErrBadConn) errors in my logging. In my case, these are returned by calls to database/sql's DB.QueryRowContext.Scan (and likely also on Tx's) when their context is cancelled. Cancelling contexts happens for me in practice because I pass contexts from http.Request's on to lib/pq. I believe DB.QueryRowContext.Scan should not be made to return driver.ErrBadConn when contexts are cancelled.

I'll add a small reproducer after I've created the issue.

I looks the following ordered events take place in lib/pq:

  • In conn.QueryContext, conn.query() returns successfully.
  • The goroutine from conn.watchCancel() (scheduled in conn.QueryContext) handles the cancellation, calling conn.setBad() to mark the connection.
  • rows.Next (for Scan after DB.QueryRowContext) checks the connection with conn.getBad() and returns a driver.ErrBadConn.

I would expect the Scan on the sql Row to return context.Canceled in this case.

Perhaps lib/pq's connections could start to keep track not just whether connections are bad, but also the error to return to future calls on the connection. For expired contexts, the errors would become eg context.Canceled or context.DeadlineExceeded.

@lunemec
Copy link

lunemec commented Nov 3, 2021

We are facing similar issue, cancelling the context by http server when user cancels the request triggers driver: bad connection to be raised. It seems that in the end it causes sql package to re-connect over and over again, until for some reason the connection pool is exhausted (that is most likely caused by different part of the code, maybe even stdlib or our own).

@mjl-
Copy link
Contributor Author

mjl- commented Nov 3, 2021

FWIW, I haven't seen unexpected connection pool exhaustion or problematic reconnects.

The workaround I put in my code is to check for the database/sql/driver.ErrBadConn explicitly and treating it similarly to how I treat context.Canceled errors: marking it a user error, not a server error. That saves me getting alerted.

otan pushed a commit that referenced this issue Nov 8, 2021
…d make rows.Next return it (#1064)

* add reproducer for issue 1062

see #1062

* Keep track of an error for a connection.

Instead of just whether the connection is ErrBadConn. Often times, the error
will still be ErrBadConn. But for expired/cancelled contexts, it will be the
error for the context. Most functions still return ErrBadConn per the
database/sql/driver contract ("ErrBadConn should only be returned from [...] a
query method"). For rows.Next() we return the context-related error.

The database/sql/driver contract doesn't look very precise. Is Next a "query
method" and should database/sql handle ErrBadConns when Next returns them?

Do we have more functions that should return the canceled-context error
message?

* in test for QueryRowContext, fail for all unexpected errors, not just for the one unexpected error

* in TestContextCancel*, accept context.Canceled as valid error

* explicitly test for driver.ErrBadConn in test that breaks the connection

feedback from otan

* move the mutex-protected data in a struct with the mutex, and fix an unsafe usage.

there may be unsafeness in this file.

feedback from otan
@fho
Copy link

fho commented Nov 11, 2021

This issue was solved in release 1.10.4 via via #1064 ?

@otan
Copy link
Collaborator

otan commented Nov 11, 2021

Should be!

@otan otan closed this as completed Nov 11, 2021
@qburst-sreeraj
Copy link

@otan @mjl-
We still have this issue.

@mjl-
Copy link
Contributor Author

mjl- commented Jun 17, 2022

I remember from last looking at the code thinking that lib/pq may not be adhering fully to the requirements set by database/sql about when it returns ErrBadConn. You could investigate in that direction...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants