handling dead connections and broken pipes #74
Comments
Are you connecting over a very unstable network? Lost connections should be a very rare occurrence under normal circumstances. Especially this code below:
    conn, err := pool.Acquire()
    if err == pgx.ErrDeadConn {
        if maxRetries > 0 {
            glog.Debugf("%v, try acquiring a non-dead connection", err)
            return doWithConnection(pool, maxRetries-1, work)
        }
    }
pool.Acquire should only return that error if it needed to establish a new connection and it failed, which, assuming it was previously able to connect, would likely indicate a network outage. If it's not giving you a good connection for any other reason then that is potentially an area for improvement in pgx. As far as the actual retry logic, it absolutely needs to wrap the call to
I don't think pgx should handle retry logic internally. There are many potential error cases that need careful consideration. A generalized retry framework that worked as expected in the majority of situations would be difficult to get right. For most users, lost connections that can immediately be recovered from by reconnecting are rare enough that it is better to have the operation fail than to attempt an automatic redo and potentially cause unexpected side-effects.
I can confirm that at least part of these problems are caused by unreliable networking (a wireless connection for development). Especially during development (with interrupting meetings etc.), it can happen that my service running locally will not get any requests for the coming hour. I recall that these situations were not handled by the package, resulting in stale connections. I will reconsider the handling of failures on a type-by-type basis and will test them individually. For this purpose, I created a small TCP proxy service with an HTTP interface to inject such failure situations (closing the client, closing the remote, delaying response data...).
This occurrence is very common if you connect to GCE Cloud SQL.
I have not used GCE Cloud SQL, so I don't know how its idle detection works, but pgx does utilize TCP keep-alive by default. https://github.com/jackc/pgx/blob/master/conn.go#L143 As far as the general case of lost connection and automatic retry, my opinion is still that retry logic would tend to have application-specific requirements and a generalized framework in the database driver layer would be very difficult to get right. I haven't given this much thought, but possibly the database driver or connection pool could reduce the number of lost connection failures the application has to deal with by occasionally "pinging" the server with "select 1". That could tend to detect some/most dead connections before the application encounters an error. I'm not sure of all the implications of that, though.
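The "select 1" ping idea could be sketched roughly as below. This is not pgx API: `pinger`, `sweep`, and `fakeConn` are hypothetical names for illustration, the pool is reduced to a plain slice of idle connections, and `Ping` is assumed to run a trivial query such as "select 1".

```go
package main

import (
	"errors"
	"fmt"
)

// pinger is a hypothetical stand-in for an idle pooled connection.
// Ping is assumed to run a trivial query such as "select 1".
type pinger interface {
	Ping() error
}

// sweep pings every idle connection once and keeps only the ones that
// answered. A real pool would run this periodically in the background,
// close the dropped connections, and guard the idle list with a lock.
func sweep(idle []pinger) (alive []pinger, dropped int) {
	for _, c := range idle {
		if err := c.Ping(); err != nil {
			dropped++
			continue
		}
		alive = append(alive, c)
	}
	return alive, dropped
}

// fakeConn simulates a connection whose socket may have died silently.
type fakeConn struct{ dead bool }

func (c *fakeConn) Ping() error {
	if c.dead {
		return errors.New("write: broken pipe")
	}
	return nil
}

func main() {
	idle := []pinger{&fakeConn{}, &fakeConn{dead: true}, &fakeConn{}}
	alive, dropped := sweep(idle)
	fmt.Println("alive:", len(alive), "dropped:", dropped)
}
```

A sweep like this only narrows the window: a connection can still die between the last ping and the next query, so the application still has to handle the error.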
The database/sql compatible pgx driver does return ErrBadConn as required, and when the stdlib's database/sql APIs get that error from the underlying driver, they do retry the operations (currently, up to 10 times). It'd be nice if the native pgx APIs behaved the same way.
Really surprising to me that database/sql has retry logic built-in. I investigated what database/sql is doing, and was able to produce the type of errors I was concerned about (see go_database_sql_retry_bug for my test case). I reported this bug on the Go project. You can check there for more details and to see where the Go project ends up on this issue. I still hold to the opinion that an automatic retry system cannot be correct. If there is a conflict between programmer ease-of-use and correctness, I will choose correctness. Automatic retry can sabotage correctness. That said, I'm not opposed to an opt-in system for retries. It might be pretty simple to have a retry wrapper object around ConnPool. Then the underlying system is correct, but this retry object can be used as desired. There's also still the possibility of automatically "pinging" the server with "select 1" on occasion. This could both keep idle connections from being killed, and detect dead connections and reconnect in the library layer without the application's notice.
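An opt-in retry wrapper of the kind mentioned above might look something like this minimal sketch. Nothing here is pgx API: `retrier`, `do`, and `errDeadConn` are hypothetical names, and the caller is assumed to pass only operations that are safe to run more than once.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeadConn is a hypothetical sentinel standing in for a driver's
// "connection is dead" error.
var errDeadConn = errors.New("conn is dead")

// retrier is a hypothetical opt-in wrapper. Using it is the caller's
// promise that the operation is idempotent.
type retrier struct {
	maxAttempts int
}

// do runs op until it succeeds, fails with something other than
// errDeadConn, or maxAttempts is used up. It always runs op at least once.
func (r retrier) do(op func() error) error {
	attempts := r.maxAttempts
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if !errors.Is(err, errDeadConn) {
			return err // nil on success, or an error a retry will not fix
		}
	}
	return err
}

func main() {
	calls := 0
	err := retrier{maxAttempts: 3}.do(func() error {
		calls++
		if calls < 3 {
			return errDeadConn // the first two attempts hit a dead connection
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Keeping the retry in a separate object leaves the default behaviour unchanged: code that does not ask for retries still sees the first failure.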
Would it be reasonable to change the behaviour of Acquire so that it will try to find a live connection? Currently, if you have a connection pool with a number of connections in it, and those connections are then disconnected (say, because the Postgres server went away), they remain in the pool and are returned to users via Acquire. Users then have to check conn.IsAlive and call pool.Release (which will de-allocate the dead connection), then continue to acquire connections this way until a new connection is established with the backend. It seems to me it would be both safe and reasonable for the connection pool to encapsulate this behaviour, attempting to exhaust the pool for a live connection before trying to establish a new one and returning an error if that fails. It would thus only ever return connections that were live at the time of acquisition. I might put together a pull request to better illustrate the behaviour.
The ConnPool checks IsAlive when a connection is released (https://github.com/jackc/pgx/blob/master/conn_pool.go#L114). So bad connections are never returned to the pool. Conn.IsAlive only checks to see if the connection has marked itself as dead (https://github.com/jackc/pgx/blob/master/conn.go#L677). So if the connection was returned to the pool successfully, then the connection will always think it is alive on checkout. The fundamental problem is there does not appear to be a way to determine if a net.Conn has been dropped without actually calling Read or Write. So I don't know a way to ensure that the underlying connection is valid on ConnPool.Acquire. Where retry logic could be safely introduced is in ConnPool.Begin. That function could retry with a new connection when its internal Conn.Begin fails. As an aside, my reply about a possible error in the database/sql's retry logic was not entirely correct. database/sql has a narrow, correct case where it will do a retry. The bug appears to be in the pq library. It returns the error code that signals to database/sql that it can do a retry when it is not safe to do so.
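The point about net.Conn can be illustrated with the standard library alone. The sketch below uses net.Pipe as an in-memory stand-in for a TCP connection; `peerGone` and `demo` are hypothetical helpers, not part of pgx. The closed peer is only noticed once Read is actually called.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// peerGone tries to read one byte with a short deadline and reports whether
// the read ended in io.EOF, meaning the other side closed the connection.
// Caveat: on a live connection a successful read would consume a byte of
// protocol data, so this is an illustration, not a usable health check.
func peerGone(c net.Conn, wait time.Duration) bool {
	c.SetReadDeadline(time.Now().Add(wait))
	defer c.SetReadDeadline(time.Time{}) // clear the deadline again
	_, err := c.Read(make([]byte, 1))
	return errors.Is(err, io.EOF)
}

// demo builds two in-memory connections, closes the remote end of the
// second one, and probes both.
func demo() (liveGone, closedGone bool) {
	liveClient, liveServer := net.Pipe()
	defer liveClient.Close()
	defer liveServer.Close()

	deadClient, deadServer := net.Pipe()
	defer deadClient.Close()
	deadServer.Close() // the peer goes away; deadClient is not told

	wait := 20 * time.Millisecond
	return peerGone(liveClient, wait), peerGone(deadClient, wait)
}

func main() {
	liveGone, closedGone := demo()
	fmt.Println("live peer gone:", liveGone, "closed peer gone:", closedGone)
}
```

On a real network the situation is typically worse than in this in-memory case: if the peer vanishes without closing the socket cleanly, a Read may simply block until a timeout or keep-alive fires rather than returning EOF.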
Ahh I see, that makes sense. I was under the impression Conn.IsAlive checked if the socket was still good but I guess I should have validated that assumption. I guess I will write a wrapper that checks out a connection, attempts Begin and continues to do that until either Begin succeeds or Acquire fails (which would indicate we couldn't open a new connection).
If you are starting a transaction immediately after checking out a connection, consider using ConnPool.Transaction instead. It would save you a few lines of code. Also, ConnPool.Begin could be altered to safely retry automatically -- I just haven't gotten around to it. If that change was made then you may not need to use a wrapper (a wrapper still could be valuable for retrying operations known to be idempotent).
Commits 4868929 through 93aa2b2 (thanks @josephglanville) add retry logic to ConnPool Begin. I think this is as good as we can get for safely and automatically retrying on dead connections. I'm going to close this issue. If someone thinks of another case we can safely handle, feel free to reopen this issue or create a new one.
@jackc Do normal (non-pooled) connections ever retry? I'd like to use some
No. The only retry logic in pgx is
Great, thanks
I just wanted to add that I ran into dead connections today within a Lambda function connecting to RDS. It's the first time it's happened in months of operation. This is definitely not a problem with this library, just something we need to make sure we handle on our end. This is only a problem if you connect outside of the lambda handler, if you connect and disconnect within each run you wouldn't run into this. I'm also just connecting to the database directly, rather than acquiring a connection from the pool. This might be a mistake but it seemed pointless to do pooling inside lambdas. The way this surfaced itself was in the form of 2 errors:
If anyone knows the best way to handle this situation please let me know!
I have similar issues, also with RDS. No lambda in my case, just traditional long-living deployment. We're using connection pool.
You know, I haven't had a chance to test this yet, but I think using Pooling with a maxConnection of 1 would resolve this
Seems like it should, however it won't fit our needs in terms of performance.
Also, last commit is about just this error, and we've already updated our pgx everywhere. You should try that too!
Per jackc/pgx#74 (comment) moving reads into a transaction will initiate retries on network-level errors.
Based on our experiences with this package, we have come up with what I consider a workaround for handling certain errors returned. In our apps, we use the ConnPool and therefore have to handle dead connections. More recently, we found that the underlying connection can report a broken pipe and EOF. We think these errors are recoverable by the app by re-acquiring a new connection.
The function below is what I currently use in all Dao functions. Find below an example of this.
Example:
The question is whether the pgx package should handle this internally and, if not, whether this is the correct way to do so.
Thanks.
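For readers following along, telling a broken pipe or EOF apart from other errors could be sketched as below. This is not the original poster's function: `isConnLost` and `sampleErrors` are hypothetical names, and the check assumes the driver returns or wraps the standard library errors, which may not hold for every pgx version.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"syscall"
)

// isConnLost reports whether err looks like a dropped connection: an EOF
// on read, or a broken pipe / connection reset on write. It relies on
// errors.Is, so it only sees errors that are returned as-is or wrapped.
func isConnLost(err error) bool {
	return errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, syscall.ECONNRESET)
}

// sampleErrors builds three illustrative errors: a broken pipe as the net
// package would report it, a wrapped EOF, and an unrelated query error.
func sampleErrors() (brokenPipe, eof, other error) {
	brokenPipe = &net.OpError{
		Op:  "write",
		Net: "tcp",
		Err: os.NewSyscallError("write", syscall.EPIPE),
	}
	eof = fmt.Errorf("query failed: %w", io.EOF)
	other = errors.New(`syntax error at or near "selct"`)
	return brokenPipe, eof, other
}

func main() {
	brokenPipe, eof, other := sampleErrors()
	fmt.Println(isConnLost(brokenPipe), isConnLost(eof), isConnLost(other))
}
```

Classifying the error only says the connection is gone; whether the failed operation can be repeated on a fresh connection still depends on whether it is idempotent.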