Draining SQL connections, as is done during CRDB graceful updates, can lead to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB #67071
Comments
I have various Qs about this still:
Do any connection pool implementations avoid this failure mode by checking for a closed connection (which IIUC doesn't require sending / receiving over the network so could be fast) & retrying without bubbling up an error to the user if one is found? That seems possible to me still & could explain why we don't see evidence of this happening widely, despite many customers running in production.
I don't really understand what Hikari & other connection pool implementations do here. When they hit I/O timeout errors specifically, they then check the health of the connection & if that second check fails, then they recycle the connection? In a way this ties into my first Q. I wouldn't mind links to the code to answer this second Q. Perhaps @rafiss has thoughts (tho no rush Rafi... I know you are busy).
One thing missing from the issue summary here is that connection pools already check for the health of a connection before using it. Specifically, here is the order of events when the errors happen.
Here is the getConnection code in Hikari. Note that it checks whether the connection is alive before handing it out. After the connection pool gives the connection to the application, the connection pool can no longer perform the check you are proposing. The application itself could perform the check each time it tries to use the connection (a connection which was already deemed "healthy"), but that would be incredibly onerous in the code. But that brings me to the next point -- how would doing the check even help? The client will end up in the same situation.
The important point is that there's no general retry logic that can be used. If the application is in the middle of a transaction when it gets the error, then the connection pool wouldn't know that it has to retry all the previous statements from that transaction. Or if the application had already finished one transaction, but got the error when doing a second, the connection pool wouldn't know what's already succeeded and what hasn't.
The pool is always checking the health of a connection when it gives it to the application. Specifically in Hikari, this block will return false for isConnectionAlive. Also, more to your question, if an error happens while the connection is in use, then the logic of Hikari's ProxyConnection will evict that connection from the pool. It would be helpful to understand what we want out of this issue beyond what is described in #66319.
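To make the race concrete, here is a minimal Go sketch (not Hikari's actual Java implementation) of the check-on-checkout pattern described above; `pooledConn`, `aliveBypassWindow`, and `isAlive` are stand-ins for Hikari's connection entries, `aliveBypassWindowMs`, and `isConnectionAlive`:

```go
// Minimal sketch (not Hikari's code) of "check aliveness on checkout", and
// why it cannot fully protect the caller from a server-side close.
package pool

import (
	"errors"
	"time"
)

type pooledConn struct {
	lastChecked time.Time
	closed      bool // in reality, discovered by probing the socket / running a trivial query
}

type Pool struct {
	idle              []*pooledConn
	aliveBypassWindow time.Duration // hypothetical knob mirroring Hikari's aliveBypassWindowMs
}

// isAlive stands in for the pool's liveness probe. It reports the
// connection's state *at this instant* only.
func (p *Pool) isAlive(c *pooledConn) bool { return !c.closed }

// Get hands an idle connection to the application, re-checking liveness only
// if the last check is older than aliveBypassWindow.
func (p *Pool) Get() (*pooledConn, error) {
	for len(p.idle) > 0 {
		c := p.idle[0]
		p.idle = p.idle[1:]
		if time.Since(c.lastChecked) > p.aliveBypassWindow && !p.isAlive(c) {
			continue // evict the dead conn and try the next one; the app never sees an error
		}
		c.lastChecked = time.Now()
		// RACE: the server (e.g. a draining CRDB node) may close this conn
		// between here and the application's first use of it. The pool has
		// already handed it out, so the application sees the error.
		return c, nil
	}
	return nil, errors.New("no idle connections")
}
```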
I think we want a shared mental model of the problem, rather than a feature request that may improve the situation.
The fact that the need to close a conn in the middle of a txn leads to an unavailability blip makes sense to me. Customers need to not have overly long-running queries if they want zero downtime. We can make that recommendation, that is, that txns must finish within query_wait. I am thinking only about idle conns. The closing of an idle conn can currently lead to an availability blip, and I don't understand why this is necessary.
Perhaps this is the answer to my Q but I don't understand it.
Or this? Is a specific connection literally always returned to the application in Hikari land? Let's look at https://golang.org/pkg/database/sql/#DB. You can ask for a specific conn via https://golang.org/pkg/database/sql/#DB.Conn, but why would you do that? That opens you up to this failure mode & also to failures when nodes die randomly. But if you always query via the *sql.DB handle itself, the pool can pick (and retry) connections for you; see the sketch below.
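For concreteness, a hypothetical snippet (not from this thread) contrasting the two paths; the connection string is made up:

```go
// Querying through *sql.DB lets the pool pick a connection per call and
// transparently swap in another one if the driver reports it as bad.
// DB.Conn pins one specific connection and exposes you directly to its closure.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" driver
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var one int

	// Path 1: go through *sql.DB on every call.
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		log.Printf("query via *sql.DB failed: %v", err)
	}

	// Path 2: pin a specific connection via DB.Conn; if the server closes it,
	// every use of `conn` fails until the application itself recovers.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if err := conn.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		log.Printf("query via pinned *sql.Conn failed: %v", err)
	}
}
```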
Trying to understand whether this issue could hit a user of golang's database/sql.
Drilling deeper into https://golang.org/src/database/sql/sql.go: there is a retry loop for getting a usable connection.
Focusing on the "use an existing free conn" bit: returning driver.ErrBadConn looks about right, specifically this part in the docs, "... to signal a bad connection", from https://golang.org/src/database/sql/driver/driver.go.
So a user of database/sql who always queries via the *sql.DB handle gets these transparent retries. OTOH, if you call DB.Conn and hold onto a specific connection, you don't. Not sure I have this right but I am learning things at least...
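For reference, a simplified paraphrase (not the exact stdlib source) of the retry loop being described: `database/sql` retries with cached connections while the driver reports `driver.ErrBadConn`, then falls back to forcing a brand-new connection:

```go
// Simplified paraphrase of the retry behavior in Go's database/sql.
package sqlish

import (
	"context"
	"database/sql/driver"
)

type connReuseStrategy int

const (
	cachedOrNewConn connReuseStrategy = iota // prefer an idle pooled conn
	alwaysNewConn                            // dial a fresh conn
)

const maxBadConnRetries = 2

type DB struct{}

// query stands in for db.query in the stdlib: grab a conn per the strategy,
// send the statement, and surface driver.ErrBadConn if the conn was unusable.
// (Body elided in this sketch.)
func (db *DB) query(ctx context.Context, query string, strategy connReuseStrategy) error {
	return nil
}

func (db *DB) QueryContext(ctx context.Context, query string) error {
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		err = db.query(ctx, query, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break // success, or a "real" error that is bubbled up to the caller
		}
	}
	if err == driver.ErrBadConn {
		// Last resort: skip the cached conns entirely and dial a new connection.
		return db.query(ctx, query, alwaysNewConn)
	}
	return err
}
```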
Also, @rafiss, after reading through the above, I'll update the issue summary. Wondering if you have any thoughts about the behavior I walked through above.
Now I am looking at the pgx driver's database/sql adapter. Look at this beautiful thing:
Combine the above with the database/sql retry behavior discussed earlier.
I think if you use database/sql with the pgx driver, a query on an idle conn that CRDB has already closed mostly gets retried transparently, so you don't hit the failure mode.
(Maybe there is a race if the query happens right as the conn is being closed, meaning the conn is not fully closed yet, tho. Sounds like it as per "if we got a network error before we had a chance to send the query".) Interestingly, I think you CAN hit the failure mode when trying to open a txn, as there is no returning of driver.ErrBadConn on that path.
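A sketch of the driver-side pattern behind the quoted comment (not the actual pgx source; in pgx the real helper is, IIUC, `pgconn.SafeToRetry`): only map a failure to `driver.ErrBadConn` when the error happened before anything hit the wire, because only then is a transparent retry safe:

```go
// Driver-side sketch: report ErrBadConn only when the request never reached
// the server, so database/sql's silent retry cannot duplicate work.
package driverish

import (
	"context"
	"database/sql/driver"
	"errors"
)

// errNotSent is a stand-in for the driver's knowledge that the request never
// reached the server (e.g. the write failed immediately on a closed socket).
var errNotSent = errors.New("network error before the query was sent")

// safeToRetry plays the role of a helper like pgconn.SafeToRetry: true only
// when the server cannot have started executing the statement.
func safeToRetry(err error) bool {
	return errors.Is(err, errNotSent)
}

type conn struct{} // underlying protocol connection elided

func (c *conn) QueryContext(ctx context.Context, query string, args []driver.NamedValue) (driver.Rows, error) {
	rows, err := c.doQuery(ctx, query, args)
	if err != nil {
		if safeToRetry(err) {
			// database/sql will silently retry on another connection.
			return nil, driver.ErrBadConn
		}
		// The query may have been sent (and possibly executed): do NOT report
		// ErrBadConn, or database/sql could re-run a non-idempotent statement.
		return nil, err
	}
	return rows, nil
}

func (c *conn) doQuery(ctx context.Context, query string, args []driver.NamedValue) (driver.Rows, error) {
	// Elided: write the query to the wire and read the response.
	return nil, errNotSent
}
```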
I updated the issue summary as per feedback from Rafi!
A very long term idea: we could think about implementing live TCP migrations. There are systems that allow you to move a TCP connection from one node to another, with no interruption to the other end. https://twitter.com/colmmacc/status/1407820679962054662?s=20
Sort of relevant:
In other words, yes, there is a race.
@rafiss @joshimhoff for my own education, can you tell me how this issue and #66319 differ from each other?
I think #66319 proposes a modification to CRDB (see title especially) that may help with the issue outlined in this ticket. If we want to merge things somehow, that SGTM, tho I think it's reasonable not to, given the above. I feel even more strongly that the details in this ticket re: the nature of the problem should be stored somewhere, as it took a long time for me, Rafi, & Oliver to get on the same page about the problem, as it is rather subtle.
I wrote a toy test program using the `pgx` driver with `database/sql`, and found that I can hit the error. [Collapsed section: `pgx` with `database/sql` code]
Note the error does not happen when using [...]. The error does also happen when using `pgxpool`. [Collapsed section: `pgx` with `pgxpool` code]
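Since the collapsed program itself isn't reproduced here, below is a rough sketch (not the original program; connection string and loop shape are guesses) of a repro along the lines described: hammer a draining node with short queries through `database/sql` + pgx and log any error the application sees:

```go
// Rough repro sketch: loop short queries against a node while it drains.
// Any error printed here is an application-visible blip.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(10)

	ctx := context.Background()
	for i := 0; ; i++ {
		var one int
		if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
			// Run `cockroach node drain` (or a rolling restart) while this loops.
			log.Printf("iteration %d: %v", i, err)
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```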
NICE REPRO!
Does it repro both when running queries and when opening txns?
That is, perhaps the pgx driver with database/sql doesn't retry around opening a txn. We could experiment by patching the drivers to a special build that doesn't ever return driver.ErrBadConn.
Closing as part of Epic CRDB-10458.
Describe the problem
As discovered by @rafiss & @otan, draining SQL connections, as is done during CRDB graceful updates, leads to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB.
Our understanding of the mechanics is as follows:
1. First, the drain_wait period (default 0s): new connections can no longer be established, but existing connections motor on. /healthz returns unavailable, so LBs can mark the node as unavailable.
2. Then the query_wait period (default 10s): CRDB attempts to close connections that go from non-idle to idle during query execution.
3. After query_wait, CRDB closes open SQL connections, idle or not (see cockroach/pkg/sql/pgwire/conn.go, lines 495 to 505 at bb0593d).
Note there are some mitigating factors, which can sort of be understood as a way of communicating to the client that a conn is not to be used, but there are races and so some availability blip can still happen:
- Hikari: if more than aliveBypassWindowMs has passed since the last health check of a conn, a health check is run. This health check will fail if the conn is already closed; then there is no availability blip. The issue is that there is a race: the health check may pass, and after that CRDB may close the conn. Then there is an availability blip.
- database/sql with the pgx driver: not sure if there is a blackbox health check of the conn before use when grabbing an idle, already-used conn, as with Hikari. OTOH, see #67071 (comment) for details on how, if errors happen before sending over the network (presumably this is what happens if the conn is gracefully closed), retries are made by database/sql in a way that is transparent to the client. But there is also a race, which is sort of discussed in golang/go#11978 ("database/sql: retry logic is unsafe"). If the conn is closed after sending on the socket, it's not possible to retry. Thanks, @jeffswenson, for posting this issue.

In summary, this failure mode is prob present in all or almost all conn pool setups, but the magnitude of the unavailability blip prob depends on conn pool details & workload details.
Without additional changes to CRDB (which are discussed in the doc linked below), users need to set drain_wait to roughly an estimate of max txn time plus an estimate of how long it takes to remove nodes from the LB's list of healthy nodes, and keep conn pool connection lifetimes short enough that by the time drain_wait seconds pass, the client has closed all connections to the draining node itself. This means a mix of longer drain time & shorter connection lifetimes; the latter means higher CPU usage, esp. if you do password authentication; the former has disadvantages, including greater time in a mixed-version state.

There is a longer-form doc about this from @otan and @rafiss: https://docs.google.com/document/d/1oOQ4pnTxEdUPA69qlK5NLCk0C-RldIJeXdlcY1HE3nY/edit#
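As a concrete illustration of the client-side half of that tuning (the values below are made up, not recommendations), a `database/sql` pool can be configured so its connections age out well within the server's drain window:

```go
// Illustrative pool tuning: keep pooled connections short-lived relative to
// drain_wait so that, once the LB stops routing to a draining node, the pool
// ages out its existing connections before CRDB force-closes them.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgresql://root@lb.example.com:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Example only: if drain_wait were raised to ~60s, a max lifetime well
	// under that (minus LB health-check lag and max txn time) gives conns a
	// chance to be recycled before the draining node starts closing them.
	db.SetConnMaxLifetime(30 * time.Second)
	db.SetConnMaxIdleTime(10 * time.Second)
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(20)
}
```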
To Reproduce
From @rafiss re: Hikari: https://docs.google.com/document/d/1b7ENgM36xYepaQ-PPG5PESl133m7YcPPatayRGGuAkM/edit
Expected behavior
CRDB markets itself as having a zero-downtime upgrade process & this ticket points out that to achieve this goal, connection pool settings + CRDB settings need to be set carefully, and also in a way that has definite downsides.
Additional context
This is affecting a CC production customer.
gz#8424
Epic CRDB-10458
Jira issue: CRDB-8347