server,sql: complement query_wait with conn_wait to wait until clients/pool closes connections #66319

lancel66 · 2021-06-10T14:19:29Z

Customer needs a truly hitless way of doing upgrades. Current behavior is to:

1. Push new binary to temp
2. Drain node
3. Install/replace binary
4. Restart crdb

Drain 1) stops accepting client connections, then 2) stops extant connections. The problem is that stopping existing connections causes issues with workloads when they are not retried.

They would like a hitless way to do rolling in-place upgrades.

One thought is to have a flag to tell drain to wait out connections for a definable period of time, rather than closing connections.

cc @rafiss

gz#8424

Epic CRDB-10458

Jira issue: CRDB-7990

blathers-crl · 2021-06-10T14:19:31Z

Hello, I am Blathers. I am here to help you get the issue triaged.

It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.

I was unable to automatically find someone to ping.

If we have not gotten back to your issue within a few business days, you can try the following:

Join our community slack channel and ask on #cockroachdb.
Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

rafiss · 2021-06-10T15:45:29Z

I'd like to understand a bit more.

> One thought is to have a flag to tell drain to wait out connections for a definable period of time

Is the concern that any queries that are currently running are getting stopped? Have they already tried using the server.shutdown.query_wait cluster setting?

> rather than closing connections.

At some point, the connection does need to get closed. I don't see a way around that. Is the customer connecting using a connection pool and load balancer? If so, then two points:

1. The load balancer should be checking the health of the CRDB nodes. It should be doing this by checking health on the /health?ready=1 endpoint. See more docs. If the load balancer is already checking that endpoint, but still is not removing CRDB nodes that are shutting down in a timely fashion, then they may need to increase the server.shutdown.drain_wait setting.
2. The connection pool should be checking the health of a connection before trying to use it. If a CRDB node is in the process of draining, then the connection pool should see that, and prevent the application from using the connection. The connection pool will then try again and give the application a connection that is healthy. See more docs, though it would be most helpful for them to review the docs of the connection pool they are using.
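
As a rough illustration of the first point, a readiness probe against `/health?ready=1` might look like the sketch below. It is not taken from any particular load balancer. The function names, the 2-second timeout, and the rule that only an HTTP 200 counts as "ready" are assumptions made for illustration.

```python
import urllib.request


def node_is_ready(base_url, timeout=2.0, urlopen=urllib.request.urlopen):
    """Return True only if GET <base_url>/health?ready=1 answers HTTP 200.

    `urlopen` is injectable so the probe can be exercised without a network.
    """
    url = base_url.rstrip("/") + "/health?ready=1"
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # urllib raises URLError/HTTPError (both OSError subclasses) for
        # non-2xx responses, refused connections, and timeouts. This sketch
        # treats all of them as "not ready" (an assumption, not CRDB policy).
        return False


def healthy_backends(base_urls, **probe_kwargs):
    """Keep only the backends whose readiness probe currently succeeds."""
    return [u for u in base_urls if node_is_ready(u, **probe_kwargs)]
```

A load balancer built on this idea would call something like `healthy_backends(["http://crdb-1:8080", "http://crdb-2:8080"])` on every check interval (the host names and port are hypothetical) and stop routing new connections to any node that drops out of the list. `server.shutdown.drain_wait` would then need to be at least as long as the time the balancer takes to notice.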

@lancel66 Based on all the above, if there are errors happening during upgrades, it sounds more like a bug report than a feature request. I recommend filing this through normal support channels. If you do so, it will also be very helpful to include more specific error messages and details on their load balancer, connection pool, and application.

rafiss · 2021-06-23T20:23:49Z

I've thought about this a bit more:

It seems like the concern is with the fact that cockroach closes connections on the server-side during the shutdown process. Some other systems don’t do this — they require all the clients to close the connections and only then will the server process let itself terminate.

The way crdb closes connections on the server-side could be a problem, since the client could be in the middle of using the connection when crdb decides to close it. (Perhaps even more common when a connection pool is in the mix, as we've seen with other scenarios; see internal tickets https://github.com/cockroachlabs/support/issues/984 and https://cockroachlabs.atlassian.net/browse/CC-2756)

It seems like we might want to offer one more knob similar to server.shutdown.query_wait — except instead of waiting for queries to finish, it should wait for connections to be terminated on the client side. It could be named something like server.shutdown.connection_wait.

So the order of waiting would be:

1. drain_wait: wait for load balancer to notice that instances are shutting down
2. connection_wait: wait for clients to close connections (could be set to "infinity" to satisfy the request in this ticket)
3. query_wait: wait longer in case there are still queries in flight
4. lease_transfer_wait: wait for leases to move to new nodes now that all queries are done
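
To make the proposed ordering concrete, here is a small model that lays the four waits end to end and reports when each would start and finish if every one ran to its limit. It is only a sketch: the `Phase` type, the function name, and the example durations are hypothetical, and it assumes each wait is capped by a single timeout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    max_seconds: float


# Order proposed in this comment; the durations are made-up example values.
PROPOSED_ORDER = [
    Phase("drain_wait", 10),
    Phase("connection_wait", 40),
    Phase("query_wait", 10),
    Phase("lease_transfer_wait", 5),
]


def worst_case_schedule(phases):
    """Return (name, start, end) tuples assuming every phase uses its full timeout."""
    schedule = []
    start = 0.0
    for phase in phases:
        end = start + phase.max_seconds
        schedule.append((phase.name, start, end))
        start = end
    return schedule


def worst_case_total(phases):
    """Upper bound on drain time, useful when sizing an orchestrator's grace period."""
    return sum(p.max_seconds for p in phases)
```

With the example values above, `worst_case_total(PROPOSED_ORDER)` is 65 seconds, so whatever stops the process (an init system or orchestrator, for instance) would need a grace period longer than that to avoid cutting the drain short.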

Also, instead of a cluster setting, this might work as a flag to the drain command, as the initial feature request points out.

Currently, the draining process consists of three consecutive periods:

- `drain_wait`: The `/health?ready=1` endpoint starts to show that the node is shutting down. New SQL connections and new queries are allowed. The server does a hard wait until the timeout.
- `query_wait`: New SQL connections are not allowed. SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. Early exit if all queries are finished.
- `lease_transfer_wait`: Wait to transfer range leases. The server does a hard wait until the timeout.

This commit reorganizes the draining process by adding a `connection_wait` period, and slightly modifying the existing ones:

- `drain_wait`: The `/health?ready=1` endpoint starts to show that the node is shutting down. New SQL connections and new queries are allowed. The server does a hard wait until the timeout.
- `connection_wait`: Wait until all SQL connections are closed or the timeout is reached. New SQL connections are not allowed. Existing SQL connections and new queries are allowed. We do an early exit if all SQL connections are closed by the user.
- `query_wait`: SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. Early exit if all queries are finished.
- `lease_transfer_wait`: Wait to transfer range leases. The server does a hard wait until the timeout.

The duration of `connection_wait` can be set similarly to the other 3 variables:

```
SET CLUSTER SETTING server.shutdown.connection_wait = '40s'
```

Resolves cockroachdb#66319

Release note: TBD
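
The "wait until all SQL connections are closed or timeout, with early exit" behavior of `connection_wait` can be modeled with a simple polling loop. The sketch below is not CockroachDB's implementation (which is written in Go). The function name, the polling interval, and the injectable `clock`/`sleep` parameters are assumptions for illustration.

```python
import time


def wait_for_connections(count_open, timeout_s, poll_s=0.1,
                         clock=time.monotonic, sleep=time.sleep):
    """Wait until count_open() reports zero or timeout_s elapses.

    Returns True on early exit (every client closed its connection) and
    False if the timeout was reached with connections still open, in which
    case a later phase would have to close them from the server side.
    """
    deadline = clock() + timeout_s
    while True:
        if count_open() == 0:
            return True  # early exit: nothing left to wait for
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # timed out with connections still open
        # Never sleep past the deadline.
        sleep(min(poll_s, remaining))
```

The key design point is that the timeout is only an upper bound: a well-behaved pool that closes its connections promptly lets the drain move on immediately, while a pool that never closes them delays shutdown by at most `timeout_s`.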

This commit adds a phase to the current draining process. In this phase, the server waits for SQL connections to be closed, and new SQL connections are not allowed. Once all SQL connections are closed, the server proceeds to draining the range leases.

The maximum duration of this phase is determined by the cluster setting `server.shutdown.connection_wait`. The duration can be set similarly to the other 3 existing draining phases:

```
SET CLUSTER SETTING server.shutdown.connection_wait = '40s'
```

Resolves cockroachdb#66319

Release note (ops change): add `server.shutdown.connection_wait` to the draining process configuration.

This commit adds a phase to the current draining process. In this phase, the server waits for SQL connections to be closed, and new SQL connections are not allowed. Once all SQL connections are closed, the server proceeds to draining the range leases.

The maximum duration of this phase is determined by the cluster setting `server.shutdown.connection_wait`. The duration can be set similarly to the other 3 existing draining phases:

```
SET CLUSTER SETTING server.shutdown.connection_wait = '40s'
```

Resolves cockroachdb#66319

Release note (ops change): add `server.shutdown.connection_wait` to the draining process configuration.

Release justification: This new cluster setting `server.shutdown.connection_wait` enables users to set the maximum waiting period for SQL connections to close during draining. This provides a workaround for customers who encountered intermittent blips and failed requests while performing operations that involve restarting nodes. The default draining process is unchanged.

This commit adds a phase to the current draining process. In this phase, the server waits for SQL connections to be closed, and new SQL connections are not allowed. Once all SQL connections are closed, the server proceeds to draining the range leases.

The maximum duration of this phase is determined by the cluster setting `server.shutdown.connection_wait`. The duration can be set similarly to the other 3 existing draining phases:

```
SET CLUSTER SETTING server.shutdown.connection_wait = '40s'
```

Resolves cockroachdb#66319

Release note (ops change): add `server.shutdown.connection_wait` to the draining process configuration. This provides a workaround for customers who encountered intermittent blips and failed requests while performing operations that involve restarting nodes.

Release justification: Low risk, high benefit changes to existing functionality (optimize the node draining process).

72991: server,sql: implement connection_wait for graceful draining r=ZhouXing19 a=ZhouXing19

Currently, the draining process consists of three consecutive phases:

1. Server enters the "unready" state: The `/health?ready=1` http endpoint starts to show that the node is shutting down, but new SQL connections and new queries are still allowed. The server does a hard wait until the timeout. This phase's duration is set with the cluster setting `server.shutdown.drain_wait`.
2. Drain SQL connections: New SQL connections are not allowed. SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. Early exit if all queries are finished. This phase's maximum duration is set with the cluster setting `server.shutdown.query_wait`.
3. Drain range lease: the server keeps retrying forever until all range leases on this draining node have been transferred. Each retry iteration's duration is specified by the cluster setting `server.shutdown.lease_transfer_timeout`.

This commit reorganizes the draining process by adding a phase where the server waits for SQL connections to be closed. If all SQL connections are closed before the timeout, the server proceeds to the next draining phase right away.

The newly proposed draining process is:

1. (unchanged) Server enters the "unready" state: The `/health?ready=1` http endpoint starts to show that the node is shutting down, but new SQL connections and new queries are still allowed. The server does a hard wait until the timeout. This phase's duration is set with the cluster setting `server.shutdown.drain_wait`.
2. (new phase) Wait for SQL connections to be closed: New SQL connections are not allowed now. The server waits for the remaining SQL connections to be closed or for the timeout. Once all SQL connections are closed, the draining proceeds to the next phase. The maximum duration of this phase is determined by the cluster setting `server.shutdown.connection_wait`.
3. (unchanged) Drain SQL connections: New SQL connections are not allowed. SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. Early exit if all queries are finished. This phase's maximum duration is set with the cluster setting `server.shutdown.query_wait`.
4. (unchanged) Drain range lease: the server keeps retrying forever until all range leases on this draining node have been transferred. Each retry iteration's duration is specified by the cluster setting `server.shutdown.lease_transfer_timeout`.

The duration of the new phase ("Wait for SQL connections to close") can be set similarly to the other 3 existing draining phases:

```
SET CLUSTER SETTING server.shutdown.connection_wait = '40s'
```

Resolves #66319

Release note (ops change): add `server.shutdown.connection_wait` to the draining process configuration. This provides a workaround for customers who encountered intermittent blips and failed requests while performing operations that involve restarting nodes.

Release justification: Low risk, high benefit changes to existing functionality (optimize the node draining process).

76430: [CRDB-9550] kv: adjust number of voters needed calculation when determining replication status r=Santamaura a=Santamaura

Currently, when a range has non-voting replicas and it is queried through replication stats, it will be reported as underreplicated. This is because, when a zone is configured to have non-voting replicas, the over/under replicated counts compare the number of current voters to the total number of replicas, which is erroneous. Instead, we will compare the current number of voters to the total number of voters if voters has been set, and otherwise will defer to the total number of replicas. This patch ignores the desired non-voters count for the purposes of this report, for better or worse.

Resolves #69335.

Release justification: low risk bug fix

Release note (bug fix): use total number of voters if set when determining replication status

Before change: ![Screen Shot 2022-02-11 at 10 03 57 AM](https://user-images.githubusercontent.com/17861665/153615571-85163409-5bac-40f4-9669-20dce77185cf.png)

After change: ![Screen Shot 2022-02-11 at 9 53 04 AM](https://user-images.githubusercontent.com/17861665/153615316-785b156b-bd23-4cfa-a76d-7c9fa47fbf1e.png)

77315: backupccl: backup correctly tries reading in from base directory if l… r=DarrylWong a=DarrylWong

…atest/checkpoint files aren't found

Before, we only tried reading from the base directory if we caught an ErrFileDoesNotExist error. However, this does not account for the potential error thrown when the progress/latest directories don't exist. This changes it so we now correctly retry reading from the base directory. We also put the latest directory inside of a metadata directory, in order to avoid any potential conflicts with there being a latest file and latest directory in the same base directory.

Also wraps errors in findLatestFile and readLatestCheckpointFile for more clarity when both base and latest/progress directories fail to read.

Fixes #77312

Release justification: Low risk bug fix

Release note: none

77406: backupccl: test ignore ProtectionPolicy for exclude_data_from_backup r=dt a=adityamaru

This change adds an end-to-end test to ensure that a table excluded from backup will not hold up GC on its replica even in the presence of a protected timestamp record covering the replica.

From a user's point of view, this allows them to mark a table whose row data will be excluded from backup, and to set that table's gc.ttl to a very low value. Backups that write PTS records will no longer hold up GC on such low GC TTL tables.

Fixes: #73536

Release note: None

Release justification: low risk update to new functionality

77450: ui: add selected period as part of cached key r=maryliag a=maryliag

Previously, the fingerprint id and the app names were used as a key for a statement details cache. This commit adds the start and end time (when existing) to the key, so the details are correctly assigned to the selected period.

This commit also rounds the selected period to the hour, since that is what is used on the persisted statistics, with the start value keeping the hour and the end value adding one hour, for example:

start: 17:45:23 -> 17:00:00
end: 20:14:32 -> 21:00:00

Partially addresses #72129

Release note: None

Release Justification: Low risk, high benefit change

77597: kv: Add `created` column to `active_range_feeds` table. r=miretskiy a=miretskiy

Add `created` column to `active_range_feeds` table. This column is initialized to the time when the partial range feed was created. This allows us to determine, among other things, whether or not the rangefeed is currently performing a catchup scan (i.e. its resolved column is 0), and how long the scan has been running for.

Release Notes (enterprise): Add created time column to `crdb_internal.active_range_feeds` virtual table to improve observability and debuggability of the rangefeed system.

Fixes #77581

Release Justification: Low impact observability/debuggability improvement.

Co-authored-by: Jane Xing
Co-authored-by: Santamaura
Co-authored-by: Darryl
Co-authored-by: Aditya Maru
Co-authored-by: Marylia Gutierrez
Co-authored-by: Yevgeniy Miretskiy

lancel66 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jun 10, 2021

blathers-crl bot added O-community Originated from the community X-blathers-untriaged blathers was unable to find an owner labels Jun 10, 2021

rafiss removed the X-blathers-untriaged blathers was unable to find an owner label Jun 11, 2021

jlinder added T-server-and-security DB Server & Security T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jun 16, 2021

lunevalex mentioned this issue Jun 24, 2021

server: restart a large Cockroach cluster with no impact to foreground traffic #66848

Open

rafiss mentioned this issue Jun 30, 2021

Draining SQL connections, as is done during CRDB graceful updates, can lead to short blips in availability when using common conn pool setups, due to the closing of SQL connections by CRDB #67071

Closed

knz changed the title ~~"Hitless" upgrade feature request~~ server,sql: complement query_wait with conn_wait to wait until clients/pool closes connections Jul 29, 2021

knz added the A-cc-enablement Pertains to current CC production issues or short-term projects label Jul 29, 2021

exalate-issue-sync bot removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Oct 6, 2021

exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-server-and-security DB Server & Security labels Oct 26, 2021

exalate-issue-sync bot assigned vy-ton Oct 26, 2021

rafiss assigned ZhouXing19 and unassigned vy-ton Nov 11, 2021

knz mentioned this issue Nov 18, 2021

server: decommission->decommissioned transition causes abrupt loss of service, need to introduce final wait #72754

Open

knz mentioned this issue Jan 5, 2022

server: equip the mt-start-sql server code with a drain process #74412

Closed

ZhouXing19 mentioned this issue Jan 20, 2022

server,sql: implement connection_wait for graceful draining #72991

Merged

craig bot closed this as completed in 345bb31 Mar 10, 2022
