72991: server,sql: implement connection_wait for graceful draining r=ZhouXing19 a=ZhouXing19

Currently, the draining process consists of three consecutive phases:

1. Server enters the "unready" state: The `/health?ready=1` HTTP endpoint starts to show that the node is shutting down, but new SQL connections and new queries are still allowed. The server does a hard wait until the timeout. This phase's duration is set with the cluster setting `server.shutdown.drain_wait`.

2. Drain SQL connections: New SQL connections are not allowed. SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. The phase exits early if all queries finish. This phase's maximum duration is set with the cluster setting `server.shutdown.query_wait`.

3. Drain range leases: The server keeps retrying forever until all range leases on this draining node have been transferred. Each retry iteration's duration is specified by the cluster setting `server.shutdown.lease_transfer_timeout`.

This commit reorganizes the draining process by adding a phase in which the server waits for SQL connections to be closed. If all SQL connections are closed before the timeout, the server proceeds to the next draining phase early.

The newly proposed draining process is:

1. (unchanged) Server enters the "unready" state: The `/health?ready=1` HTTP endpoint starts to show that the node is shutting down, but new SQL connections and new queries are still allowed. The server does a hard wait until the timeout. This phase's duration is set with the cluster setting `server.shutdown.drain_wait`.

2. (new phase) Wait for SQL connections to be closed: New SQL connections are not allowed. The server waits for the remaining SQL connections to be closed or for the timeout to expire. Once all SQL connections are closed, draining proceeds to the next phase.
The maximum duration of this phase is determined by the cluster setting `server.shutdown.connection_wait`.

3. (unchanged) Drain SQL connections: New SQL connections are not allowed. SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. The phase exits early if all queries finish. This phase's maximum duration is set with the cluster setting `server.shutdown.query_wait`.

4. (unchanged) Drain range leases: The server keeps retrying forever until all range leases on this draining node have been transferred. Each retry iteration's duration is specified by the cluster setting `server.shutdown.lease_transfer_timeout`.

The duration of the new phase ("Wait for SQL connections to be closed") can be set in the same way as the other three draining phases:

```
SET CLUSTER SETTING server.shutdown.connection_wait = '40s'
```

Resolves #66319

Release note (ops change): add `server.shutdown.connection_wait` to the draining process configuration. This provides a workaround for customers who encounter intermittent blips and failed requests when performing operations that involve restarting nodes.

Release justification: Low risk, high benefit changes to existing functionality (optimize the node draining process).

76430: [CRDB-9550] kv: adjust number of voters needed calculation when determining replication status r=Santamaura a=Santamaura

Currently, when a range has non-voting replicas and is queried through replication stats, it is reported as underreplicated. This is because, when a zone is configured to have non-voting replicas, the over/under-replicated counts compare the number of current voters to the total number of replicas, which is erroneous. Instead, we now compare the current number of voters to the total number of voters if voters has been set, and otherwise fall back to the total number of replicas.
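The comparison described above can be sketched as follows. This is a minimal illustration, not the code from the patch: `zoneConfig`, `votersNeeded`, and `replicationStatus` are hypothetical names, and treating `NumVoters == 0` as "voters not set" is an assumption.

```go
package main

import "fmt"

// zoneConfig is a hypothetical, simplified stand-in for a zone configuration.
// NumVoters == 0 is assumed to mean "voters has not been set".
type zoneConfig struct {
	NumReplicas int
	NumVoters   int
}

// votersNeeded returns the count that current voters are compared against:
// the configured voter count if set, otherwise the total replica count.
func votersNeeded(z zoneConfig) int {
	if z.NumVoters > 0 {
		return z.NumVoters
	}
	return z.NumReplicas
}

// replicationStatus classifies a range by its current number of voters.
func replicationStatus(z zoneConfig, currentVoters int) string {
	needed := votersNeeded(z)
	switch {
	case currentVoters < needed:
		return "under-replicated"
	case currentVoters > needed:
		return "over-replicated"
	default:
		return "ok"
	}
}

func main() {
	// 5 replicas, 3 of them voters: 3 current voters is the desired state.
	fmt.Println(replicationStatus(zoneConfig{NumReplicas: 5, NumVoters: 3}, 3))
	// Voters not set: fall back to the replica count, so 3 voters is too few.
	fmt.Println(replicationStatus(zoneConfig{NumReplicas: 5}, 3))
}
```

Under the old comparison, the first case would have measured 3 voters against 5 replicas and reported the range as underreplicated.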
This patch ignores the desired non-voters count for the purposes of this report, for better or worse.

Resolves #69335.

Release justification: low risk bug fix

Release note (bug fix): use total number of voters if set when determining replication status

Before change:
![Screen Shot 2022-02-11 at 10 03 57 AM](https://user-images.githubusercontent.com/17861665/153615571-85163409-5bac-40f4-9669-20dce77185cf.png)

After change:
![Screen Shot 2022-02-11 at 9 53 04 AM](https://user-images.githubusercontent.com/17861665/153615316-785b156b-bd23-4cfa-a76d-7c9fa47fbf1e.png)

77315: backupccl: backup correctly tries reading in from base directory if l… r=DarrylWong a=DarrylWong

…atest/checkpoint files aren't found

Before, we only tried reading from the base directory if we caught an ErrFileDoesNotExist error. However, this does not account for the potential error thrown when the progress/latest directories don't exist. With this change, we now correctly retry reading from the base directory.

We also put the latest directory inside of a metadata directory, to avoid any potential conflict between a latest file and a latest directory in the same base directory.

Also wraps errors in findLatestFile and readLatestCheckpointFile for more clarity when both the base and latest/progress directories fail to read.

Fixes #77312

Release justification: Low risk bug fix

Release note: none

77406: backupccl: test ignore ProtectionPolicy for exclude_data_from_backup r=dt a=adityamaru

This change adds an end-to-end test to ensure that a table excluded from backup will not hold up GC on its replica, even in the presence of a protected timestamp record covering the replica.

From a user's point of view, this allows them to mark a table whose row data will be excluded from backup, and to set that table's gc.ttl to a very low value. Backups that write PTS records will no longer hold up GC on such low GC TTL tables.
Fixes: #73536

Release note: None

Release justification: low risk update to new functionality

77450: ui: add selected period as part of cached key r=maryliag a=maryliag

Previously, the fingerprint id and the app names were used as the key for the statement details cache. This commit adds the start and end time (when present) to the key, so the details are correctly assigned to the selected period.

This commit also rounds the selected period to the hour, since that is the granularity used by the persisted statistics. The start value keeps its hour and the end value moves to the following hour, for example:
start: 17:45:23 -> 17:00:00
end: 20:14:32 -> 21:00:00

Partially addresses #72129

Release note: None

Release Justification: Low risk, high benefit change

77597: kv: Add `created` column to `active_range_feeds` table. r=miretskiy a=miretskiy

Add `created` column to `active_range_feeds` table. This column is initialized to the time when the partial range feed was created. This allows us to determine, among other things, whether or not the rangefeed is currently performing a catchup scan (i.e. its resolved column is 0), and how long the scan has been running.

Release Notes (enterprise): Add created time column to `crdb_internal.active_range_feeds` virtual table to improve observability and debuggability of the rangefeed system.

Fixes #77581

Release Justification: Low impact observability/debuggability improvement.

Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Darryl <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>