-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: replicas do not acquire a lease and quiesce after restart #83444
Comments
cc @cockroachdb/replication |
We discussed this in the KV weekly meeting and decided that we should first understand how it is that a fraction of the ranges are acquiring leases but the other fraction aren't. My intuition here is that its one of the queues deciding to process some replicas but not the others. In particular, I think its something like the consistency checker queue that only processes ranges once every 24 hours, so it won't reprocess the fraction of replicas that have already been checked in the last 24 hours. @kvoli one way to validate or invalidate that theory would be to disable the consistency checker queue in your reproduction and see if none or very few of the ranges end up getting leases. If so, we'll know this is all copacetic and maybe we can consider one of the approaches we discussed in the team meeting. |
Further InvestigationsRunning some more testing. With the consistency queue enabled and disabled, we see that the initial rise of leaseholders and quiescence is not in large part contributed to by the consistency queue. However after 70 minutes this ceases, and the cluster with the consistency queue enabled continues to add leaseholders at a slow pace whilst the cluster without adds 0. yellow = rate(raft log queue successes) With Consistency Queue Without Consistency Queue The initial increase, before stalling is correlated to the raft log queue processing successes. The cluster with the consistency queue enabled has a slightly higher rate of leases added, which can be attributed to the queue processing. When looking at the logs, the last SQL stats job completed at 14:39, which corresponds to when the rate of lease acquisition went to 0 for the cluster with no consistency queue.
It seems apparent then that only the SQL stats job and consistency queue are forcing lease acquisition and allowing replicas to quiesce, when there is no foreground activity on those ranges. ProblemSo from the above, it appears that cockroach has no mechanism specifically to quiesce replicas following a restart. Rather, we rely on foreground traffic, sql stats collection and queue processing to force lease acquisition, which enables quiescing when there's no activity following. We should instrument a method for ranges to quiesce after a restart that is independent and targets this specific problem. @aayushshah15 had two possible solutions, at a high level: forcing the lease acquisition on the raft leader, to enable quiescing or allow quiescing without needing a leaseholder for the range. |
I recall in the distant past that the replicate queue acquired the lease (if a range didn't hold the lease) in order to determine whether rebalancing/up-replication was required. Does that no longer take place? |
It will when processing, however it requires them to have a computed action for that range (store imbalance, decom, upreplication etc)
In the process loop it still does try and acquire a lease. refs Acquire a lease if possible when processing a replica: cockroach/pkg/kv/kvserver/replicate_queue.go Line 532 in c13a702
Manually enqueuing seems to also try and acquire if possible: cockroach/pkg/kv/kvserver/store.go Line 3376 in c13a702
|
This patch adds a synchronous call to acquire a lease when a replica is being checked for enqueing into the replication queue, if it satisfies necessary conditions. The necessary conditions are that it is currently the raft leader and that the lease status is expired. This ensures that following a node restart, a replicas with a valid lease will be instated within the replica scanner interval. resolves cockroachdb#83444 Release note: None
This patch adds a check that will enqueue replicas that do not have a valid lease into the replicate queue. The replicate and base queue already will attempt to acquire the lease when processing these replicas. This ensures that following a node restart, a replica with a valid lease will be installed within the replica scanner interval, for all ranges. resolves cockroachdb#83444 Release note: None
83597: Colocate auth logging with auth metric for consistency r=rafiss a=ecwall refs #83224 Release note (bug fix): Move connection OK log and metric to same location after auth completes for consistency. This resolves an inconsistency (see linked isssue) in the DB console where the log and metric did not match. 83731: kvserver: acquire replica lease on queue check r=nvanbenschoten a=kvoli This patch adds a check within the replication for when a replica is the raft leader and does not have a valid lease. The necessary conditions are that it is currently the raft leader and that the lease status is expired. This ensures that following a node restart, a replicas with a valid lease will be installed within the replica scanner interval. **single nodes 10k ranges with change** ![image](https://user-images.githubusercontent.com/39606633/176971656-317c38d3-7103-47a0-a18a-d9f29c49baa5.png) **5 node, 3k ranges** *without change* ![image](https://user-images.githubusercontent.com/39606633/177620933-56cfe528-c45c-429f-a4d9-9d3ba90fe8e1.png) *with change* ![image](https://user-images.githubusercontent.com/39606633/177621186-ee467043-47d5-4279-bb69-5478e7ad445a.png) resolves #83444 Release note: None 84044: ui: option to search exact statement on SQL Activity r=maryliag a=maryliag Previously, when doing a search on SQL Activity page, it was returning all statements that contained all terms from the search, but not necessarily on the same order. This commit adds an option when you wrap the search in quotes it will only return results with the exact match in order. https://www.loom.com/share/442c6eaee84b4c71a1acdef0b63b74bf Release note (ui change): Ability to search for the exact terms in order when wrapping the search in quotes. 84047: sql: remove unused error return value in a method of connExecutor r=yuzefovich a=yuzefovich Found while looking into #83935. Release note: None 84082: roachtest: skip multitenant/fairness r=cucaroach a=cucaroach Informs: #83994 Release note: None 84085: roachtest: fix zipping of artifacts to include other zips r=srosenberg a=renatolabs When artifacts are zipped in preparation for being published to TeamCity, other zip files are skipped. The idea is that we won't try to recursively zip artifacts.zip itself, or debug.zip, which is published separately. However, some tests (notably, `tpchvec`) download their own zip files in the `logs` directory so that they'll be available for analysis when a test fails. While there was an intention to skip only top-level zip files (as indicated by existing comments), the code itself would skip any zip files found in the artifacts directory. This commit updates the zipping logic to skip only toplevel zip files, allowing tests to write their own zip files to the `logs` directory and have them available for inspection later. Release note: None. Co-authored-by: Evan Wall <[email protected]> Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Marylia Gutierrez <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: Renato Costa <[email protected]>
This patch adds a check that will enqueue replicas that do not have a valid lease into the replicate queue. The replicate and base queue already will attempt to acquire the lease when processing these replicas. This ensures that following a node restart, a replica with a valid lease will be installed within the replica scanner interval, for all ranges. resolves cockroachdb#83444 Release note: None
This patch adds a check that will enqueue replicas that do not have a valid lease into the replicate queue. The replicate and base queue already will attempt to acquire the lease when processing these replicas. This ensures that following a node restart, a replica with a valid lease will be installed within the replica scanner interval, for all ranges. resolves cockroachdb#83444 Release note: None
This patch adds a check that will enqueue replicas that do not have a valid lease into the replicate queue. The replicate and base queue already will attempt to acquire the lease when processing these replicas. This ensures that following a node restart, a replica with a valid lease will be installed within the replica scanner interval, for all ranges. resolves cockroachdb#83444 Release note: None
Describe the problem
After restarting, the majority of replicas do not acquire a lease and remain ticking rather than quiescent. This is despite no foreground traffic being present that would tick the ranges.
The number of leaseholders still steadily increases, however at a rate of 0.02 per second, which in a cluster with a large number of ranges will take multiple days.
To Reproduce
Expected behavior
Ranges should quiesce after a restart, meaning they acquire their lease.
Environment:
Additional context
50-70% Idle CPU usage, mostly spent in ticking ranges.
Jira issue: CRDB-17073
Jira issue: CRDB-19917
The text was updated successfully, but these errors were encountered: