kv,server: improve observability into "stuck" decommissioning process #74158
Comments
When @cameronnunez and I investigated this, we discovered that the node being decommissioned is not the node responsible for making placement decisions, so increasing that node's logging verbosity to the maximum doesn't actually help. The logging output that's needed here has to come from some other node in the cluster. I'm unsure whether logging is really the instrument that's going to help here.
Agreed, it's also desirable to not have to stare at logs to see what's stuck and why -- I just wanted to record it for posterity. I'd like the outcome of this issue to be a structured output (powered by a pull-based API) that indicates why each stuck replica is stuck, even if in the short term we improve only logging for debuggability. For as frequent an operation as decommissioning, it should effectively be self-service.
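For concreteness, a minimal sketch of what such a structured, pull-based payload could look like. Every type and field name here is hypothetical, not an existing CockroachDB API:

```go
package main

import "fmt"

// StuckReplica is a hypothetical per-replica status entry that a
// decommission-status API could return instead of a bare replica count.
type StuckReplica struct {
	RangeID           int64
	LeaseholderNodeID int32
	// Reason is a structured explanation for why the replica hasn't moved,
	// e.g. a zone config constraint that leaves no valid target.
	Reason string
}

// DecommissionStatus is the hypothetical pull-based payload the CLI would
// poll and render while `cockroach node decommission` runs.
type DecommissionStatus struct {
	NodeID            int32
	ReplicasRemaining int
	Stuck             []StuckReplica
}

func main() {
	s := DecommissionStatus{
		NodeID:            4,
		ReplicasRemaining: 2,
		Stuck: []StuckReplica{
			{RangeID: 17, LeaseholderNodeID: 2, Reason: "zone config constraint leaves no valid target"},
			{RangeID: 42, LeaseholderNodeID: 3, Reason: "replicate queue has not processed the range yet"},
		},
	}
	for _, r := range s.Stuck {
		fmt.Printf("r%d (leaseholder n%d): %s\n", r.RangeID, r.LeaseholderNodeID, r.Reason)
	}
}
```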
cc @mwang1026 for tracking. The Jira ticket is CRDB-11945.
This ate up a bunch of SRE time today as well. Root cause in the end was a zone config. Big +1.
I wonder if we should "just" use tracing and perhaps even jobs here. The idea being that the decommissioning process becomes a job and inherits the observability properties that come with that. (I think jobs also support getting traced; I have only found this, but I believe there was more, including fetching in-flight traces for the job.)
I like that idea. Cloud will eventually invest in tracing infra. The more we use tracing, the more that investment pays off.
I think you're looking for this: `cockroach/pkg/sql/crdb_internal.go`, lines 1246 to 1276 in 299abc8.
I know about this one, but I thought there was something job-specific (probably built on top of the registry) to pull the job traces. I might be making that up though; perhaps @adityamaru knows?
Yeah, it's only enabled for jobs that opt in (line 144 in b407fdb).
For jobs that opt in, we force a real span on [...]. I'm a big +1 on getting more users to use this framework, since it will help it reach its full potential sooner. Ease of consumption of these traces, and real-world mileage on what is actually useful for debugging a stuck job, are still lacking.
In this decommissioning process we're talking about a single node, not multiple, right? Like, the relevant replication queue actions are happening on the node that's decommissioning, which makes the problem of attaching them to the right trace simpler, right?

I've been working to make tracing the answer to observing any ongoing process, as that process is running. I haven't been making a lot of noise because multiple pieces need to come together for that, but they're coming together. One such piece is a UI to show you traces of on-going operations, and the beginning of that is #74318. That page is simple at the moment, but the sky's the limit. Tracing is about to become awesome.
That's not what I've been seeing, FYI.
Oh, you mean that for ranges with leaseholders not on the decommissioning node (which is perhaps all ranges, because the decommissioning node presumably sheds its leases?), it's the replication queue on the leaseholder that needs to move the decommissioning replica away?
I could help with this part. If we find a way to plumb a [...]
…stalls

Debugging graceful drain is challenging due to limited information. It is not clear what is going wrong from the logs. This patch adds more detailed logging when graceful drain range lease transfer encounters issues. Specifically, range lease transfer attempt information (when an attempt occurs, the duration blocked on the transfer attempt, and the lease that is being transferred) is logged.

The reason behind the failure of the transfer attempt may not be logged on the draining node. This is because this information lies on the node that serves the draining request. Currently, we do not have logging that aggregates all of this information in one place. See cockroachdb#74158.

Release note (cli change): If graceful drain range lease transfer encounters issues, verbose logging is automatically set to help with troubleshooting.
68488: server: add verbose logging when graceful drain range lease transfer stalls r=knz a=cameronnunez

Informs #65659. Debugging graceful drain is challenging due to limited information. It is not clear what is going wrong from the logs. This patch adds more detailed logging when graceful drain range lease transfer encounters issues. Specifically, range lease transfer attempt information (when an attempt occurs, the duration blocked on the transfer attempt, and the lease that is being transferred) is logged. The reason behind the failure of the transfer attempt may not be logged on the draining node, because this information lies on the node that serves the draining request. Currently, we do not have logging that aggregates all of this information in one place. See #74158.

Release note (cli change): If graceful drain range lease transfer encounters issues, verbose logging is automatically set to help with troubleshooting.

75271: sql,server: create toString to role option r=maryliag a=maryliag

Previously, the code was using roleOption.String() to create a string to be used in queries. This was causing an issue with role options that contain spaces, e.g. the role option `VALID UNTIL` was being translated to `VALIDUNTIL` instead of the correct form `VALID UNTIL`. This commit updates the String() function to correctly add the space in the `VALID UNTIL` case.

Release note (bug fix): Update the String() function of roleOption to add a space in the role option `VALID UNTIL`.

75293: backupccl: reuse allocated tenant prefix in rekey r=dt a=dt

Release note: none.

75303: migrations: make sure public role gets privs on public schema r=ajwerner a=rafiss

This matches up with how the public schema gets created in a non-migration scenario.

Release note: None

Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
The idea of recording replicate queue attempts post-decommission into a system table came up internally, to help with observability. For posterity, here are a few reasons not to do that:
As a strawman for what a "pull-based explicit API" could look like: when decommissioning a node, the CLI periodically issues idempotent decommission requests in order to retrieve the decreasing replica counts on the target nodes: Lines 441 to 446 in 0efe128
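A rough sketch of that client-side loop, under the assumption of a simplified RPC surface; the interface and response shape below are stand-ins, not the real serverpb definitions:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// decommissionClient stands in for the RPC client the CLI uses. The method
// shape is an assumption for illustration, not the real serverpb API.
type decommissionClient interface {
	// Decommission idempotently (re)marks the nodes as decommissioning and
	// returns the remaining replica count per node.
	Decommission(ctx context.Context, nodeIDs []int32) (map[int32]int, error)
}

// pollDecommission keeps issuing idempotent decommission requests until the
// target nodes report zero replicas, mirroring what the CLI loop does today.
func pollDecommission(ctx context.Context, c decommissionClient, nodeIDs []int32, interval time.Duration) error {
	for {
		counts, err := c.Decommission(ctx, nodeIDs)
		if err != nil {
			return err
		}
		remaining := 0
		for nodeID, n := range counts {
			fmt.Printf("n%d: %d replicas remaining\n", nodeID, n)
			remaining += n
		}
		if remaining == 0 {
			return nil // all replicas moved; the caller can now finalize
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// fakeClient simulates a node shedding two replicas per poll.
type fakeClient struct{ remaining int }

func (f *fakeClient) Decommission(context.Context, []int32) (map[int32]int, error) {
	if f.remaining > 0 {
		f.remaining -= 2
	}
	if f.remaining < 0 {
		f.remaining = 0
	}
	return map[int32]int{4: f.remaining}, nil
}

func main() {
	_ = pollDecommission(context.Background(), &fakeClient{remaining: 6}, []int32{4}, 10*time.Millisecond)
}
```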
On the server side we're doing the trivial thing of scanning meta2 to determine what the replica count is: Lines 2206 to 2232 in 785af46
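A simplified sketch of that server-side check, using a pared-down stand-in for roachpb.RangeDescriptor; the real code reads the descriptors out of a meta2 scan:

```go
package main

import "fmt"

// RangeDescriptor is a pared-down stand-in for roachpb.RangeDescriptor as
// read out of a meta2 scan; only the fields needed here are modeled.
type RangeDescriptor struct {
	RangeID  int64
	Replicas []ReplicaDescriptor
}

type ReplicaDescriptor struct {
	NodeID  int32
	StoreID int32
}

// replicasOnNodes mirrors the server-side check: walk every range descriptor
// and count (and remember) the replicas that still live on the nodes being
// decommissioned.
func replicasOnNodes(descs []RangeDescriptor, decommissioning map[int32]bool) (int, []int64) {
	var count int
	var rangeIDs []int64
	for _, desc := range descs {
		matched := false
		for _, repl := range desc.Replicas {
			if decommissioning[repl.NodeID] {
				count++
				matched = true
			}
		}
		if matched {
			rangeIDs = append(rangeIDs, desc.RangeID)
		}
	}
	return count, rangeIDs
}

func main() {
	descs := []RangeDescriptor{
		{RangeID: 17, Replicas: []ReplicaDescriptor{{NodeID: 1}, {NodeID: 2}, {NodeID: 4}}},
		{RangeID: 42, Replicas: []ReplicaDescriptor{{NodeID: 1}, {NodeID: 2}, {NodeID: 3}}},
	}
	count, ranges := replicasOnNodes(descs, map[int32]bool{4: true})
	fmt.Println(count, ranges) // 1 [17]
}
```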
We know the "stuck" replicas, and we also know where they're stuck. If our manual mitigation is always "queue the stuck replica on the node where it's stuck", then this would be one place to do it, and to report structured errors about why it's unable to proceed back up to the CLI to render. This is just a strawman; we could probably do other things. Making decommissioning a long-lived job with its own internal logging/tracing is definitely another approach; I'm not opposed to it, it just feels more heavyweight. Given the number of escalations that boil down to just doing the enqueue-on-the-right-replicate-queue-through-the-UI-for-errors dance, this feels like low-hanging fruit to pick off.
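To illustrate the "enqueue where it's stuck and surface the error" idea, a hypothetical sketch; the enqueue hook is an interface stand-in, not the real replicate queue API, and the error text is made up:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// enqueuer abstracts "run this range through the replicate queue on the node
// where the stuck replica lives"; the real implementation would reach the
// replicate queue's processing path via an internal RPC.
type enqueuer interface {
	Enqueue(ctx context.Context, nodeID int32, rangeID int64) error
}

// reportStuck re-processes each stuck range and maps the queue's error back
// to the range, so the CLI can render a per-range reason instead of a count.
func reportStuck(ctx context.Context, e enqueuer, stuck map[int64]int32) map[int64]string {
	reasons := make(map[int64]string)
	for rangeID, nodeID := range stuck {
		if err := e.Enqueue(ctx, nodeID, rangeID); err != nil {
			reasons[rangeID] = err.Error()
		}
	}
	return reasons
}

// fakeEnqueuer simulates one range that cannot find a valid target.
type fakeEnqueuer struct{}

func (fakeEnqueuer) Enqueue(_ context.Context, _ int32, rangeID int64) error {
	if rangeID == 17 {
		return errors.New("no valid replication target satisfies the zone config")
	}
	return nil
}

func main() {
	out := reportStuck(context.Background(), fakeEnqueuer{}, map[int64]int32{17: 2, 42: 3})
	fmt.Println(out)
}
```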
I see the server team recently improved logging for the drain process: #68488. In the very short term, should we do something similar for the decommission process as well?
Seems like a good short-term change to me.
…ssioning

Fixes cockroachdb#76249. Informs cockroachdb#74158.

This patch makes it so that when a decommission stalls, the descriptions of the "stuck" replicas are printed to the operator.

Release note (cli change): If decommissioning stalls, the replicas that are failing to move are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality
This came up again in my discussions with @shralex and I'd like to re-iterate the proposal for maintaining a system table of allocator traces. I'm proposing a table that's essentially a circular buffer of traces, keeping either up to X hours of history or up to Y entries per operation -- just something to address unbounded growth if that's a concern.

Often, we've been confused about the allocator's decisions, in that we don't know why it isn't rebalancing replicas in the way we expect, or why it is rebalancing replicas in some unexpected way. Maintaining an internal system table for retrospective debugging should address these issues and lets us answer questions like "why isn't this node being fully decommissioned" by defining predicates over this system table. We can also add traces from this system table to a debug zip to avoid back-and-forth with customers.
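A sketch of what such a table's schema could look like, written roughly the way system table schemas are declared in the codebase (as a CREATE TABLE string constant); the table name, columns, and retention policy are all made up for illustration:

```go
package main

import "fmt"

// allocatorTraceTableSchema is a hypothetical schema for the proposed
// circular buffer of allocator/replicate-queue traces. Retention would be
// enforced by the writer (drop rows older than X hours or beyond Y entries
// per operation), not by the schema itself.
const allocatorTraceTableSchema = `
CREATE TABLE system.allocator_trace (
	timestamp TIMESTAMPTZ NOT NULL,
	operation STRING      NOT NULL, -- e.g. 'decommission n4', 'rebalance'
	range_id  INT8        NOT NULL,
	node_id   INT8        NOT NULL, -- node whose queue made the decision
	decision  STRING      NOT NULL, -- what the allocator chose to do and why
	trace     STRING,               -- rendered trace of the attempt
	CONSTRAINT "primary" PRIMARY KEY (operation, range_id, timestamp)
)`

func main() {
	fmt.Println(allocatorTraceTableSchema)
}
```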
76516: server: improve visibility of ranges that fail to move during decommissioning r=knz,aayushshah15 a=cameronnunez

Fixes #76249. Informs #74158. This patch makes it so that when a decommission is slow or stalls, the descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality

78433: ci: report failure to generate code as a test failure to teamcity r=rail a=rickystewart

Closes #78368. Release note: None

78495: cluster: Revert "cluster: use WaitConditionNextExit" r=otan a=rickystewart

This reverts commit 6543749. That commit was an (unsuccessful) attempt to fix #58955, and in the presence of this change the `acceptance` tests are very likely to hang forever under Ubuntu 20.04 due to a race condition where the container exits before we begin waiting on it. Release note: None

78525: dev: add `ui clean` subcommand, update `Makefile` to point to `dev` r=rail a=rickystewart

Add `ui clean` and `ui clean --all`; the former does approximately the same as `make ui-clean`, the latter does approximately the same as `make ui-maintainer-clean`. Release note: None

78561: sqlsmith: fix deadlock during generation r=rharding6373 a=rharding6373

Fixes a deadlock when sqlsmith generates join expressions. Fixes: #78555. Release note: None

78602: roachpb,kvserver: tweak ReplicaUnavailableError rendering r=erikgrinaker a=tbg

Release justification: minor UX tweak to an error message. Release note: None

78617: roachtest: wait for 3x replication in disk-full r=tbg a=nicktrav

Improve on #78456 by waiting for 3x replication, rather than 2x. Release note: None.

78629: ci: make sure we're streaming test output when `stress`ing via Bazel r=mari-crl a=rickystewart

Without this, you don't see the helpful "ran X iterations so far, Y failures" messages from `stress`. Release note: None

Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Strawman for decommission: instantiate a job per node being decommissioned, and have that job code be responsible for flipping the liveness decommissioning bit, or flipping it back if the job is paused/canceled. It would also give you a place to stash decommission progress/observability state, and a great way to fetch that data on demand: the job could enqueue replicas wherever it wants to get whatever traces it wants, and stash them (or some interesting subset) as part of the job record. Cleanup of this observability state is well defined: it gets cleared when the job itself gets cleaned up. We don't have to roll out a custom system table and worry about state management; we just have to make sure we're not creating too much GC work when persisting this state, a concern we'd have even if we did roll out our own table. It also has the benefit of appearing in our UI.

Also, the way we finalize a decommission today is maybe a bit weird? We're doing all this orchestration over in the CLI code, where we keep issuing idempotent decommission RPCs and wait for a zero replica count before issuing a final RPC to flip the final bit: Lines 530 to 559 in a21eab1
This part at least feels more reasonable to do as part of a job process, and lets us guarantee that the decommissioning process proceeds even if the client goes away. Today, if the client goes away, we'd still be upreplicating away from the decommissioning node (individual stores still react to the liveness record having the "decommissioning" marker) but we'd skip finalizing the bit, which is perhaps a paper cut. In escalations it shows up as nodes that were removed long ago (but are still in the decommissioning state) that later stall upgrades or show up in the UI as dead nodes. Some internal discussion here and here.
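A rough sketch of that job-based shape; the resumer-like interface, liveness hook, and progress sink below are made up for illustration and do not match the real pkg/jobs Resumer signatures:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// livenessWriter flips the decommissioning bit on a node's liveness record.
// In the real system this would go through the node liveness machinery.
type livenessWriter interface {
	SetDecommissioning(ctx context.Context, nodeID int32, decommissioning bool) error
}

// progressSink is where the job stashes progress and any collected traces; in
// the real system this would be the job record's progress/details.
type progressSink interface {
	Record(ctx context.Context, nodeID int32, remaining int, note string) error
}

// decommissionJob is a hypothetical per-node decommission job. Resume marks
// the node as decommissioning, waits for its replica count to reach zero, and
// then finalizes; Cancel flips the bit back.
type decommissionJob struct {
	nodeID    int32
	liveness  livenessWriter
	progress  progressSink
	remaining func(ctx context.Context, nodeID int32) (int, error)
}

func (j *decommissionJob) Resume(ctx context.Context) error {
	if err := j.liveness.SetDecommissioning(ctx, j.nodeID, true); err != nil {
		return err
	}
	for {
		n, err := j.remaining(ctx, j.nodeID)
		if err != nil {
			return err
		}
		if err := j.progress.Record(ctx, j.nodeID, n, "waiting for replicas to move"); err != nil {
			return err
		}
		if n == 0 {
			// Finalizing (flipping the final "decommissioned" bit) would
			// happen here, server-side, instead of in the CLI loop.
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}

func (j *decommissionJob) Cancel(ctx context.Context) error {
	// Pausing/canceling the job is what un-marks the node.
	return j.liveness.SetDecommissioning(ctx, j.nodeID, false)
}

func main() {
	fmt.Println("sketch only; the real job interfaces live in pkg/jobs")
}
```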
To justify "lets us guarantee the decommissioning process proceeds even if the client goes away", quoting from an internal chat:
Is your feature request related to a problem? Please describe.
Decommissioning a node can get stuck. Since the process evacuates the node(s) being decommissioned, if there are no additional valid replication targets for the replicas they house, we'll fail to make progress with little observable indication as to why. The only progress indicator is a counter of the remaining replicas on the node being decommissioned, without any indication of which ranges these replicas belong to or why we're unable to evacuate them.
Things typically get stuck due to (i) zone configs constraining replica movement, (ii) not having enough nodes in the cluster (e.g. decommissioning 1 node out of a 3-node cluster), or (iii) the replicate queue not processing the stuck replicas quickly enough. Diagnosing a stuck decommissioning process today entails requesting debug zips, scanning logs, or enqueuing ranges manually after searching for the stuck ranges. All of this is time consuming and can be improved.
Describe the solution you'd like
Modelling decommissioning as a job: #74158 (comment).
A few ideas for data to surface:
Epic CRDB-10792
Jira issue: CRDB-11945