
kv,server: improve observability into "stuck" decommissioning process #74158

Closed
irfansharif opened this issue Dec 21, 2021 · 27 comments
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. T-kv KV Team

Comments

@irfansharif
Contributor

irfansharif commented Dec 21, 2021

Is your feature request related to a problem? Please describe.

Decommissioning a node can get stuck. Since decommissioning evacuates the node(s) being decommissioned, if there are no additional valid replication targets for the replicas they house, we'll fail to make progress with little observable indication as to why. The only progress indicator is a counter of the remaining replicas on the node being decommissioned, without any indication of what ranges those replicas belong to or why we're unable to evacuate them.

  id | is_live | replicas | is_decommissioning |   membership    | is_draining
-----+---------+----------+--------------------+-----------------+--------------
   1 |  true   |       27 |        true        | decommissioning |    false
(1 row)
.........
  id | is_live | replicas | is_decommissioning |   membership    | is_draining
-----+---------+----------+--------------------+-----------------+--------------
   1 |  true   |       26 |        true        | decommissioning |    false
(1 row)
........
  id | is_live | replicas | is_decommissioning |   membership    | is_draining
-----+---------+----------+--------------------+-----------------+--------------
   1 |  true   |       25 |        true        | decommissioning |    false

Things typically get stuck due to (i) zone configs constraining replica movement, (ii) not having enough nodes in the cluster (e.g. decommissioning 1 node out of a 3-node cluster), or (iii) the replicate queue not processing the stuck replicas quickly enough. Diagnosing a stuck decommissioning process today entails requesting debug zips, scanning logs, or enqueuing ranges manually after searching for the stuck ranges. All of this is time-consuming and can be improved.
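
As an illustration of the manual search mentioned above, the queries operators typically end up running by hand look roughly like this (a sketch only: node ID 1 stands in for the decommissioning node, and the exact semantics of the replicas column can vary slightly across versions):

-- List the ranges that still have a replica on the decommissioning node.
SELECT range_id, start_pretty, end_pretty, replicas
FROM crdb_internal.ranges
WHERE array_position(replicas, 1) IS NOT NULL;

-- Zone configs are a common culprit, so they usually get reviewed alongside
-- the stuck ranges to see whether a constraint rules out every candidate target.
SHOW ALL ZONE CONFIGURATIONS;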

Describe the solution you'd like

Modelling decommissioning as a job: #74158 (comment).

A few ideas for data to surface:

  • Which replicas/ranges are stuck.
  • Why they're stuck (perhaps surfacing actual upreplication attempt errors from down in KV, or more verbosely logging the errors).
  • (optional) Automatic best-effort recovery (if we end up always enqueuing replicas by hand, there's no reason not to do it automatically).

Epic CRDB-10792

Jira issue: CRDB-11945

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-decom-rolling-restart Decommission and Rolling Restarts labels Dec 21, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 21, 2021
@knz
Contributor

knz commented Dec 21, 2021

more verbosely logging the errors

When @cameronnunez and I investigated this, we discovered that the node being decommissioned is not the node responsible for making placement decisions, so increasing that node's logging verbosity to the maximum actually doesn't help.

So the logging output that's needed here has to come from some other node in the cluster. Unsure whether logging is really the instrument that's going to help here.

@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Dec 21, 2021
@irfansharif
Contributor Author

irfansharif commented Dec 21, 2021

Unsure whether logging is really the instrument that's going to help here.

Agreed, it's also desirable not to have to stare at logs to see what's stuck/why -- just wanted to record it for posterity. I'd like the outcome of this issue to be structured output (powered by a pull-based API) that indicates why each stuck replica is stuck, even if in the short term we improve only logging for debuggability. For as frequent an operation as decommissioning, it should effectively be self-service.

@knz
Contributor

knz commented Dec 21, 2021

cc @mwang1026 for tracking. The jira ticket is CRDB-11945.

@joshimhoff
Collaborator

This ate up a bunch of SRE time today as well. Root cause in the end was a zone config. Big +1.

@tbg
Member

tbg commented Dec 22, 2021

I wonder if we should "just" use tracing and perhaps even jobs here. The idea being that the decommissioning process becomes a job and that it inherits the observability properties that come with that. (I think jobs also support getting traced, though I have only found this; I think there was more, including fetching in-flight traces for the job.)
One difficulty (other than this requiring a good amount of planning and eng work) is to make sure that actions the replicate queue undertakes as a result of a node decommissioning are reported under the right trace. We could stash a TraceID on the liveness record (augmenting the Decommissioning field) but tracing currently has no model to send trace events to a remote gateway without relying on some existing means of talking to the root span. Still, depending on the state of traceability of jobs, it might be enough.

@joshimhoff
Collaborator

I like that idea. Cloud will eventually invest in tracing infra. The more we use tracing, the more that investment pays off.

@irfansharif
Contributor Author

irfansharif commented Dec 22, 2021

I think jobs also support getting traced, though I have only found this; I think there was more, including fetching in-flight traces for the job

I think you're looking for this:

// crdbInternalClusterInflightTracesTable exposes cluster-wide inflight spans
// for a trace_id.
//
// crdbInternalClusterInflightTracesTable is an indexed, virtual table that only
// returns rows when accessed with an index constraint specifying the trace_id
// for which inflight spans need to be aggregated from all nodes in the cluster.
//
// Each row in the virtual table corresponds to a single `tracing.Recording` on
// a particular node. A `tracing.Recording` is the trace of a single operation
// rooted at a root span on that node. Under the hood, the virtual table
// contacts all "live" nodes in the cluster via the trace collector which
// streams back a recording at a time.
//
// The underlying trace collector only buffers recordings one node at a time.
// The virtual table also produces rows lazily, i.e. as and when they are
// consumed by the consumer. Therefore, the memory overhead of querying this
// table will be the size of all the `tracing.Recordings` of a particular
// `trace_id` on a single node in the cluster. Each `tracing.Recording` has its
// own memory protections via ring buffers, and so we do not expect this
// overhead to grow in an unbounded manner.
var crdbInternalClusterInflightTracesTable = virtualSchemaTable{
	comment: `traces for in-flight spans across all nodes in the cluster (cluster RPC; expensive!)`,
	schema: `
CREATE TABLE crdb_internal.cluster_inflight_traces (
  trace_id INT NOT NULL,        -- The trace's ID.
  node_id INT NOT NULL,         -- The node's ID.
  root_op_name STRING NOT NULL, -- The operation name of the root span in the current trace.
  trace_str STRING NULL,        -- human readable representation of the traced remote operation.
  jaeger_json STRING NULL,      -- Jaeger JSON representation of the traced remote operation.
  INDEX(trace_id)
)`,
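
For reference, since the virtual table is index-constrained on trace_id, pulling the cluster-wide recordings for a given trace looks something like the following (1234567890 is just a placeholder trace ID):

SELECT node_id, root_op_name, trace_str
FROM crdb_internal.cluster_inflight_traces
WHERE trace_id = 1234567890;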

@tbg
Member

tbg commented Dec 22, 2021

I know about this one, but I thought there was something job-specific (probably built on top of the registry) to pull the job traces. I might be making that up though; perhaps @adityamaru knows?

@adityamaru
Contributor

adityamaru commented Dec 22, 2021

I thought there was something job-specific

Yeah, it's only enabled for jobs that choose to be TraceableJobs and implement this stub interface:

type TraceableJob interface {

For jobs that opt in, we force a real span on Resume that can be fetched at any point during the execution of the job using ./cockroach debug job-trace. This will dump a trace zip containing Jaeger traces of all in-flight spans spawned as part of this job's execution. Additionally, one can configure the job to dump its trace on reaching a terminal state, i.e. success or failure.

I'm a big +1 on getting more users to use this framework since it will result in it reaching its full potential sooner. Ease of consumption of these traces and real-world mileage on what is actually useful for debugging a stuck job are still lacking.

@andreimatei
Contributor

One difficulty (other than this requiring a good amount of planning and eng work) is to make sure that actions the replicate queue undertakes as a result of a node decommissioning are reported under the right trace. We could stash a TraceID on the liveness record (augmenting the Decommissioning field) but tracing currently has no model to send trace events to a remote gateway without relying on some existing means of talking to the root span.

In this decommissioning process we're talking about a single node, not multiple, right? Like, the relevant replication queue actions are happening on the node that's decommissioning, which makes the problem of attaching them to the right trace simpler, right?


I've been working to make tracing the answer to observing any ongoing process, as that process is running. I haven't been making a lot of noise because multiple pieces need to come together for that, but they're coming together. One such piece is a UI to show you traces of ongoing operations, and the beginning of that is #74318. That page is simple at the moment, but the sky's the limit. Tracing is about to become awesome.
Things are far enough along that I'd say to forget about what jobs add on top of tracing; let's make tracing enough (and nice) for use cases like the one discussed here. My hope is to make jobs themselves be sufficiently served by the vanilla tracing machinery (in fact, much better served than they are currently in the form of cockroach debug job-trace).

@knz
Contributor

knz commented Jan 4, 2022

the relevant replication queue actions are happening on the node that's decommissioning

That's not what I've been seeing FYI.

@andreimatei
Contributor

the relevant replication queue actions are happening on the node that's decommissioning

That's not what I've been seeing FYI.

Oh, you mean for ranges with leaseholders not on the decommissioning node (which perhaps are all ranges, because the decommissioning node perhaps sheds the leases?), it's the replication queue on the leaseholder that needs to move the decommissioning replica away?
I see.

tracing currently has no model to send trace events to a remote gateway without relying on some existing means of talking to the root span.

I could help with this part. If we find a way to plumb a SpanContext to a remote operation, I'm thinking the node of the child span can push the child recording to the parent if the parent is recording.
Whether or not collecting the recordings of remote queue actions is the right answer here, I'm not sure though. An alternative would be to build some self-reflection into the decommissioning process, which would try to diagnose itself. In particular, if the number of replicas on the decommissioning node seems stuck, I'm thinking the diagnostics process would manually enqueue one of the ranges periodically to see what happens.

cameronnunez added a commit to cameronnunez/cockroach that referenced this issue Jan 24, 2022
…stalls

Debugging graceful drain is challenging due to limited information. It is not clear what
is going wrong from the logs. This patch adds more detailed logging when graceful drain
range lease transfer encounters issues. Specifically, range lease transfer attempt
information (when an attempt occurs, duration blocked on transfer attempt, and the lease
that is being transferred) is logged.

The reason behind the failure of the transfer attempt may not be logged on the
draining node. This is because this information lies on the node that serves the draining
request. Currently, we do not have logging that aggregates all of this information
in one place. See cockroachdb#74158.

Release note (cli change): If graceful drain range lease transfer encounters issues,
verbose logging is automatically set to help with troubleshooting.
craig bot pushed a commit that referenced this issue Jan 25, 2022
68488: server: add verbose logging when graceful drain range lease transfer stalls r=knz a=cameronnunez

Informs #65659.

Debugging graceful drain is challenging due to limited information. It is not clear what
is going wrong from the logs. This patch adds more detailed logging when graceful drain
range lease transfer encounters issues. Specifically, range lease transfer attempt
information (when an attempt occurs, duration blocked on transfer attempt, and the lease
that is being transferred) is logged.

The reason behind the failure of the transfer attempt may not be logged on the
draining node. This is because this information lies on the node that serves the draining
request. Currently, we do not have logging that aggregates all of this information
in one place. See #74158.

Release note (cli change): If graceful drain range lease transfer encounters issues,
verbose logging is automatically set to help with troubleshooting.

75271: sql,server: create toString to role option r=maryliag a=maryliag

Previously, the code was using roleOption.String() to
create a string to be used in queries. This was causing
an issue with role options that contain spaces,
e.g. the role option `VALIDUNTIL` was being translated to
`VALIDUNTIL` instead of the correct `VALID UNTIL`.
This commit updates the String() function to correctly
add the space in the `VALID UNTIL` case.

Release note (bug fix): Update the String() function of
roleOption to add a space on the role `VALID UNTIL`.

75293: backupccl: reuse allocated tenant prefix in rekey r=dt a=dt

Release note: none.

75303: migrations: make sure public role gets privs on public schema r=ajwerner a=rafiss

This matches up with how the public schema gets created in a
non-migration scenario.

Release note: None

Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@irfansharif
Contributor Author

The idea of recording replicate queue attempts post-decommission into a system table came up internally, to help with observability. For posterity, here are a few reasons not to do that:

  • We would need to inject a dependency into the KV server to be able to write using SQL, and it would probably have to be best-effort;
  • It can have write-amplification/fast-garbage-accrual issues and become a bottleneck when decommissioning many nodes;
  • The state would be persisted only to communicate with the decommission RPC and to end up in the debug zip for posterity; the latter is no better than structured logging, and the former would probably be better served by an explicit API without having to persist transient state to KV.

@irfansharif
Contributor Author

irfansharif commented Jan 26, 2022

As a strawman for what a "pull-based explicit API" could look like: when decommissioning a node, the CLI periodically issues idempotent decommission requests in order to retrieve the decreasing replica counts on the target nodes:

cockroach/pkg/cli/node.go

Lines 441 to 446 in 0efe128

for r := retry.StartWithCtx(ctx, opts); r.Next(); {
	req := &serverpb.DecommissionRequest{
		NodeIDs:          nodeIDs,
		TargetMembership: livenesspb.MembershipStatus_DECOMMISSIONING,
	}
	resp, err := c.Decommission(ctx, req)

On the server side we're doing the trivial thing of scanning meta2 to determine what the replica count is:

cockroach/pkg/server/admin.go

Lines 2206 to 2232 in 785af46

// Compute the replica counts for the target nodes only. This map doubles as
// a lookup table to check whether we care about a given node.
var replicaCounts map[roachpb.NodeID]int64
if err := s.server.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	const pageSize = 10000
	replicaCounts = make(map[roachpb.NodeID]int64)
	for _, nodeID := range nodeIDs {
		replicaCounts[nodeID] = 0
	}
	return txn.Iterate(ctx, keys.Meta2Prefix, keys.MetaMax, pageSize,
		func(rows []kv.KeyValue) error {
			rangeDesc := roachpb.RangeDescriptor{}
			for _, row := range rows {
				if err := row.ValueProto(&rangeDesc); err != nil {
					return errors.Wrapf(err, "%s: unable to unmarshal range descriptor", row.Key)
				}
				for _, r := range rangeDesc.Replicas().Descriptors() {
					if _, ok := replicaCounts[r.NodeID]; ok {
						replicaCounts[r.NodeID]++
					}
				}
			}
			return nil
		})
}); err != nil {
	return nil, err
}

We know the "stuck" replicas, and we also know where they're stuck. If our manual mitigation is always "enqueue the stuck replica on the node where it's stuck", then this would be one place to do it, and to report structured errors about why it's unable to proceed back up to the CLI to render. This is just a strawman; we could probably do other things. Making decommissioning a long-lived job with its own internal logging/tracing is definitely another approach; I'm not opposed to it, it just feels more heavyweight. Given the number of escalations that boil down to just doing the enqueue-on-the-right-replicate-queue-through-the-UI-for-errors dance, this feels like low-hanging fruit to pick off.

@irfansharif
Contributor Author

I see the server team recently improved logging for the drain process: #68488. In the very short term, should we do something similar for the decommission process as well?

@joshimhoff
Collaborator

Seems like a good short term change to me.

cameronnunez added a commit to cameronnunez/cockroach that referenced this issue Mar 2, 2022
…ssioning

Fixes cockroachdb#76249. Informs cockroachdb#74158.

This patch makes it so that when a decommission stalls, the descriptions of
the "stuck" replicas are printed to the operator.

Release note (cli change): If decommissioning stalls, the replicas that are
failing to move are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality
@exalate-issue-sync exalate-issue-sync bot added T-server-and-security DB Server & Security and removed T-kv-observability labels Mar 7, 2022
@exalate-issue-sync exalate-issue-sync bot assigned cameronnunez and unassigned zachlite Mar 7, 2022
@aayushshah15
Contributor

This came up again in my discussions with @shralex, and I'd like to reiterate the proposal for maintaining a crdb_internal system table that holds (a sample of) historical allocator traces.

I'm proposing a table that's essentially a circular buffer of traces covering either the past X hours or up to Y entries (per operation) -- just something to address unbounded growth if that's a concern. Often, we've been confused about the allocator's decisions in that we don't know why it isn't rebalancing replicas in the way we expect, or why it is rebalancing replicas in some unexpected way. Maintaining an internal system table for retrospective debugging should address these issues and let us answer questions like "why isn't this node being fully decommissioned" by defining predicates over this system table. We can also add traces from this system table to a debug zip to avoid back-and-forth with customers.
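
Purely as a sketch of what such a buffer could look like (none of these names exist today; the table and column names below are made up for illustration), it might be a virtual table along these lines:

CREATE TABLE crdb_internal.allocator_trace_events (
  collected_at TIMESTAMPTZ NOT NULL, -- when the allocator attempt happened
  range_id     INT NOT NULL,         -- range the attempt was for
  node_id      INT NOT NULL,         -- node whose replicate queue acted
  operation    STRING NOT NULL,      -- e.g. "replace decommissioning voter"
  outcome      STRING NOT NULL,      -- success, or the allocator error encountered
  trace        STRING NULL,          -- (truncated) allocator trace for the attempt
  INDEX (range_id, collected_at)
);

A predicate like "ranges whose recent attempts all ended in a constraint error" over such a table would answer the "why isn't this node being fully decommissioned" question directly.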

craig bot pushed a commit that referenced this issue Mar 28, 2022
76516: server: improve visibility of ranges that fail to move during decommissioning r=knz,aayushshah15 a=cameronnunez

Fixes #76249. Informs #74158.

This patch makes it so that when a decommission is slow or stalls, the 
descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning
replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality

78433: ci: report failure to generate code as a test failure to teamcity r=rail a=rickystewart

Closes #78368.

Release note: None

78495: cluster: Revert "cluster: use WaitConditionNextExit" r=otan a=rickystewart

This reverts commit 6543749.
That commit was an (unsuccessful) attempt to fix #58955, and in the
presence of this change the `acceptance` tests are very likely to hang
forever under Ubuntu 20.04 due to a race condition where the container
exits before we begin waiting on it.

Release note: None

78525: dev: add `ui clean` subcommand, update `Makefile` to point to `dev` r=rail a=rickystewart

Add `ui clean` and `ui clean --all`; the former does approximately the
same as `make ui-clean`, the latter does approximately the same as
`make ui-maintainer-clean`.

Release note: None

78561: sqlsmith: fix deadlock during generation r=rharding6373 a=rharding6373

Fixes a deadlock when sqlsmith generates join expressions.

Fixes: #78555

Release note: None

78602: roachpb,kvserver: tweak ReplicaUnavailableError rendering r=erikgrinaker a=tbg

Release justification: minor UX tweak to an error message
Release note: None


78617: roachtest: wait for 3x replication in disk-full r=tbg a=nicktrav

Improve on #78456 by waiting for 3x replication, rather than 2x.

Release note: None.

78629: ci: make sure we're streaming test output when `stress`ing via Bazel r=mari-crl a=rickystewart

Without this, you don't see the helpful "ran X iterations so far, Y
failures" messages from `stress`.

Release note: None

Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
@irfansharif
Contributor Author

irfansharif commented Jun 6, 2022

Strawman for decommission: instantiate a job per node being decommissioned, and have that job code be responsible for flipping the liveness decommissioning bit, or flipping it back if the job is paused/canceled. It would also give you a place to stash decommission progress/observability state, and a great way to go fetch that data on demand — the job could enqueue replicas wherever it needs to get whatever traces it wants, and stash those (or some interesting subset) as part of the job record. Cleanup of this observability state is well defined — it gets cleared when the job itself gets cleaned up. We don’t have to roll out a custom system table and worry about state management; we just have to make sure we're not creating too much MVCC garbage when persisting this state — a concern we’d have even if we did roll out our own table. It also has the benefit of appearing in our UI.
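
To make the "appears in our UI" point concrete (purely hypothetical: no DECOMMISSION job type exists today), progress would then be queryable the same way any other job is:

SELECT job_id, status, running_status, fraction_completed
FROM [SHOW JOBS]
WHERE job_type = 'DECOMMISSION';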

Also, the way we finalize a decommission today is maybe a bit weird? We’re doing all this orchestration over in the CLI code where we keep issuing idempotent decommission RPCs and wait for a zero replica count before issuing a final RPC to flip the final bit:

cockroach/pkg/cli/node.go

Lines 530 to 559 in a21eab1

if !anyActive && replicaCount == 0 {
	// We now drain the nodes in order to close all SQL connections.
	// Note: iteration is not necessary here since there are no remaining leases
	// on the decommissioning node after replica transferral.
	for _, targetNode := range nodeIDs {
		if targetNode == localNodeID {
			// Skip the draining step for the node serving the request, if it is a target node.
			log.Warningf(ctx,
				"skipping drain step for node n%d; it is decommissioning and serving the request",
				localNodeID,
			)
			continue
		}
		drainReq := &serverpb.DrainRequest{
			Shutdown: false,
			DoDrain:  true,
			NodeId:   targetNode.String(),
		}
		if _, err = c.Drain(ctx, drainReq); err != nil {
			fmt.Fprintln(stderr)
			return errors.Wrapf(err, "while trying to drain n%d", targetNode)
		}
	}
	// Finally, mark the nodes as fully decommissioned.
	decommissionReq := &serverpb.DecommissionRequest{
		NodeIDs:          nodeIDs,
		TargetMembership: livenesspb.MembershipStatus_DECOMMISSIONED,
	}
	_, err = c.Decommission(ctx, decommissionReq)

This part at least feels more reasonable to do as part of a job process, and lets us guarantee the decommissioning process proceeds even if the client goes away. Today, if the client goes away, we'd still be upreplicating away from the decommissioning node (individual stores are still reacting to the liveness record having the "decommissioning" marker) but we'd skip finalizing the bit, which is perhaps a paper cut. In escalations it shows up as nodes that have long been removed (but are still in the decommissioning state) that later stall upgrades, or show up in the UI as dead nodes. Some internal discussion here and here.
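
As a rough way to spot that paper cut (a sketch; the exact crdb_internal.gossip_liveness columns can differ across versions), nodes left in this state show up as liveness records whose membership never flipped to "decommissioned":

-- Nodes still marked "decommissioning" can stall upgrades and linger in the
-- UI even though their replicas may be long gone.
SELECT node_id, membership, draining, updated_at
FROM crdb_internal.gossip_liveness
WHERE membership = 'decommissioning';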

@irfansharif
Contributor Author

To justify "lets us guarantee the decommissioning process proceeds even if the client goes away", quoting from an internal chat:

@mwang1026: I noticed in January that 5 support tickets (5!!!!!!!) were for upgrades that couldn't complete because a node was in decommissioning state and the user thought it had been fully decommissioned

@blathers-crl blathers-crl bot added the T-kv KV Team label Jun 16, 2022
@knz knz added T-kv-observability T-kv KV Team and removed T-server-and-security DB Server & Security T-kv KV Team T-kv-observability labels Jun 16, 2022